DotNetSlackers: ASP.NET News for lazy Developers

Friday, January 7, 2011

SQLite for C# – Part 9 – Creating native extension library for data quality testing


SQLite Extension libraries are a handy solution for users. These are dynamically loadable extensions that can be leveraged from either the command line (sqlite3[.exe]) or from within the linked in code. This means that to add, or use, functionality not already built into SQLite is now fairly simple and open to the masses. Unfortunately, this is not a well publicized feature :-(.
As an example, I was recently using SQLite to check some Data Quality (DQ) results that have been created from an outside group leveraging a commercial DQ tool I didn’t immediately have access to. The implementation I used to check the results obviously is extremely lightweight compared to the real dedicated DQ toolset, however it was a handy sanity check (and is more of a proof of concept of loadable extensions than anything DQ production worthy!).
In this case, I was checking the contents of certain fields in 3 CSV files. Since some of the checks spanned across ‘tables’, I loaded them into SQLite as a starting place. I would also point out that here the typeless nature of SQLite came into its own – I didn’t need to figure out the length of any column and could guess at the format with the assurance that data wouldn’t be lost by the database (try that in any other DB!).
Most of the checks I needed to do could be done with standard SQL. Examples included length checks, do the fields in one ‘table’ match the code control ‘table’, is one date always greater than another date. All this good normal stuff was trivially handled using base SQL. However one check was a little more tricky that gave me the excuse to try the extension library approach!
I needed to know if a text field contained all numeric values. My first attempt was to multiple the field by one, and compare the length. For example:
 Collapse
SELECT sample_c from SAMPLEDATA WHERE sample_c*1=sample_c
On the surface, this works multiplying the text “123? by 1 gives 123, while multiplying “123F2? * 1 gives 123. So the multiplication approach appears to work (at least at the high level – there are more than several issues with this approach)! However this really all falls apart when dealing with strings with leading zeros. After encountering this type of scenario, you can quickly start other issues. What we need is a function to look at the string and tell us the format of the string. While I could recompile SQLite to add in a new function, that seemed more than a little heavy handed, the ‘extension’ method seemed to fit the requirement perfectly.
Leveraging the loadable extension capabilities of SQLite, I’m going to add the following functions:
  • PATTERN – Looks at the data element and generates a matching pattern, more on this later
  • IMPLIEDTYPE – Looks at a column (in aggregate) and suggests the best type for it
So the PATTERN function looks to see if the information is a digit, alpha (a-z) or other. All digits are represented as “9?, and all alphas are “X”, and anything else is left as the original. So a date of “1900-01-01? would have a pattern of “9999-99-99?, and an amount “12.99? would be “99.99?, and a text code of “NAME” would become “XXXX”. This is a lightweight implementation of a PATTERN function that can help in quickly looking at patterns in the text data to see if there are any data elements that do not follow the expected pattern. A classic example for an untyped text file would be determining if the dates are all 9999-99-99, or if there are any 99/99/9999 values in there.
Generating this type of function requires the following construct:
 Collapse
/*
** The sampleFUNCTION() SQL function returns the implied pattern 
** for the information contents.
*/
static void sampleFunc(
sqlite3_context *context,
int argc,
sqlite3_value **argv
)
{

/* Get length of input parameter */
int n = sqlite3_value_bytes(argv[0]);

/* Do your custom work for the function*/

/* Return back the results of the information */
sqlite3_result_text(context, zPattern, i, SQLITE_TRANSIENT);
}
In essence, we will get some parameters passed in (accessed via “sqlite3_value_*” functions) and we will return the result back. The type of the results that are returned are defined by (documentation link):
  • sqlite3_result_text sqlite3_result_text16 sqlite3_result_text16le sqlite3_result_text16be
  • sqlite3_result_double
  • sqlite3_result_int sqlite3_result_int64
  • sqlite3_result_value
  • sqlite3_result_blob sqlite3_result_zeroblob
  • sqlite3_result_null
  • sqlite3_result_error sqlite3_result_error16 sqlite3_result_error_toobig sqlite3_result_error_nomem /sqlite3_result_error_code
The simplest example implementation of a loadable extension library can be found in the SQLite WIKI here, for simplicity it is included below:
 Collapse
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1

/*
** The half() SQL function returns half of its input value.
*/
static void halfFunc(
  sqlite3_context *context,
  int argc,
  sqlite3_value **argv
){
  sqlite3_result_double(context, 0.5*sqlite3_value_double(argv[0]));
}

/* SQLite invokes this routine once when it loads the extension.
** Create new functions, collating sequences, and virtual table
** modules here.  This is usually the only exported symbol in
** the shared library.
*/
int sqlite3_extension_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
  SQLITE_EXTENSION_INIT2(pApi)
  sqlite3_create_function(db, "half", 1, SQLITE_ANY, 0, halfFunc, 0, 0);
  return 0;
}
While this is an excellent starter (and I recommend reading the whole link – it's short and to the point) it is a little lightweight for anyone looking to do anything serious, but on the other hand it is probably the shortest working example that can get you started! Unfortunately, I have to say that this information is not easily found on the SQLite website, and after starting with the above, much of the code used for the DQ Proof of concept was achieved searching SQLite source code to get to the end example. Hence the reason for the article!
So the resulting “PATTERN” function ended up like:
 Collapse
/*
** The PATTERN() SQL function returns the implied pattern for the information contents.
*/
static void patternFunc(
  sqlite3_context *context,
  int argc,
  sqlite3_value **argv
)
{
   // General counter
   int i;
   char detectedType;

   /* allocate buffer to handle pattern */
   char zPattern[200];
   const unsigned char *pInputValue = sqlite3_value_text(argv[0]);

   for(i=0; i<sizeof(zPattern) - 1 && pInputValue[i]; i++)
   {
      if (pInputValue[i] >= '0' && pInputValue[i] <= '9')
      {
         detectedType='9';
      }
      else
      if ( (pInputValue[i] >= 'a' && pInputValue[i] <= 'z') ||
           (pInputValue[i] >= 'A' && pInputValue[i] <= 'Z')
         )
      {
         detectedType='X';
      }
      else
      {
         detectedType=pInputValue[i];
      }

      zPattern[i] = detectedType;
   }

   zPattern[i] = 0;

   sqlite3_result_text(context, zPattern, i, SQLITE_TRANSIENT);
}

/* SQLite invokes this routine once when it loads the extension.
** Create new functions, collating sequences, and virtual table
** modules here.  This is usually the only exported symbol in
** the shared library.
*/
int sqlite3_extension_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
)
{
  SQLITE_EXTENSION_INIT2(pApi)

  /* DB, Name, #args, eTextRep, pArg, Function,*/
  /* Create the function to be applied */
  sqlite3_create_function(db, "PATTERN", 1, SQLITE_ANY, 0, patternFunc, 0, 0);

  return 0;
}
So PATTERN is a simple dynamically loaded FUNCTION stored in a DLL (or loadable library). To test this, you firstly need to “.load” the dynamic library into the SQLite3.exe then run the function. Below is a working example taken from the run.sql example file in the download zip that loads the library then executes the PATTERN function from the library.
 Collapse
.load Profiler.sqlext
SELECT
  value_c || '-->' || PATTERN(value_c)
 ,value_i || '-->' || PATTERN(value_i)
 ,value_d || '-->' || PATTERN(value_d)
FROM
 SAMPLEDATA;
Another example of a loadable FUNCTION is an AGGREGATE function, this gets called over a set of data to return a result based of the results of the set. The most classic SQL example of this would be SUM. In this DQ case, I wanted to write a function to suggest a type for a column based off the data contents. For this, we would have to spin through all the matching rows determine its type then choose the best data type match. In the example below, the type of column “value_c” is determined from the table “SAMPLEDATA”, the second example is over a subset of values.
 Collapse
select IMPLIEDTYPE(value_c) from SAMPLEDATA;
select IMPLIEDTYPE(value_c) from SAMPLEDATA
   where value_c in ("123", "123.38");
My first thought was to wonder if I could do this using the standard per row FUNCTION, then aggregating & ordering the results of each function byMAX COUNT or some other text based function. However this became tricky when dealing with things like “NUMERIC(14,2)” & “NUMERIC(15,1)”, attempting to use text based rules on this would have just been wrong – so enter the AGGREGATE FUNCTION!
Aggregate functions allocate working memory using the SQLite3 “sqlite3_aggregate_context” function, and after this, the function can then use this context/scratch area to keep any statistics, averages or trees to help the processing of the data rows through the aggregate function. In this case, I’m keeping the type, precision & scale of the encountered data elements. Rather than copying out the code for the function in the article, you can find it in ProfilerExtension.c.
In addition to IMPLIEDTYPE this I added IMPLIEDTYPEXML that generates an XML snippet that helps show how the IMPLIEDTYPE function got to the answer. The IMPLIEDTYPEXML function returned an “XML’ized” version of the aggregate function storage area so you can see the various counts of the types.
 Collapse
<ImpliedType Type="NUMERIC(5,2)"
  int_count="1" int_min="123" int_max="123"
  num_count="1" num_min_integral_len="3"
  num_max_integral_len="3"
  num_min_scale_len="0" num_max_scale_len="2"
  string_count="0" string_min_len="3" string_max_len="6" />
Here we have processed one (1) int, one (1) numeric value and no strings. For clarification, a numeric value has a decimal point in it to differentiate from an int, and the string is a collection of characters that is not a int or a numeric! We use the int min and max to determine if the integer type should be a byteintsmallintinteger, or bigint – or even numeric. Finally after all the aggregation/processing is done, the finalizer function (in this case “impliedTypeFuncFinalize”) is called and it looks at the various stored values in the context block and uses the following logic to suggest the “implied type”.
  1. If any strings detected, then it is a string else
  2. If any numerics detected, then it is numeric else
  3. If any integers detected, then
    1. If max < 128 & min > -127 then byteint
    2. If max < 32767 & min > -32768 then smallint
    3. If max < 2147483647 & min > -2147483646 then integer
    4. If max < 9223372036854775807 & min > -9223372036854775806 then bigint
    5. else numeric
  4. Else “UNKNOWN
For mostly academic interest, I used “sqlite3_int64? types for counting the rows and tracking the int min and max numbers, the count is unlikely to be practical for most databases – although the min/max would have practical applications for bigint detection.
As mentioned before, this is meant to be used as a practical proof of concept only and I would not deem it even Beta quality without a quality test plan and thorough code review. However it should serve its purpose as a starting point for those looking to extend SQLite, and even as a throwaway concept showing how DQ could be embedded into a SQLite engine via the loadable extensions.
Hopefully, this helps with those looking to write or leverage the loadable extension functions that are readily available in SQLite. With the above help, the source code should make sense, and you can get on your way writing your own extensions.

No comments:

Post a Comment