SQLite Extension libraries are a handy solution for users. These are dynamically loadable extensions that can be leveraged from either the command line (
sqlite3[.exe]
) or from within the linked in code. This means that to add, or use, functionality not already built into SQLite is now fairly simple and open to the masses. Unfortunately, this is not a well publicized feature .As an example, I was recently using SQLite to check some Data Quality (DQ) results that have been created from an outside group leveraging a commercial DQ tool I didn’t immediately have access to. The implementation I used to check the results obviously is extremely lightweight compared to the real dedicated DQ toolset, however it was a handy sanity check (and is more of a proof of concept of loadable extensions than anything DQ production worthy!).
In this case, I was checking the contents of certain fields in 3 CSV files. Since some of the checks spanned across ‘tables’, I loaded them into SQLite as a starting place. I would also point out that here the typeless nature of SQLite came into its own – I didn’t need to figure out the length of any column and could guess at the format with the assurance that data wouldn’t be lost by the database (try that in any other DB!).
Most of the checks I needed to do could be done with standard SQL. Examples included length checks, do the fields in one ‘table’ match the code control ‘table’, is one date always greater than another date. All this good normal stuff was trivially handled using base SQL. However one check was a little more tricky that gave me the excuse to try the extension library approach!
I needed to know if a text field contained all numeric values. My first attempt was to multiple the field by one, and compare the length. For example:
Collapse
SELECT sample_c from SAMPLEDATA WHERE sample_c*1=sample_c
On the surface, this works multiplying the text “123? by 1 gives 123, while multiplying “123F2? * 1 gives 123. So the multiplication approach appears to work (at least at the high level – there are more than several issues with this approach)! However this really all falls apart when dealing with
string
s with leading zeros. After encountering this type of scenario, you can quickly start other issues. What we need is a function to look at the string
and tell us the format of the string
. While I could recompile SQLite to add in a new function, that seemed more than a little heavy handed, the ‘extension’ method seemed to fit the requirement perfectly.Leveraging the loadable extension capabilities of SQLite, I’m going to add the following functions:
PATTERN
– Looks at the data element and generates a matching pattern, more on this laterIMPLIEDTYPE
– Looks at a column (in aggregate) and suggests the best type for it
So the
PATTERN
function looks to see if the information is a digit, alpha (a-z) or other. All digits are represented as “9
?, and all alphas are “X
”, and anything else is left as the original. So a date of “1900-01-01
? would have a pattern of “9999-99-99
?, and an amount “12.99
? would be “99.99
?, and a text code of “NAME
” would become “XXXX
”. This is a lightweight implementation of a PATTERN
function that can help in quickly looking at patterns in the text data to see if there are any data elements that do not follow the expected pattern. A classic example for an untyped text file would be determining if the dates are all 9999-99-99
, or if there are any 99/99/9999
values in there.Generating this type of function requires the following construct:
Collapse
/*
** The sampleFUNCTION() SQL function returns the implied pattern
** for the information contents.
*/
static void sampleFunc(
sqlite3_context *context,
int argc,
sqlite3_value **argv
)
{
/* Get length of input parameter */
int n = sqlite3_value_bytes(argv[0]);
/* Do your custom work for the function*/
/* Return back the results of the information */
sqlite3_result_text(context, zPattern, i, SQLITE_TRANSIENT);
}
In essence, we will get some parameters passed in (accessed via “
sqlite3_value_*
” functions) and we will return the result back. The type of the results that are returned are defined by (documentation link):sqlite3_result_text
/sqlite3_result_text16
/sqlite3_result_text16le
/sqlite3_result_text16be
sqlite3_result_double
sqlite3_result_int
/sqlite3_result_int64
sqlite3_result_value
sqlite3_result_blob
/sqlite3_result_zeroblob
sqlite3_result_null
sqlite3_result_error
/sqlite3_result_error16
/sqlite3_result_error_toobig
/sqlite3_result_error_nomem
/sqlite3_result_error_code
The simplest example implementation of a loadable extension library can be found in the SQLite WIKI here, for simplicity it is included below:
Collapse
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1
/*
** The half() SQL function returns half of its input value.
*/
static void halfFunc(
sqlite3_context *context,
int argc,
sqlite3_value **argv
){
sqlite3_result_double(context, 0.5*sqlite3_value_double(argv[0]));
}
/* SQLite invokes this routine once when it loads the extension.
** Create new functions, collating sequences, and virtual table
** modules here. This is usually the only exported symbol in
** the shared library.
*/
int sqlite3_extension_init(
sqlite3 *db,
char **pzErrMsg,
const sqlite3_api_routines *pApi
){
SQLITE_EXTENSION_INIT2(pApi)
sqlite3_create_function(db, "half", 1, SQLITE_ANY, 0, halfFunc, 0, 0);
return 0;
}
While this is an excellent starter (and I recommend reading the whole link – it's short and to the point) it is a little lightweight for anyone looking to do anything serious, but on the other hand it is probably the shortest working example that can get you started! Unfortunately, I have to say that this information is not easily found on the SQLite website, and after starting with the above, much of the code used for the DQ Proof of concept was achieved searching SQLite source code to get to the end example. Hence the reason for the article!
So the resulting “
PATTERN
” function ended up like: Collapse
/*
** The PATTERN() SQL function returns the implied pattern for the information contents.
*/
static void patternFunc(
sqlite3_context *context,
int argc,
sqlite3_value **argv
)
{
// General counter
int i;
char detectedType;
/* allocate buffer to handle pattern */
char zPattern[200];
const unsigned char *pInputValue = sqlite3_value_text(argv[0]);
for(i=0; i<sizeof(zPattern) - 1 && pInputValue[i]; i++)
{
if (pInputValue[i] >= '0' && pInputValue[i] <= '9')
{
detectedType='9';
}
else
if ( (pInputValue[i] >= 'a' && pInputValue[i] <= 'z') ||
(pInputValue[i] >= 'A' && pInputValue[i] <= 'Z')
)
{
detectedType='X';
}
else
{
detectedType=pInputValue[i];
}
zPattern[i] = detectedType;
}
zPattern[i] = 0;
sqlite3_result_text(context, zPattern, i, SQLITE_TRANSIENT);
}
/* SQLite invokes this routine once when it loads the extension.
** Create new functions, collating sequences, and virtual table
** modules here. This is usually the only exported symbol in
** the shared library.
*/
int sqlite3_extension_init(
sqlite3 *db,
char **pzErrMsg,
const sqlite3_api_routines *pApi
)
{
SQLITE_EXTENSION_INIT2(pApi)
/* DB, Name, #args, eTextRep, pArg, Function,*/
/* Create the function to be applied */
sqlite3_create_function(db, "PATTERN", 1, SQLITE_ANY, 0, patternFunc, 0, 0);
return 0;
}
So
PATTERN
is a simple dynamically loaded FUNCTION
stored in a DLL (or loadable library). To test this, you firstly need to “.load
” the dynamic library into the SQLite3.exe then run the function. Below is a working example taken from the run.sql example file in the download zip that loads the library then executes the PATTERN
function from the library. Collapse
.load Profiler.sqlext
SELECT
value_c || '-->' || PATTERN(value_c)
,value_i || '-->' || PATTERN(value_i)
,value_d || '-->' || PATTERN(value_d)
FROM
SAMPLEDATA;
Another example of a loadable
FUNCTION
is an AGGREGATE
function, this gets called over a set of data to return a result based of the results of the set. The most classic SQL example of this would be SUM
. In this DQ case, I wanted to write a function to suggest a type for a column based off the data contents. For this, we would have to spin through all the matching rows determine its type then choose the best data type match. In the example below, the type of column “value_c
” is determined from the table “SAMPLEDATA
”, the second example is over a subset of values. Collapse
select IMPLIEDTYPE(value_c) from SAMPLEDATA;
select IMPLIEDTYPE(value_c) from SAMPLEDATA
where value_c in ("123", "123.38");
My first thought was to wonder if I could do this using the standard per row
FUNCTION
, then aggregating & ordering the results of each function byMAX COUNT
or some other text based function. However this became tricky when dealing with things like “NUMERIC(14,2)
” & “NUMERIC(15,1)
”, attempting to use text based rules on this would have just been wrong – so enter the AGGREGATE FUNCTION!Aggregate functions allocate working memory using the SQLite3 “sqlite3_aggregate_context” function, and after this, the function can then use this context/scratch area to keep any statistics, averages or trees to help the processing of the data rows through the aggregate function. In this case, I’m keeping the type, precision & scale of the encountered data elements. Rather than copying out the code for the function in the article, you can find it in ProfilerExtension.c.
In addition to
IMPLIEDTYPE
this I added IMPLIEDTYPEXML
that generates an XML snippet that helps show how the IMPLIEDTYPE
function got to the answer. The IMPLIEDTYPEXML
function returned an “XML’ized” version of the aggregate function storage area so you can see the various counts of the types. Collapse
<ImpliedType Type="NUMERIC(5,2)"
int_count="1" int_min="123" int_max="123"
num_count="1" num_min_integral_len="3"
num_max_integral_len="3"
num_min_scale_len="0" num_max_scale_len="2"
string_count="0" string_min_len="3" string_max_len="6" />
Here we have processed one (1)
int
, one (1) numeric value and no string
s. For clarification, a numeric value has a decimal point in it to differentiate from an int
, and the string
is a collection of characters that is not a int
or a numeric! We use the int
min and max to determine if the integer type should be a byteint
, smallint
, integer
, or bigint
– or even numeric. Finally after all the aggregation/processing is done, the finalizer function (in this case “impliedTypeFuncFinalize
”) is called and it looks at the various stored values in the context block and uses the following logic to suggest the “implied type”.- If any
string
s detected, then it is astring
else - If any
numeric
s detected, then it isnumeric
else - If any integers detected, then
- If max < 128 & min > -127 then
byteint
- If max < 32767 & min > -32768 then
smallint
- If max < 2147483647 & min > -2147483646 then
integer
- If max < 9223372036854775807 & min > -9223372036854775806 then
bigint
- else
numeric
- If max < 128 & min > -127 then
- Else “
UNKNOWN
”
For mostly academic interest, I used “
sqlite3_int64
? types for counting the rows and tracking the int
min and max numbers, the count is unlikely to be practical for most databases – although the min/max would have practical applications for bigint
detection.As mentioned before, this is meant to be used as a practical proof of concept only and I would not deem it even Beta quality without a quality test plan and thorough code review. However it should serve its purpose as a starting point for those looking to extend SQLite, and even as a throwaway concept showing how DQ could be embedded into a SQLite engine via the loadable extensions.
Hopefully, this helps with those looking to write or leverage the loadable extension functions that are readily available in SQLite. With the above help, the source code should make sense, and you can get on your way writing your own extensions.
No comments:
Post a Comment