The SQVID config file¶
Every SQVID config file has basically two components: the [general]
section and all the other sections that describe the configuration of
respective validations.
The [general]
section¶
This section is generally used to set up parameters that affect all validations that are defined in a single SQVID config file.
A sample [general]
section may look as follows:
[general]
sqla = "sqlite:///test_sqvid_db.sqlite"
db_name = 'test_sqvid_db'
sqla
This parameter sets the so called Database URL that SQLAlchemy uses to connect to the database in which the data to be validated is located.
db_name
The name of the database in which the data to be validated is located. Although some connection engines would already have this specified in the
sqla
parameter, having it specified separately allows us much greater flexibility when generating validations.limit
By default,
sqvid
will report all rows that did not conform to a specific validation. This may not be desirable in all cases (i.e. when the expected amount of non-conforming rows is large)The limit can be specified on per validation config basis using the
limit
parameter, such as in the following:[general] sqla = "sqlite:///test_sqvid_db.sqlite" db_name = 'test_sqvid_db' limit = 50
Definition of validations¶
Using the db_name
mentioned above, validations are defined as arrays of
TOML tables in the format of $db_name.$table_name.$column_name
. Here
is a simple example:
[[test_sqvid_db.suppliers.SupplierID]]
validator = 'unique'
As we can see the $db_name
would in this case be test_sqvid_db
, the
suppliers
would be the $table_name
and SupplierID
the
$column_name
.
Note
the double brackets (that is [[
and ]]
) around
test_sqvid_db.suppliers.SupplierID
denote “arrays of tables”. It
means that test_sqvid_db.suppliers.SupplierID
will not be a single
table but rather an array. Among other things this allows us to define
various validations for a single column. Here is a quick example:
[[test_sqvid_db.suppliers.SupplierID]]
validator = 'unique'
[[test_sqvid_db.suppliers.SupplierID]]
validator = 'in_range'
args = {min = 1, max = 256}
This example defines two validations on the
test_sqvid_db.suppliers.SupplierID
column: one with the unique
validator and one with the in_range
validator.
Once the database, table and column in which the data we want to validate is define, we can specify which validator to use and with what parameters. This can be done using he following set of parameters:
validator
This (string) parameter needs to contain one of the validators described in Validators. For example:
[[test_sqvid_db.suppliers.SupplierID]] validator = 'unique'
args
Arguments to be passed to the specified validator. They expressed as a TOML table and TOML supports various ways of describing those. All of the following definitions are functionally equal:
[[test_sqvid_db.suppliers.SupplierID]] validator = 'in_range' args = {min = 1, max = 256} # is equivalent to [[test_sqvid_db.suppliers.SupplierID]] validator = 'in_range' args.min = 1 args.max = 256 # is again equivalen to [[test_sqvid_db.suppliers.SupplierID]] validator = 'in_range' [test_sqvid_db.suppliers.SupplierID.args] min = 1 max = 256
For short arguments (and short values) the first way is preferable. In all other cases the second one should make a bit more sense, especially in terms of readability.
custom_column
Validating a single columns is sometimes of little use as what we would really want to validate is result of combining multiple columns. That is exactly what
custom_column
is for: it allows for any custom SQL to be considered a “virtual” column onto which a validation can then be applied.Suppose we would like to check whether the combination of the
SupplierID
andSupplierName
is unique. We can easily do that using thecustom_column
parameter as in the example below:[[test_sqvid_db.suppliers.SupplierID]] validator = 'unique' custom_column = "SupplierID || '-' || SupplierName"
The fact that a custom column is used gets represented in the printed report as well. The validator definition above would have resulted in something like the following:
PASSED: Validation on [test_sqvid_db] suppliers.SupplierID (customized as 'SupplierID || '-' || SupplierName') of unique
severity
A “severity” of a validation defines what happens when it fails. It can be set to either
error
(the default) orwarn
. When set towarn
, the validation will still report the results (i.e. which rows did not meet the validation criteria) but the validation will otherwise behave as if it passed.Here is a quick example. Provided that the IDs stored in SupplierID would not have been unique, the following validator would only cause a warning – the
sqvid
call would still have ended with exit code 0, as if the validation has passed.[[test_sqvid_db.suppliers.SupplierID]] validator = 'in_range' severity = 'warn' args = {min = 3, max = 256}
Severity set to
warn
will also be presented insqvid
’s output – the sample output from the validation defined above can be found below:FAILED (WARN ONLY): Validation on [test_sqvid_db] suppliers.SupplierID of in_range({'min': 3, 'max': 256}) Offending 2 rows: +--------------+------------------------------+--------------------+------------------+---------------+--------------+-----------+------------------+ | SupplierID | SupplierName | ContactName | Address | City | PostalCode | Country | Phone | +--------------+------------------------------+--------------------+------------------+---------------+--------------+-----------+------------------+ | 1 | Exotic Liquid | Charlotte Cooper | 49 Gilbert St. | Londona | EC1 4SD | UK | (171) 555-2222 | | 2 | New Orleans Cajun Delights | Shelley Burke | P.O. Box 78934 | New Orleans | 70117 | USA | (100) 555-4822 | +--------------+------------------------------+--------------------+------------------+---------------+--------------+-----------+------------------+
report_columns
With tables that have a large number of columns, it becomes quite difficult to just print out all the columns relevant to a specific validation that failed.
All columns are printed (or reported) by default. Using the
report_columns
parameter a subset of the column list can be selected.For instance the following validation declaration
[[test_sqvid_db.suppliers.SupplierID]] validator = 'in_range' args = {min = 3, max = 256}
would result in the following output:
FAILED: Validation on [test_sqvid_db] suppliers.SupplierID of in_range({'min': 3, 'max': 256}) Offending 2 rows: +--------------+------------------------------+--------------------+------------------+---------------+--------------+-----------+------------------+ | SupplierID | SupplierName | ContactName | Address | City | PostalCode | Country | Phone | +--------------+------------------------------+--------------------+------------------+---------------+--------------+-----------+------------------+ | 1 | Exotic Liquid | Charlotte Cooper | 49 Gilbert St. | Londona | EC1 4SD | UK | (171) 555-2222 | | 2 | New Orleans Cajun Delights | Shelley Burke | P.O. Box 78934 | New Orleans | 70117 | USA | (100) 555-4822 | +--------------+------------------------------+--------------------+------------------+---------------+--------------+-----------+------------------+
Whereas the following validation
[[test_sqvid_db.suppliers.SupplierID]] validator = 'in_range' report_columns = [ 'SupplierID', 'SupplierName' ] args = {min = 3, max = 256}
would then output the following:
FAILED: Validation on [test_sqvid_db] suppliers.SupplierID of in_range({'min': 3, 'max': 256}) Offending 2 rows: +--------------+------------------------------+ | SupplierID | SupplierName | +--------------+------------------------------+ | 1 | Exotic Liquid | | 2 | New Orleans Cajun Delights | +--------------+------------------------------+
limit
The limit of rows to report can also be specified per particular validation. This value defaults to the
limit
set on the validation config level.For instance
[[test_sqvid_db.suppliers.SupplierID]] validator = 'in_range' report_columns = [ 'SupplierID', 'SupplierName' ] args = {min = 3, max = 256} limit = 1
would report only a single offending row.
This configuration option also allows one to selectively turn off the
limit
set on the validation config level. For instance[[test_sqvid_db.suppliers.SupplierID]] validator = 'in_range' report_columns = [ 'SupplierID', 'SupplierName' ] args = {min = 3, max = 256} limit = false
would report all offending rows, regardless of what the setting of
limit
in thegeneral
section was.