As data practitioners, we need to ensure that data is accurate in order to build trust in the analytics we deliver. There are many ways to identify these exceptions, but we need a scaleable approach when working with large amounts of data. We need a simple approach where a data practitioner can quickly analyze large datasets and identify these exceptions quickly. This is where dbt comes in.
dbt, also known as data build tool, is a data transformation tool that leverages templated SQL to transform and test data. It was developed by dbt Labs, formerly known as Fishtown Analytics. Data build tool is part of the modern data stack and helps practitioners apply software development best practices to data pipelines. Some of these best practices include code modularity, version control, and continuous testing via its built-in data quality framework. In this article, we will focus on how data can be tested with dbt via built-in functionality and with additional extensions. In future articles, we will dive deeper into each of these areas.
In dbt there are two categories of tests: Generic tests (formerly known as schema tests and Bespoke tests (formerly referred to as data tests). The difference between the two are that Generic tests are reusable functions for which you do not have to write the SQL query while Bespoke tests are custom tests you write when a generic test does not cover your use case. Regardless of the type of test, behind the scenes the process is the same, dbt will compile code to SQL and execute that against your database. If any rows are returned by the query, that indicates a failure.
dbt Core ships with four basic tests:
unique: is a test to verify that every value in a column (e.g. customer_id) contains unique values. This is useful for finding records that may inadvertently be duplicated in your data.
not_null: is a test to check that the values for a given column are always present. This can help you find cases where data in a column suddenly arrives without being populated.
accepted_values: this test is used to validate whether a set of values within a column is present. For example, in a column called payment_status, there can be values like pending, failed, accepted, rejected, etc. This test is used to verify that each row within the column contains one of the different payment statuses, but no other. This is useful to detect changes in the data like when a value gets changed such as accepted being replaced with approved.
relationships: these tests check referential integrity. This type of test is useful when you have related columns (e.g. product id) in two different tables. One table serves as the “parent” and the other is the “child” table. This is common when one table has a transaction and only lists a customer_id and the other table has the details for that customer. With this test we can verify that every row in the transaction table has a corresponding record in the dimension/details table. For example, if you have orders for customer_ids 1, 2, 3 we can validate that we have information about each of these customers.
Generic tests were enhanced in dbt version 0.20. Now a where clause can be added to a generic test to focus the test on a particular range of rows. This can be useful on large tables by limiting the test to recent data. Another improvement is the ability to define severity based on the number of exceptions. This would allow a test to pass with a warning below a certain threshold and become a true error above that threshold.
Bespoke/custom tests allow you to create tests when the generic ones(or the ones in the packages discussed below) do not meet your needs. These tests are simple SQL queries that express assertions about your data. An example of this type of test is that sales for one product should be within +/- 10% of another product. The SQL simply needs to return the rows that do not meet that condition.
In addition to the genetic tests that can be found within dbt Core, there are a lot more in the dbt ecosystem. These tests are found in dbt packages. Packages are libraries of reusable SQL code created by members of the dbt community. We will briefly go over some of the tests that can be found in these packages.
The dbt-utils package is a library that was created by dbt Labs. It contains special schema tests, SQL generators, and macros. Some of the tests in the dbt_utils package include:
not_accepted_values: this test is the opposite of the accepted_values test and is used to check that specific values are NOT present in a particular range of rows.
equal_rowcount: this test checks that two different tables have the same number of rows. This is a useful test that can assure that a transformation step does not accidentally introduce additional rows in the target table.
fewer_rows_than: this test is used to verify that a target table contains fewer rows than a source table. For example, if you are aggregating a table, you expect that the target table will have fewer rows than the table you are aggregating. This test can help you validate this condition.
There are several additional tests available in the dbt-utils package which can be found in the documentation.
Another package that can accelerate your data testing is dbt-expectations. This package is a port of the awesome Python library Great Expectations. For those not familiar, Great Expectations is an open-source Python library that is used for automated testing. dbt-expectations is modeled after this library and was developed by Calogica so dbt practitioners would have access to an additional set of pre-created Generic tests for dbt testing. Tests in dbt-expectations are divided into seven categories:
We will focus on the tests in each category in a future article.
Getting started with dbt testing is simple thanks to the predefined tests found within dbt Core and the additional tests found in dbt-utils and dbt-expectations, These can be used to assure various aspects of data quality using dbt. In all, dbt has 4 test, the dbt-utils package has 12 tests and dbt-expectations package has a total of 58 tests for a total of 74 tests we don’t need to write. When we need to deviate from those tests, we can create our own using Bespoke tests and if they are reusable, we can create reusable macros which we can share within our organization or with the dbt community. These tests can help you increase trust in your analytics by alerting you of error conditions before your users notice. To find out more about dbt and its capabilities, dbt Labs has a course that can introduce you to the basics of dbt for free here.