Data quality is often seen as the unglamorous component to working with data. Ironically, it’s usually the component that takes up the majority of our time. Data quality might very well be the single most important component of a data pipeline, since, without a level of confidence and reliability in your data, the dashboard and analysis generated from the data is useless.
The challenge with data quality is that there are no clear and simple formulas for determining if data is correct. Like a pond without a freshwater source, data can become stagnant quickly. Data that is correct today, might not be correct in a month. Fortunately, there are some fundamental techniques and approaches that can be broadly applied when validating data quality. The general theme of data quality is around finding outliers that do not meet specific requirements and record sets that violate business assumptions.
When working with moving data, data can be thought about in three separate layers: the ETL layer, the business layer, and the reporting layer. The ETL layer contains the code for data ingestion and data movement between a source system and a target system (for example from the application database to the data warehouse). The business layer sits between your raw ingested data and your final data models. Finally, the reporting layer contains the dashboards from which business users can view and interact with. In the following sections, we will cover the techniques that are applicable for each data layer.
The following examples are done in Postgres SQL. I will be using two tables, a Clicks table, and a Views table. The clicks table represents data about users that clicked a link on a webpage. The Views table represents data about users that have visited a webpage.
Data Quality in the ETL Layer
The first layer of a data pipeline is the ETL layer. The data quality checks in this layer are usually similar regardless of the business needs and differing industries. The goal here is to check to ensure that data is not lost or degraded while moving from the source to the target system. We check for things such as differences in row counts (showing data has been added or lost incorrectly), partially loaded datasets (usually with high null count), and duplicated records.
Example Using SQL
WITH source_count as (Select count(*) as total_count from source) ,
target_count as (Select count(*) as total_count from target)
SELECT CASE WHEN (select total_count from source_count) = (select total_count from target_count) THEN True ELSE False END as valid_row_count FROM pipe_table_1
In the above example, we are looking for a scenario where the counts from the source and target do not match. In most instances, the number of rows should stay consistent between both tables. This is to ensure that transformations that occur between tables do not unexpectedly change row counts. For instance, the number of users on your application should match between your application database and your data warehouse. There are also instances where the number of rows are expected to change. In this case, the goal should be to validate that counts fall within an expected range. Also, it’s important to note that common table expressions are used here to maintain readability in the SQL statement.
WITH join_count AS (
count(*) AS view_count
LEFT JOIN clicks ON clicks.view_id = views.id
CASE WHEN (SELECT view_count FROM join_count) = count(1) THEN True ELSE False END AS valid_join
In the example above, we are checking to ensure that the row count of the table after the left join is the same as the row count of the original table. This is to ensure that there are no duplicate keys on either table (this can usually be enforced by the database, but not when working with derived tables or views). Again, this constraint is not always true as some joins are expected to cause the rows in the joined record to increase or decrease, in which case it is necessary to understand the expected range of values.
Data Quality at the Business Layer
The next layer of a data pipelines is the business logic layer. Typically, data quality checks in this layer are applied after the raw/partially transformed data has been loaded to a staging area in your data warehouse. You do not want to load the data into the final destination until these checks are passed and invalid rows are inspected. The goal here is to ensure that fundamental business understandings are not violated and that the data makes business sense.
These business quality checks also act as a secondary check or safeguard to ensure that the data has been brought in to the database correctly from the ETL layer (as it can be extensive and impractical to validate every data quality measure in the ETL layer).
Business layer checks typically involve validating that numerical measures fall within a valid range as defined by business requirements.
SELECT count(*), year
JOIN date b on a.date _id = b.date_id
GROUP BY 1,2
HAVING count(*) > [EXPECTED_RANGE];
SELECT count(*), year, month
FROM Views a
JOIN date b on a.date _id = b.date_id
GROUP BY 1 2,3
HAVING count(*) > [EXPECTED_RANGE];
One technique is to observe a change in a given metric or measure over different time periods. In this example, we are looking at views over different years as well as over different months.
We are trying to identify a sudden unexpected spike or drop in user views. It is also necessary to distinguish between expected and unexpected changes. In our example we are looking at eyeballs, also referred to as user views, for an e-commerce site. One expected trend is for views to increase during the holiday seasons.
— Validating the number of views should always be greater than the number of clicks
count(a.num_view) >= count(b.num_clicks) THEN True ELSE False END as click_validation
LEFT JOIN clicks b ON b.click_id = a.click_id
Another commonly seen business validation is to check that values follow certain boundaries or certain business rule. In our case, the number of users who view a page should always be more or equal to the number of users who click on a link (since you can’t click a link before viewing the page first). Since these checks are heavily reliant on business constraints, it is important to also ensure that the business rules and assumptions are properly documented either in the code and/or in a metadata repository.
Data Quality at the Reporting Layer
The reporting layer is the final layer of a data pipeline. This is the layer that end users interact with your data. However, just because this layer is usually the end of the pipeline, does not mean the reporting layer should not be utilized by data engineers and data analyst to ensure data quality.
One simple technique is to plot your data points over different periods of time (year, month, day) and visually inspect for outliers. It is pretty common to catch data quality mistakes visually that might not easily be captured in your validation checks.
Another excellent use case for the reporting layer is to create data quality dashboards. You can combine multiple data quality checks together and have these metrics/graphs sent out to you on a daily basis before sending the data out to your executive/business users.
Fundamentally, data quality validations should be automated as much as possible. Validations should be embedded into the data pipeline code but in a manner that allows it to be effortlessly changed. Depending on the criticality of the data and validation, you may want your pipeline to either fail completely or to flag the issue, move records into a separate reject area, or continue processing.
There are tools, such as Trifacta, that will help to simplify and automate many of these data quality checks. Existing ETL tools, like Informatica, may also have data quality checks features already built in. However, it’s still important to understand how to implement data quality from scratch to ensure that the data quality checks you implement make sense.
When designing data pipelines, data quality should be a driving factor that heavily influences the development effort. Data engineers should have a level of familiarity with their data to be able to enhance and debug data quality issues. It’s important to note that data quality should not the responsibility of a single team or individual. In my opinion, it’s critical for engineers to be as familiar with the code as they are with the data they are working on.
This article provides a broad overview of data quality, techniques for how to monitor it, and strategies for actively working with it. As we stated earlier, data quality is a critical component of a data system. Investing the time and resources to work with data quality is important! Data quality should be placed at the same level of importance as unit testing is to software development.