Many organizations assume that their data is relatively clean, or at least of high enough quality to support accurate business decisions. However, data quality (DQ) is often an overlooked area of enterprise data management and governance. For an organization to evolve through the data maturity layers, DQ is a critical component.
Modern ecosystems such as data lakes make it easy to ingest data quickly. This data is often not adequately quality checked, which can lead to poor business decisions. Users can then lose trust in the data, and once that trust is lost, it is extremely difficult to re-establish. We will discuss challenges such as these and describe how Caserta can help guide an organization past these pitfalls.
Fundamentally, most organizations do not perceive data as an enterprise asset, so it does not get the attention and treatment it deserves. An early step in evolving into a data-driven organization, where data is considered and governed as an asset, is to establish a data quality measurement policy and practice.
Why Measure Data Quality
Very few organizations measure the cost or the value of their data assets. A slightly higher percentage, but still only 1 out of 5, say they measure the impact of data quality issues or improvements. As a consultant, I often hear the claim from clients that their data quality is good. I ask “How do you know?” More often than not, the answer is either “I guess we don’t” or “we reconciled rowcounts/totals to the system of record”. Note the past tense on the word “reconciled.”
While an initial high-level reconciliation is a good start, DQ measurement is an ongoing process, not a once-and-done activity. Furthermore, simply knowing that you successfully loaded all 100,000 rows and that they match a grand total of $25M does not mean you are in the clear. What if an account, cost center, product code, or customer number did not load correctly? What if the product or customer is missing from the product/customer master table, so the record is eliminated when joined?
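To make the point concrete, here is a minimal Python sketch of a load that passes a rowcount and total reconciliation but still loses a record downstream. The table contents, column names, and helper functions (reconcile, inner_join, find_orphans) are all hypothetical and exist only for illustration.

```python
# Hypothetical sample data: names and values are illustrative only.
source_orders = [
    {"order_id": 1, "customer_id": "C001", "amount": 100.0},
    {"order_id": 2, "customer_id": "C002", "amount": 250.0},
    {"order_id": 3, "customer_id": "C999", "amount": 50.0},  # C999 is not in the master
]
loaded_orders = list(source_orders)  # assume the load copied every row as-is
customer_master = {"C001": "Acme", "C002": "Globex"}


def reconcile(source, target):
    """High-level reconciliation: compare row counts and amount totals."""
    same_count = len(source) == len(target)
    same_total = sum(r["amount"] for r in source) == sum(r["amount"] for r in target)
    return same_count and same_total


def inner_join(orders, master):
    """Mimic an inner join: orders without a matching master row are dropped."""
    return [
        dict(o, customer_name=master[o["customer_id"]])
        for o in orders
        if o["customer_id"] in master
    ]


def find_orphans(orders, master):
    """Return orders whose customer_id has no parent row in the master."""
    return [o for o in orders if o["customer_id"] not in master]


reconciled = reconcile(source_orders, loaded_orders)
joined = inner_join(loaded_orders, customer_master)
orphan_orders = find_orphans(loaded_orders, customer_master)
reported_total = sum(o["amount"] for o in joined)

print("reconciled:", reconciled)
print("rows after join:", len(joined), "of", len(loaded_orders))
print("orphaned order ids:", [o["order_id"] for o in orphan_orders])
print("total seen by report:", reported_total)
```

The reconciliation succeeds because it only compares the loaded table to the source. The loss appears later, when the report joins to the customer master, which is why an orphan check on the join key is worth running alongside the totals.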
Six Types of Data Quality Checks
- Consistent Data: Check that data is consistent across applications. For example, flag customers whose addresses or phone numbers differ between the CRM and the ERP or billing system.
- Complete Data: Data completeness includes reporting on nulls or blanks, data that falls outside valid domains, and data missing from parent tables, for example an order with a customer number that doesn’t exist in the customer master table.
- Correct/Accurate Data: This is what most people think of when they hear data quality. It can include total rowcounts, sum() totals, checks for valid dates or valid numeric values, and boundary checks (for example, start date < end date, birth date > 01/01/1910, or a reasonableness test on a metric value).
- Change Audit Data: Data quality measuring includes providing auditability of what and when data was changed and by whom. This often includes capturing both the before and after versions of the data to enable point in time reporting.
- Unique/Non-dupes: Identify duplicate rows and, if possible, tag the reason for each duplicate. Common reasons include double loading, point-in-time snapshots that are not tagged, physical deletes in the source that are not captured downstream, and updates to PK fields within the source system (yes, I have run across this).
- Timely Data: Measuring and capturing the operational metadata around SLAs and the timeliness of data is often overlooked. Stale data can appear to a user as “poor quality” because it doesn’t match the current state of the operational system. Measuring this not only gives the user a window into the “load datetime” but can also help determine whether the load frequency is appropriate, for example whether intraday batches or near real-time updates are required instead of nightly batches.
Implementing repeatable checks on the types above and capturing the results will help an organization to report on and trend DQ issues over time. It will also provide metrics to show the business value achieved by a reduction in overall DQ issues.
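As a rough illustration of what "repeatable checks with captured results" could look like, the Python sketch below runs four of the check types above (complete, unique, correct, timely) against a small in-memory table and records each outcome with a run timestamp. Everything here is an assumption made for the example: the CheckResult structure, the check function names, the column names, and the 24-hour freshness threshold. In practice the results would be written to a table so they can be trended over time.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, datetime, timedelta


@dataclass
class CheckResult:
    """One row of DQ history: what was checked, when, and how many rows failed."""
    check_name: str
    run_at: datetime
    rows_checked: int
    rows_failed: int

    @property
    def failure_rate(self) -> float:
        return self.rows_failed / self.rows_checked if self.rows_checked else 0.0


def check_complete(rows, column, run_at):
    """Complete: count rows where the column is null or blank."""
    failed = sum(1 for r in rows if r.get(column) in (None, ""))
    return CheckResult(f"complete:{column}", run_at, len(rows), failed)


def check_unique(rows, key, run_at):
    """Unique: count the extra copies of any repeated key value."""
    counts = Counter(r[key] for r in rows)
    failed = sum(n - 1 for n in counts.values() if n > 1)
    return CheckResult(f"unique:{key}", run_at, len(rows), failed)


def check_date_order(rows, start_col, end_col, run_at):
    """Correct: boundary check that the start date is before the end date."""
    failed = sum(1 for r in rows if r[start_col] >= r[end_col])
    return CheckResult(f"order:{start_col}<{end_col}", run_at, len(rows), failed)


def check_timely(rows, loaded_col, max_age, run_at):
    """Timely: count rows whose load datetime is older than the allowed age."""
    failed = sum(1 for r in rows if run_at - r[loaded_col] > max_age)
    return CheckResult(f"timely:{loaded_col}", run_at, len(rows), failed)


# Hypothetical sample rows; each one trips a different check.
run_at = datetime(2024, 1, 2, 6, 0)
rows = [
    {"id": 1, "customer_id": "C001", "start": date(2023, 1, 1),
     "end": date(2023, 6, 1), "loaded_at": datetime(2024, 1, 2, 1, 0)},
    {"id": 2, "customer_id": None, "start": date(2023, 5, 1),
     "end": date(2023, 4, 1), "loaded_at": datetime(2024, 1, 2, 1, 0)},
    {"id": 2, "customer_id": "C003", "start": date(2023, 1, 1),
     "end": date(2023, 2, 1), "loaded_at": datetime(2023, 12, 30, 1, 0)},
]

results = [
    check_complete(rows, "customer_id", run_at),
    check_unique(rows, "id", run_at),
    check_date_order(rows, "start", "end", run_at),
    check_timely(rows, "loaded_at", timedelta(hours=24), run_at),
]

for r in results:
    print(f"{r.run_at:%Y-%m-%d %H:%M} {r.check_name}: "
          f"{r.rows_failed}/{r.rows_checked} failed ({r.failure_rate:.0%})")
```

Because every check returns the same shape of result, the consistency and change audit checks could be added in the same way, and the stored history gives a simple failure rate per check per run to report on.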