Welcome to the twelfth issue of Model Your Reality, a newsletter with musings about data modeling, data warehousing and the like.
Until further notice, each issue will consist of two parts:
a list of data events that might be of interest to you (they definitely are of interest to me) and
some thoughts about a certain data topic (like data vault modeling patterns).
Let’s get started!
Data Events
Recent
I’m still in the process of uploading recordings from this year’s Knowledge Gap data modeling & data architecture conference. They’ll keep appearing on my YouTube channel in the Knowledge Gap playlist over the next few weeks.
Upcoming
If you know about other relevant events in the near future, please mention them in the comments or send an email to admin@obaysch.net. Online events preferred.
2022-09-19: Data Modeling Meetup on SAP, data vault and Coalesce (online)
2022-10-04: Data Modeling Meetup on data products and data marts (online)
2022-10-05: UK Data Vault User Group on lessons from the field (online)
2022-10-06/13: Data Modeling Masterclass, TEDAMOH (in German, online)
2022-10-10/12: Data Vault Training (CDVDM), Genesee Academy (online)
2022-10-25: Data Modeling Meetup on Activity Schema (online)
2023-05-24/25/26: Knowledge Gap data modeling & data architecture conference (online)
Completeness & Consistency (Part 4)
Previous entries in the series can be found here, here and here.
This series assumes that you are using a data vault approach. If you are new to data vault, you might want to check out my Modern Data Warehousing with Data Vault series first.
The consumers of data from a data warehouse usually expect the data in the presentation layer to be clean, complete and consistent. In most cases, this expectation is somewhat at odds with the data quality found in the operational systems that feed the data warehouse.
Historically, data warehouse developers have tried to hide all these data quality issues from consumers and to deliver a sugarcoated version of the data, often at great expense. But a more nuanced approach tends to offer a better cost-benefit ratio. In this little series, we’re looking at examples of such an approach.
General Data Quality Issues
Apart from the simple timing issues discussed in the previous entry, you can encounter all kinds of data quality issues caused by errors in the source systems or in the business logic implemented in the data warehouse.
What Can Go Wrong
These data quality issues include but are not limited to:
Values that should be there might be missing.
Numbers might have been entered in the wrong field (e.g., turning a social security number into a monetary amount).
Parts of names or addresses might have been misspelled.
A calculation might have been implemented by a developer who wasn’t aware of some important special case.
A correction might have been loaded to the raw vault but wasn’t picked up by some of the downstream objects.
You should be able to detect these issues and have them fixed quickly (preferably at the source). For this purpose, it makes sense to create data quality satellites in your data vault and/or data quality screens in the presentation layer.
Managing Expectations
Regardless of where and how you store the results, you should define some rules about your expectations for good-quality data (one way to record such rules is sketched after this list):
Which columns shouldn’t be null at any time?
Is there a lower and/or an upper bound for the values in a column?
Is there a list of allowed values for a column?
Are there any dependencies between columns (e.g., if one column is null, another mustn’t be null, and vice versa)?
Are there any mandatory relationships between concepts (e.g., if there is a Financial Transaction, there must be a Bank Account for it)?
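One way to record such rules is as plain configuration data that a rule engine can interpret. Here’s a minimal sketch; the rule schema and all table and column names are hypothetical:

```python
# Minimal sketch of data quality rules captured as configuration data.
# The rule schema and all table/column names are hypothetical examples.
rules = [
    # A column that shouldn't be null at any time
    {"table": "sat_customer", "column": "last_name", "rule": "not_null"},
    # Lower and upper bounds for a numeric column
    {"table": "sat_account", "column": "balance", "rule": "range",
     "min": -1_000_000, "max": 1_000_000},
    # A list of allowed values
    {"table": "sat_customer", "column": "customer_type",
     "rule": "allowed_values", "values": ["B2B", "B2C"]},
    # A dependency between columns: if one is filled, the other must be, too
    {"table": "sat_address", "column": "postal_code",
     "rule": "filled_together", "other": "city"},
]
```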
Don’t try to compile and enforce an exhaustive list of rules immediately. Start with a small set of simple but important rules and add more later when necessary.
Try to automate the evaluation of data quality rules wherever possible: Instead of hand-coding dozens of not-null checks, generate them from a list of columns in a configuration table.
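As a minimal sketch of that idea, the following snippet generates not-null checks from a list of (table, column) pairs; in practice, the list would come from a configuration table in the warehouse, and all names here are hypothetical:

```python
# Minimal sketch: generate not-null checks from configuration instead
# of hand-coding them. All table/column names are hypothetical.
from typing import List, Tuple

def generate_not_null_checks(config: List[Tuple[str, str]]) -> List[str]:
    """Turn (table, column) pairs into SQL queries that count violations."""
    return [
        f"SELECT '{table}.{column}' AS rule_id, COUNT(*) AS violations "
        f"FROM {table} WHERE {column} IS NULL"
        for table, column in config
    ]

# In practice this configuration would be read from a table in the warehouse.
config = [("sat_customer_details", "last_name"), ("sat_account_details", "iban")]
for check in generate_not_null_checks(config):
    print(check)
```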
Implementing Data Quality Handling
If you want to bake data quality into your data vault, consider storing the results of these rules in dedicated data quality satellites. You can choose from several implementation options:
Add satellites to the hubs for which you want to execute data quality checks with one column for each rule. If you add a rule, you add a column.
Add multi-active satellites to the hubs for which you want to execute data quality checks, with a single column that stores the results of all data quality rules. If you add a rule, you just load more records to the multi-active satellite (a sketch of this option follows the list).
Add a dedicated data quality rule hub and link it to all hubs for which you want to execute data quality checks. Store the results of the data quality rules in link satellites.
Add a dedicated data quality rule hub and link it to all hubs for which you want to execute data quality checks. Store the results of the data quality rules in satellites attached to keyed-instance hubs (e.g., Customer Data Quality Rule Evaluation).
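To illustrate the second option, here is a minimal sketch of the records that could be loaded into a multi-active data quality satellite; the column names loosely follow common data vault conventions but are assumptions, not prescribed by the method:

```python
# Minimal sketch of records for a multi-active data quality satellite
# (the second option above): one record per rule evaluation, so adding
# a rule just adds records. Column names are illustrative assumptions.
from datetime import datetime, timezone

def dq_satellite_records(hub_key: str, rule_results: dict) -> list:
    """Build one satellite record per data quality rule result."""
    load_dts = datetime.now(timezone.utc)
    return [
        {
            "customer_hk": hub_key,    # hash key of the parent hub
            "load_dts": load_dts,      # load timestamp
            "dq_rule_id": rule_id,     # multi-active (sub-sequence) key
            "dq_rule_passed": passed,  # result of the evaluation
        }
        for rule_id, passed in rule_results.items()
    ]

records = dq_satellite_records(
    "a1b2c3", {"last_name_not_null": True, "iban_format": False}
)
print(records)
```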
Presenting Data Quality (and Quality Data)
In addition, you might want to aggregate the results from all the data quality rules for a hub and only load those instances to the presentation layer that pass all the checks (or at least all the important ones).
Alternatively, you can use these aggregate results to label instances in the presentation layer (e.g., using a traffic-light system where green indicates good quality, yellow indicates minor data quality issues and red indicates major ones).
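A minimal sketch of such a traffic-light aggregation, assuming each instance’s rule results are available as a dictionary and that some rules have been flagged as major (both the rule names and the classification are illustrative):

```python
# Minimal sketch of aggregating rule results into a traffic light per
# hub instance. Which rules count as "major" is an illustrative choice.
MAJOR_RULES = {"iban_format"}

def traffic_light(rule_results: dict) -> str:
    """Map one instance's rule results to green/yellow/red."""
    failed = {rule for rule, passed in rule_results.items() if not passed}
    if failed & MAJOR_RULES:
        return "red"     # major data quality issues
    if failed:
        return "yellow"  # minor data quality issues
    return "green"       # all checks passed

print(traffic_light({"last_name_not_null": True, "iban_format": False}))  # red
```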
Either instead of data quality satellites or as a more user-friendly interface for them, you can create data quality screens in the presentation layer. Depending on the requirements of those who will work with them, these screens can use different presentation modes (see here for more details on the different modes).
Presentation layer quality screens can be implemented as simple lists containing identifiers, suspect attribute values and the reason(s) why they are suspect, or as elaborate dimensional models (with data quality rules as one of the dimensions).
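For the simple-list variant, here is a sketch of a query that could sit behind such a screen, reusing the hypothetical data quality satellite from above; all table and column names are assumptions:

```python
# Minimal sketch of a query behind a simple-list data quality screen:
# one row per failed check, with the identifier and the failed rule.
# All table and column names are hypothetical.
screen_query = """
SELECT c.customer_id,
       r.dq_rule_id AS failed_rule,
       r.load_dts
FROM   dim_customer c
JOIN   sat_customer_dq r
       ON r.customer_hk = c.customer_hk
WHERE  r.dq_rule_passed = FALSE
ORDER  BY c.customer_id, r.dq_rule_id
"""
print(screen_query)
```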
Finally, raw marts can be a data quality tool as well. If operational systems make it hard to identify data quality issues, presenting all the raw data from an operational system in a nice, understandable format might be exactly what you need to find these issues.
Outlook
In one of the next issues of Model Your Reality, we’ll finish this little series with a closer look at error handling. After you have read all the entries, you will hopefully be able to achieve a reasonable level of completeness and consistency in your data warehouse with a reasonable amount of effort.