Completeness & Consistency (Part 1)
Welcome to the seventh issue of Model Your Reality, a newsletter with musings about data modeling, data warehousing and the like.
Until further notice, each issue will consist of two parts:
a list of data events that might be of interest to you (they definitely are of interest to me) and
some thoughts about a certain data topic (like data vault modeling patterns).
Let’s get started!
Data Events
Recent
In case you missed the Data Modeling Meetup with Ronald G. Ross on how to tackle the real reasons for data quality issues, you can watch the recording here.
Upcoming
If you know about other relevant events in the near future, please mention them in the comments or send an email to admin@obaysch.net.
2022-05-11: Snowflake Data Vault User Group with me (online)
2022-05-18: UK Data Vault User Group with Dirk Lerner (online)
2022-05-19: Data Modeling Meetup with Roelant Vos (online)
2022-06-01/03: Knowledge Gap data modeling & data architecture conference (online)
2022-06-02: The Present and Future of Data Warehousing (free Knowledge Gap companion event, online)
2022-06-20/22: Data Vault Training (CDVDM), Genesee Academy (online)
2022-06-20/22: TDWI Conference (Munich, Germany)
2022-06-21: Data Modeling Beergarden – TDWI Special 2022 (Munich, Germany)
Completeness & Consistency (Part 1)
The consumers of data from a data warehouse usually expect the data in the presentation layer to be clean, complete and consistent. In most cases, this expectation is somewhat at odds with the data quality found in the operational systems that feed the data warehouse.
Dealing with Existing Data Quality Issues
Historically, data warehouse developers have tried to hide all these data quality issues from consumers and to deliver a sugarcoated version of the data, often at great expense. But a more nuanced approach tends to offer a better cost-benefit ratio:
Fixing a data quality issue at the source is almost always the cheapest and most sustainable solution. Instead of masking problems in operational data with complex business logic, the data warehouse should make it easier to fix them by exposing them through raw marts, error marts or data quality screens. Ideally, an automatic process reports errors and inconsistencies directly back to the operational systems.
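A data quality screen can be sketched as a simple routing step: records that pass all checks flow on, records that fail are flagged and sent to an error mart so the issue can be reported back and fixed at the source. The rules and field names below are made up for illustration.

```python
# Hypothetical data quality screen: instead of silently fixing bad records,
# flag them and route them to an error mart for follow-up at the source.

def screen(records, rules):
    """Split records into (clean, errors) based on named validation rules."""
    clean, errors = [], []
    for record in records:
        failed = [name for name, check in rules.items() if not check(record)]
        if failed:
            errors.append({"record": record, "failed_rules": failed})
        else:
            clean.append(record)
    return clean, errors

# Example rules and input (all invented for illustration)
rules = {
    "country_known": lambda r: r.get("country") not in (None, "", "??"),
    "name_present":  lambda r: bool(r.get("name")),
}

records = [
    {"name": "Acme Corp", "country": "DE"},
    {"name": "Globex", "country": "??"},
]

clean, errors = screen(records, rules)
print(len(clean), len(errors))  # 1 1
```

Keeping the failed rule names with each rejected record makes the feedback to the operational system actionable rather than a bare reject count.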
In case it’s not possible to fix an issue at the source, it often makes sense to let subject matter experts and other business users make the corrections. They know how business processes are supposed to work and are usually in a better position than some random data warehouse developer to provide the correct values. These corrections can then serve as another source to the data warehouse that overrides bad data from operational systems.
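Treating corrections as another source boils down to a per-attribute merge where the correction wins over the operational value. A minimal sketch, with all keys and attribute names invented for illustration:

```python
# Hypothetical example: business-user corrections override operational data.
# Both inputs are keyed by a business key; only corrected attributes are stored.

operational = {
    "C001": {"name": "Acme Corp", "country": "DE"},
    "C002": {"name": "Globex", "country": "??"},   # bad value from the source system
}

corrections = {
    "C002": {"country": "US"},  # value supplied by a subject matter expert
}

def apply_corrections(operational, corrections):
    """Merge per-attribute corrections over operational records."""
    result = {}
    for key, record in operational.items():
        merged = dict(record)
        merged.update(corrections.get(key, {}))  # correction wins where present
        result[key] = merged
    return result

print(apply_corrections(operational, corrections)["C002"])
# {'name': 'Globex', 'country': 'US'}
```

Storing only the corrected attributes (not full copies of the record) keeps the correction source small and makes it obvious what the business actually changed.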
Often, data describing a certain concept or subject area comes from different sources, but not all of these sources are of equal quality. A simple ranking of these sources can ensure that, for each instance of a concept, only the data from the most reliable source available makes it to the presentation layer.
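Such a source ranking can be as simple as a lookup table of source names to trust levels, applied per business key. The source names, ranks and records below are made up for illustration:

```python
# Hypothetical source ranking: lower rank = more trusted. For each business key,
# keep the record from the best-ranked source that actually delivered data.

SOURCE_RANK = {"crm": 1, "erp": 2, "legacy": 3}

# (business_key, source, record) triples from the staging area
records = [
    ("C001", "legacy", {"name": "Acme"}),
    ("C001", "crm",    {"name": "Acme Corp"}),
    ("C002", "erp",    {"name": "Globex"}),
]

def best_per_key(records, ranking):
    """Return, per business key, the record from the most reliable source."""
    best = {}
    for key, source, record in records:
        rank = ranking[source]
        if key not in best or rank < best[key][0]:
            best[key] = (rank, record)
    return {key: record for key, (rank, record) in best.items()}

print(best_per_key(records, SOURCE_RANK))
# {'C001': {'name': 'Acme Corp'}, 'C002': {'name': 'Globex'}}
```

In SQL-based loads the same idea is typically expressed with a window function that ranks rows per business key and keeps only the top-ranked one.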
Only if none of the above helps should you seriously consider implementing complex logic to deal with data quality issues. In that case, try to implement this logic in a way that is understandable to whoever comes after you and reusable for other consumers.
Not Adding New Data Quality Issues
Given all the data quality issues you are likely to encounter in your sources, you should try not to add issues of your own making when loading data to your data warehouse.
Loading everything in parallel and presenting the result to downstream consumers in near real time sounds sexy but can lead to timing issues that create incomplete, inconsistent data instead of avoiding it. If your consumers are fine with a data mart that is refreshed every night, giving them exactly that (and not more) might make it easier for you to present high-quality data.
Outlook
In some of the next issues of Model Your Reality, we’ll look at various kinds of data quality, integration and error handling issues in more detail. The approaches described there will help you achieve a reasonable level of completeness and consistency in your data warehouse with a reasonable level of effort.