Welcome to Model Your Reality, a newsletter with musings about data structure design, data warehousing and the like.
Until further notice, each issue will consist of two parts:
a list of data events that might be of interest to you (they definitely are of interest to me) and
some thoughts about a certain data topic (like data vault modeling patterns).
Let’s get started!
Data Events
Recent
I’ve uploaded the recording from the Data Modeling Meetup on activity schema with Timo Dechau here.
Upcoming
If you know about other relevant events in the near future, please mention them in the comments or send an email to admin@obaysch.net. Online events preferred.
2022-12-05/06/07: Data Vault Training (CDVDM), Genesee Academy (Stockholm)
2022-12-12: Ghosts of Data Warehousing Past, Present and Future (online)
2022-12-14: UK Data Vault User Group on data mesh and data vault (online)
2023-02-22: Data Modeling Meetup with Marielle Dado (online)
2023-03-06–08: Data Vault Training (CDVDM), Genesee Academy (online)
2023-03-07: Data Modeling Meetup with Nikolai Golov et al. (online)
2023-03-13/17: Data Modeling Masterclass (in German, online)
2023-03-20: Data Modeling Meetup with Dirk Lerner (online)
2023-05-24/25/26: Knowledge Gap data modeling & data architecture conference (online)
Completeness & Consistency (Part 5)
Previous entries in the series can be found here, here, here and here.
This series assumes that you are using a data vault approach. If you are new to data vault, you might want to check out my Modern Data Warehousing with Data Vault series first.
The consumers of data from a data warehouse usually expect the data in the presentation layer to be clean, complete and consistent. In most cases, this expectation is somewhat at odds with the data quality found in the operational systems that feed the data warehouse.
Historically, data warehouse developers have tried to hide all these data quality issues from consumers and to deliver a sugarcoated version of the data, often at great expense. But a more nuanced approach tends to offer a better cost-benefit ratio. This is the last part of a little series in which we’re looking at examples of such an approach.
Error Handling
Bad data or bugs in load processes might prevent records from being loaded into your data vault in the first place.
There might be duplicates caused by inconsistencies in the source system, or a human-readable identifier might not be as unique as you thought it would be. The data type of a certain column might have become too small after changes in the source system (or may have been too small from the beginning without anybody noticing).
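As a quick illustration of the duplicates problem, a check along these lines can surface human-readable identifiers that turn out not to be unique; the staging table stg_customer and its customer_number column are hypothetical names for this sketch.

```sql
-- Hypothetical staging table: find human-readable identifiers that
-- occur more than once and would therefore collide in the hub.
SELECT
    customer_number,
    COUNT(*) AS occurrences
FROM stg_customer
GROUP BY customer_number
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
```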
Error Satellites
The errors caused by these issues shouldn’t break your load processes but they should also not go unnoticed. If you can’t load a record into a hub, link or satellite, you should consider putting it in an error satellite.
This satellite should have the largest possible data types for its columns and no uniqueness constraints, so that it can ingest all erroneous records for further analysis. As an additional check, you can compare the records in the staging area with the records in the target objects to see whether anything is missing.
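Here is a minimal sketch of what such an error satellite and the accompanying completeness check might look like; all table and column names are made up for illustration, and the TEXT columns stand in for whatever “largest possible” types your platform offers:

```sql
-- Hypothetical error satellite: wide columns, no primary key and no
-- uniqueness constraints, so that every rejected record fits.
CREATE TABLE err_sat_customer (
    hub_customer_hk  TEXT,       -- hash key, if it could be computed
    load_dts         TIMESTAMP,  -- when the load attempt happened
    record_source    TEXT,       -- originating system or file
    error_reason     TEXT,       -- why the record was rejected
    raw_payload      TEXT        -- the record as received, uninterpreted
);

-- Completeness check: business keys present in staging but missing
-- from the target hub after a load.
SELECT s.customer_number
FROM stg_customer AS s
LEFT JOIN hub_customer AS h
       ON h.customer_number = s.customer_number
WHERE h.customer_number IS NULL;
```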
Error Marts
If you want to make the errors (or at least statistics about errors) visible to the stakeholders of your data warehouse, consider creating an error mart in the presentation layer.
As always, think about the right presentation mode for your consumers (see here). With the help of such an error mart, you should be able to identify and tackle the most prevalent error types.
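An error mart could start out as a simple view over the error satellite that aggregates errors by reason and load day; again, the names are illustrative and build on the error satellite sketched above:

```sql
-- Hypothetical error mart view: error counts per reason and load day,
-- suitable for a dashboard or a simple report.
CREATE VIEW error_mart_daily AS
SELECT
    CAST(load_dts AS DATE) AS load_date,
    error_reason,
    COUNT(*)               AS error_count
FROM err_sat_customer
GROUP BY CAST(load_dts AS DATE), error_reason;
```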
Corrective Measures
Depending on the reasons for the errors, different types of corrective measures might be necessary:
If the errors are caused by data quality issues in source systems, you might want to implement some of the data quality tools from part 4.
If the errors are caused by bugs in your code, you should fix these bugs and think about automated testing for loading routines and business logic.
If the errors are caused by changes to source system table structures or file formats that you didn’t know about, you might want to implement a detection mechanism for these changes, for example by monitoring source system data dictionaries (see the sketch after this list), or simply work on your relationship with the people responsible for these systems so that they tell you about upcoming changes.
Alternatively, you can try to decrease the number of source system changes you have to deal with by agreeing on a reasonably stable data interface between source systems and data warehouse. This interface can be part of a data contract between you and the people managing the respective source system.
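To make the data dictionary monitoring idea concrete, here is one possible sketch; it assumes the source system exposes a standard INFORMATION_SCHEMA and that you maintain a snapshot table expected_columns describing the structure your loads were built against (both are assumptions for this example):

```sql
-- Compare the source system's current data dictionary against a
-- snapshot of the column structure the load processes expect.
SELECT
    c.table_name,
    c.column_name,
    c.data_type,
    c.character_maximum_length
FROM information_schema.columns AS c
LEFT JOIN expected_columns AS e
       ON  e.table_name  = c.table_name
       AND e.column_name = c.column_name
       AND e.data_type   = c.data_type
WHERE c.table_schema = 'sales'   -- hypothetical source schema
  AND e.column_name IS NULL;     -- new or changed columns
```

Dropped columns can be caught the same way by reversing the join direction.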
Whenever you have implemented any of these corrective measures, you can then use the error mart to see their (hopefully positive) effects and communicate them to stakeholders.
And now that you have read all the entries in this little series, you will hopefully be able to achieve a reasonable level of completeness and consistency in your data warehouse with a reasonable level of effort.