Welcome to the ninth issue of Model Your Reality, a newsletter with musings about data modeling, data warehousing and the like.
Until further notice, each issue will consist of two parts:
a list of data events that might be of interest to you (they definitely are of interest to me) and
some thoughts about a certain data topic (like data vault modeling patterns).
Let’s get started!
Data Events
Recent
I’m still in the process of uploading recordings from this year’s Knowledge Gap data modeling & data architecture conference. They’ll keep appearing on my YouTube channel in the Knowledge Gap playlist over the next few weeks.
Upcoming
If you know about other relevant events in the near future, please mention them in the comments or send an email to admin@obaysch.net.
2022-07-13: UK Data Vault User Group on data mesh (online)
2022-07-18: Data Modeling Meetup on data vault and anchor modeling with Nikolai Golov (online)
2022-09-06/08/13: Data Modeling Certification (DMC), TEDAMOH (online)
2022-10-10/12: Data Vault Training (CDVDM), Genesee Academy (online)
Completeness & Consistency (Part 3)
The consumers of data from a data warehouse usually expect the data in the presentation layer to be clean, complete and consistent. In most cases, this expectation is somewhat at odds with the data quality found in the operational systems that feed the data warehouse.
Historically, data warehouse developers have tried to hide all these data quality issues from consumers and to deliver a sugarcoated version of the data, often at great expense. But a more nuanced approach tends to offer a better cost-benefit ratio. In this little series, we’re looking at examples of such an approach.
Timing Issues
In data warehousing, timing issues keep popping up again and again. If anything, they have become more frequent with the trend towards near real-time and independent parallel loading. But if you keep them in mind when building your load processes, most of them can be handled pretty easily.
Child Before Parent
Having a record about an instance of a concept in a link or satellite but not yet in the corresponding hub (the child is loaded before the parent) is one of the issues related to independent parallel loading.
If you use sequence keys or have foreign keys enabled, this can’t happen because you have to look up the sequence key in the hub (and create one if it doesn’t exist) before loading the link or satellite. If you load all data vault objects independently and in parallel, on the other hand, it’s perfectly possible that a link or satellite load finishes before the corresponding hub load. And the related issue of having a record about an instance of a concept in one satellite but not yet in another (one sibling is loaded before the other) can occur even without independent parallel loading.
There are two possible solutions for this problem:
You can delay loading downstream objects (in the business vault or in the presentation layer) until all relevant extracts from source systems have been processed, losing some of the performance improvements promised by independent parallel loading.
Or you can take the latest load dates of the relevant hubs, links and satellites from your load metadata and only propagate data to downstream objects that was loaded at or before the oldest of these latest load dates (see here).
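The second solution can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes your load metadata is available as a mapping from object name to latest load date, and all object names shown are hypothetical.

```python
from datetime import datetime

def downstream_watermark(latest_load_dates):
    """Return the oldest of the latest load dates across all relevant
    raw vault objects (hubs, links, satellites).

    Downstream loads should only pick up records with a load date at or
    before this watermark, so no child is processed before its parent
    has arrived.
    """
    if not latest_load_dates:
        raise ValueError("no load metadata available")
    return min(latest_load_dates.values())

# Hypothetical load metadata:
meta = {
    "hub_customer":        datetime(2022, 7, 1, 3, 15),
    "sat_customer_crm":    datetime(2022, 7, 1, 3, 40),
    "link_customer_order": datetime(2022, 7, 1, 2, 55),
}
watermark = downstream_watermark(meta)  # 2022-07-01 02:55
```

Everything loaded after the watermark simply waits for the next downstream run, which keeps parents and children consistent without serializing the raw vault loads themselves.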
For any setup where foreign keys aren’t enabled, it might make sense to regularly run checks on your hubs to verify that all expected surrogate identifiers are present. If some are missing, you should investigate further to make sure there aren’t any bugs in your load processes.
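Such a check boils down to a set difference between the child object’s keys and the hub’s keys. Here is a minimal sketch in Python, assuming the surrogate identifiers (e.g., hash keys) have already been fetched from both objects; the key values are made up for illustration:

```python
def missing_hub_keys(hub_keys, child_keys):
    """Return surrogate identifiers that appear in a link or satellite
    but are missing from the corresponding hub."""
    return set(child_keys) - set(hub_keys)

# Hypothetical hash keys:
hub = {"a1", "b2", "c3"}
sat = {"a1", "b2", "d4"}
orphans = missing_hub_keys(hub, sat)  # {"d4"} -> investigate further
```

In practice you would run the equivalent query directly in the database, but the logic is the same.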
System A Before System B
When dealing with overlapping data sets (see previous issue), there are additional classes of timing issues that you should be aware of.
If data about an instance of a concept is stored in two different source systems, it’s possible that one system updates its data earlier or more frequently than the other (different source system change timing). In this case, ranking your sources by authoritativeness or trustworthiness is especially important to make sure that in downstream objects, you don’t replace data from a trustworthy source with data from a less trustworthy one just because it is more recent.
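One way to encode such a ranking is to select records primarily by source rank and only fall back to recency to break ties within the same source. A minimal Python sketch, with hypothetical source names and a made-up rank table:

```python
from datetime import datetime

def pick_record(records, source_rank):
    """Choose the record from the most trustworthy source (lowest rank);
    within the same source, prefer the most recent record.

    records: list of (source, load_date, payload) tuples
    source_rank: dict mapping source name to rank (lower = more trusted)
    """
    return min(records, key=lambda r: (source_rank[r[0]], -r[1].timestamp()))

# The shop record is newer, but the CRM is ranked as more trustworthy:
records = [
    ("crm",  datetime(2022, 7, 1), {"name": "Ada Lovelace"}),
    ("shop", datetime(2022, 7, 2), {"name": "A. Lovelace"}),
]
rank = {"crm": 1, "shop": 2}
best = pick_record(records, rank)  # the CRM record wins despite being older
```

With this rule, a less trustworthy source can never overwrite a more trustworthy one in downstream objects just by delivering later.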
If data about an instance of a concept is stored in two different source systems, your load processes might cause yet another timing issue (different source system extract timing).
If you load a snapshot from one source system at one point in time (e.g., before midnight) and a snapshot from the other source system at another point in time (e.g., after midnight), combining them might paint a misleading picture of the data because it combines two different versions of the data into something that hasn’t ever existed outside the data warehouse. In this case, you should either try to minimize the time distance between the two extracts or use a different extraction mechanism where you don’t just get daily snapshots but also intraday changes (e.g., change data capture).
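If you record the extract timestamp of each snapshot, you can at least flag combinations whose extracts lie too far apart. A minimal sketch, assuming per-system extract timestamps are tracked in your load metadata and a made-up one-hour tolerance:

```python
from datetime import datetime, timedelta

def extract_drift_ok(extract_times, max_drift=timedelta(hours=1)):
    """Check whether snapshots from different source systems were taken
    close enough together in time to be safely combined."""
    times = list(extract_times.values())
    return max(times) - min(times) <= max_drift

# Hypothetical extract timestamps, 20 minutes apart across midnight:
extracts = {
    "system_a": datetime(2022, 7, 1, 23, 50),
    "system_b": datetime(2022, 7, 2, 0, 10),
}
ok = extract_drift_ok(extracts)  # True: within the tolerated drift
```

What counts as acceptable drift depends entirely on how fast the data changes; the threshold here is only a placeholder.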
Outlook
In the next few issues of Model Your Reality, we’ll continue looking at various kinds of data quality and error handling issues in more detail. The approaches described in this series will help you to achieve a reasonable level of completeness and consistency in your data warehouse with a reasonable level of effort.
About the overlapping context challenge: if an instance always exists in both systems, you can choose which system is the master for which context (attribute). Source the “Name” from one system and the “Address” from the other, even if both attributes exist in both systems. But that requires understanding the data movement in the IT landscape and the purposes of the business processes, and that is too hard for most people. So sourcing everything and hoping for the best is a pretty common approach.
Nice post again! In this post you completely take for granted that a data vault model will be used (no objections from my end). You may want to link to your earlier Data Vault Modeling Patterns (and from there to this one?).