Welcome to the eighth issue of Model Your Reality, a newsletter with musings about data modeling, data warehousing and the like.
Until further notice, each issue will consist of two parts:
a list of data events that might be of interest to you (they definitely are of interest to me) and
some thoughts about a certain data topic (like data vault modeling patterns).
Let’s get started!
Data Events
Recent
In case you missed the Data Modeling Meetup with Roelant Vos on how business modeling helps you build a better data vault, you can watch the recording here.
I’m currently in the process of uploading recordings from this year’s Knowledge Gap data modeling & data architecture conference. They’ll appear on my YouTube channel in the Knowledge Gap playlist over the next few weeks.
Upcoming
If you know about other relevant events in the near future, please mention them in the comments or send an email to admin@obaysch.net.
2022-06-13/15: Budapest Data Forum (Budapest, Hungary and online)
2022-06-15: UK Data Vault User Group on ontologies (online)
2022-06-20/22: Data Vault Training (CDVDM), Genesee Academy (online)
2022-06-20/22: TDWI Conference (Munich, Germany)
2022-06-21: Data Modeling Beergarden – TDWI Special 2022 (Munich, Germany)
2022-07-04: DAMA South Africa on kinds of time with me (online)
2022-07-13: UK Data Vault User Group on data mesh (online)
Completeness & Consistency (Part 2)
The consumers of data from a data warehouse usually expect the data in the presentation layer to be clean, complete and consistent. In most cases, this expectation is somewhat at odds with the data quality found in the operational systems that feed the data warehouse.
Historically, data warehouse developers have tried to hide all these data quality issues from consumers and deliver a sugarcoated version of the data, often at great expense. But a more nuanced approach tends to offer a better cost-benefit ratio. We’ll look at examples of such an approach in the next few issues.
Integration Issues
One of the main tasks of a data warehouse is the integration of data from different sources. This means that when building a data warehouse, you will face integration issues of varying severity.
Non-overlapping Data Sets
One of the simpler issues you might encounter is bringing together non-overlapping data sets describing the same concept (e.g., Sales data from different stores or even different countries). The recipe for handling this issue is pretty straightforward:
Load all identifiers from the different non-overlapping sources to a single hub representing the concept (e.g., SALE_HUB).
Load all details from each of the different non-overlapping sources to a satellite attached to that hub. Just in case your sources aren’t as non-overlapping as you thought, you should create one satellite per source (instead of one big satellite for all sources).
For presentation, combine the data from all the satellites with a simple UNION ALL statement and join the result to the hub if necessary (see the sketch after this list).
Optionally, you can add precautionary measures in case any sources overlap (like picking the latest satellite record for this instance, sending a notification to someone and/or loading the overlapping records to an error mart).
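A minimal sketch of such a presentation query, assuming made-up names like SALE_HK (the hub’s hash key), SALE_STORE_A_SAT and SALE_STORE_B_SAT (one satellite per source):

-- One SELECT per source satellite; the sources don't overlap,
-- so UNION ALL combines them without deduplication.
SELECT SAT.SALE_HK,
       SAT.SALE_DATE,
       SAT.AMOUNT,
       'STORE_A' AS RECORD_SOURCE  -- keep the origin visible
FROM SALE_STORE_A_SAT AS SAT
UNION ALL
SELECT SAT.SALE_HK,
       SAT.SALE_DATE,
       SAT.AMOUNT,
       'STORE_B' AS RECORD_SOURCE
FROM SALE_STORE_B_SAT AS SAT;
-- Join the result to SALE_HUB on SALE_HK if you need hub columns
-- (e.g. the business key) in the presentation layer.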
Overlapping Data Sets
Bringing together overlapping data sets describing the same concept is a bit more complicated. In this case, you need to choose a strategy to pick details from one source over the other. In practice, this means you have to rank your sources by authoritativeness or trustworthiness. You can do this implicitly by using a COALESCE statement (e.g., COALESCE(SALE_A_SAT.ATTRIBUTE, SALE_B_SAT.ATTRIBUTE, SALE_C_SAT.ATTRIBUTE)) or explicitly by assigning a numerical value to each source and then picking the smallest one available for each instance (e.g., MIN(SOURCE_RANK) OVER (PARTITION BY SALE_SRG)).
If you’re ranking your sources explicitly, consider storing the ranking in some kind of configuration table instead of hard-coding the same ranking in all kinds of business vault objects over and over again. Keep in mind that you might have to rank your sources differently for different concepts or different subject areas.
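A sketch of the explicit variant, assuming an invented configuration table SOURCE_RANKING and the combined satellite data from above (all names are illustrative):

-- Pick, per sale, the attribute value from the highest-ranked
-- source that actually delivered a record.
SELECT SALE_HK,
       ATTRIBUTE
FROM (
    SELECT SAT.SALE_HK,
           SAT.ATTRIBUTE,
           ROW_NUMBER() OVER (
               PARTITION BY SAT.SALE_HK
               ORDER BY CFG.SOURCE_RANK  -- 1 = most authoritative
           ) AS RANK_PER_SALE
    FROM COMBINED_SALE_SATS AS SAT  -- e.g. the UNION ALL from above
    JOIN SOURCE_RANKING AS CFG
        ON CFG.RECORD_SOURCE = SAT.RECORD_SOURCE
       AND CFG.CONCEPT = 'SALE'  -- rankings may differ per concept
) RANKED
WHERE RANK_PER_SALE = 1;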
In some cases, a simple ranking isn’t enough because you have to pick one source over the other depending on more complex rules. These cases can usually be handled using CASE … WHEN … statements that might look like this:
CASE
    WHEN CUSTOMER_TYPE = 'B2C' THEN SALE_A_SAT.ATTRIBUTE
    WHEN CUSTOMER_TYPE = 'B2B' THEN SALE_B_SAT.ATTRIBUTE
    ELSE SALE_C_SAT.ATTRIBUTE
END
Identification of Instances
The thorniest integration issues revolve around the identification of instances. If you have overlapping sources for details about a concept and one of them sometimes or always doesn’t provide a usable identifier, you have to somehow identify the instances coming from that source so that you can match them to instances from other sources.
For example, your organization might have a customer ID to uniquely identify Customers. However, Customers often contact you without providing this ID (because they have forgotten it or never knew it in the first place). This means that you have to take whatever details the Customer has given you and compare them to the details you already have about your Customers (like names, email addresses or phone numbers) to determine exactly which Customer you are dealing with. Only then can you connect the Customer Message to a specific Customer with some confidence.
Identification processes can be complicated and deliver results of varying reliability. If you’re dealing with a process like that, document metadata like the level of confidence or the attribute(s) used for identification in your data vault.
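For example, a matching query might record both pieces of metadata alongside the match itself. A sketch with invented table names (CUSTOMER_MESSAGE_STAGE, CUSTOMER_SAT) and placeholder confidence values:

-- Match incoming messages to known Customers by email first,
-- falling back to phone number, and record how the match was made.
SELECT MSG.MESSAGE_ID,
       CUST.CUSTOMER_ID,
       CASE
           WHEN MSG.EMAIL = CUST.EMAIL THEN 'EMAIL'
           ELSE 'PHONE'
       END AS MATCHED_ON,  -- attribute used for identification
       CASE
           WHEN MSG.EMAIL = CUST.EMAIL THEN 0.9  -- placeholder values
           ELSE 0.6
       END AS MATCH_CONFIDENCE
FROM CUSTOMER_MESSAGE_STAGE AS MSG
JOIN CUSTOMER_SAT AS CUST
    ON MSG.EMAIL = CUST.EMAIL
    OR MSG.PHONE = CUST.PHONE;

A real matching process would also have to handle messages that match several Customers (or none at all), but the principle stays the same: store the confidence and the matching attribute, don’t throw them away.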
Outlook
In the next few issues of Model Your Reality, we’ll continue looking at various kinds of timing, data quality and error handling issues in more detail. The approaches described there will help you achieve a reasonable level of completeness and consistency in your data warehouse with a reasonable amount of effort.