Welcome to the first issue of Model Your Reality, a newsletter with musings about data modeling, data warehousing and the like.
Until further notice, each issue will contain of two parts:
a list of upcoming data events that might be of interest for you (they definitely are of interest for me) and
some thoughts about a certain data topic (like data vault modeling patterns).
Let’s get started!
Upcoming Data Events
If you know about other relevant events in the near future, please mention them in the comments or send an email to admin@obaysch.net.
2021-11-24: Data Modeling Meetup with Petr Beles (online)
2021-11-25: Business Vault – Das unbekannte Wesen, DDVUG (online, in German)
2021-11-25/26: Temporal Data in a Fast-changing World, TEDAMOH (online, in German, multiple appointments)
2021-11-29/30: ELM Deep Dive with Artifacts, Genesee Academy (online)
2021-12-07/09: Data Vault Training (CDVDM), Genesee Academy (online & Stockholm)
2021-12-08: dbtvault Meetup (online)
2021-12-14: Data Modeling Meetup with Francesco Puppini (online)
2022-06-01/03: Knowledge Gap data modeling & data architecture conference (online)
Data Vault Modeling Patterns (1/2)
If you’re new to the topic or don’t have a lot of practical data vault experience, you might want to consult my series of articles on Modern Data Warehousing with Data Vault here or have a look at one of the books recommended in the Data Vault section here to get the most out of this section.
Introduction
When building a data vault model, you encounter some of the same situations again and again. Therefore it is good to have something similar to the structural conventions described here so that you deal with these situations in a consistent manner. The data vault modeling patterns described in this little series will help you to achieve this consistency by recognizing standard situations and sticking to standard solutions for resolving them.
The most important data vault modeling patterns include options for
some of the finer points of link design,
modeling hierarchies (when one instance is subordinate to another),
modeling identity (when two instances actually refer to the same thing),
modeling headers and line items and
handling lookup values.
To raise awareness of these data vault modeling patterns and to point out the standard solution for a certain situation, consider creating some documentation similar to the one for the structural conventions. Your successors will thank you.
Keyed Instance
When representing connections between instances of concepts as a link, in many cases you have to include an additional concept in the form of a so called keyed-instance hub.
In general, you need a keyed-instance hub if:
The connection is transferable (e. g., an Employee can work for different departments over time and you have to record the currently active Department Assigment).
You have to store some details about the connection (e. g., quantity and actual unit price for a Product on a Sale).
There can be multiple distinguishable connections involving the same instances of concepts (e. g., a Visitor can go to the same Website multiple times, each time being a different Visit with different detail values).
You want to be able to connect other concepts to a specific connection without repeating certain combinations of concepts over and over again or building any link-on-link constructs (e. g., a Delivery can include some Products on a Sale but not others that are currently out of stock and will be part of a later Delivery).
If you’re using link satellites, you usually can get around creating a keyed-instance hub in cases 1 and 2 by attaching a link satellite to the respective link. In cases 3 and 4, you have to create a keyed-instance hub no matter what.
The diagram above can be seen as an example of all four cases:
A Customer can subscribe to different Services over time.
There are relevant details about a Customer–Service combination (how the Customer subscribed to the Service, why the Customer subscribed to the Service , …).
A Customer can stop using a Service and then start subscribing to the same Service again.
There are possible connections to a Customer-Service combination (you send an Invoice to the Customer for using the Service for a period of time, the Customer can file a Complaint about some problems with the Service, …).
This keyed-instance pattern occurs so often that some people even argue that every link should include a keyed-instance hub by default.
Hierarchy
Hierarchical connections are another common use case:
An Employee reports to another Employee.
A Car Part is made of multiple other Car Parts.
An Organizational Unit consists of multiple smaller Organizational Units.
In general, there are two ways of dealing with this use case.
If the two instances involved in the hierarchical connection are indeed instances of the same concept, you can create a hierarchical link that includes the same hub twice (plus maybe an additional keyed-instance hub). The Employee-reports-to-Employee connection most likely would be implemented as a hierarchical link. To avoid confusion later on, it is important to clearly mark the role played by each of the two identifiers referring to the same hub (by calling them MANAGING_EMPLOYEE_SRG and MANAGED_EMPLOYEE_SRG, for example).
If the two instances involved in the hierarchical connection represent different concepts, you can create a regular link between two different hubs (plus maybe an additional keyed-instance hub). The Organizational-Unit-consists-of-multiple-Organizational-Units connection would be implemented as a regular link if in your organization, the lower-level Organizational Units are consistently called Teams and the higher-level Organizational Units are consistently called Departments.
Identity
In some cases, you find out that what seemed like two different instances at first actually are one and the same:
Two customer codes from different source systems actually refer to the same Customer.
A new customer code actually refers to an existing Customer that has lost access to our system for some reason.
When migrating from one source system to another, new customer codes had to be created for existing Customers for technical reasons.
Many of these cases can be handled using a same-as link that includes the same hub twice (plus maybe a keyed-instance hub). One of the hub identifiers in the link refers to the master record and the other one the duplicate record. Once again, the role played by each of the two identifiers referring to the same hub should be marked clearly (by calling them MASTER_CUSTOMER_SRG and DUPLICATE_CUSTOMER_SRG, for example).
Sometimes, identity has to be documented using a regular link. Source system A might treat both Customers and Suppliers as Partners while source system B clearly distinguishes between the two. When it’s not always completely obvious which Partners from source system A are actually Customers, it is better to load all of them to a Partner hub and then create a link between this Partner hub and the Customer hub for those Partners that you have found to be Customers.
Outlook
As mentioned in the introduction of the article, there are some data vault modeling patterns for dealing with headers and line items and handling lookup values. We’ll look at them in one of the next issues of Model Your Reality.