Relational vs. Ensemble

It’s About Time

Aug 12, 2022

Welcome to the eleventh issue of Model Your Reality, a newsletter with musings about data modeling, data warehousing and the like.

Until further notice, each issue will contain of two parts:

a list of data events that might be of interest for you (they definitely are of interest for me) and
some thoughts about a certain data topic (like data vault modeling patterns).

Let’s get started!

Data Events

Recent

I’m still in the process of uploading recordings from this year’s Knowledge Gap data modeling & data architecture conference. They’ll keep appearing in my YouTube channel on the Knowledge Gap playlist over the next few weeks.

Upcoming

If you know about other relevant events in the near future, please mention them in the comments or send an email to admin@obaysch.net. Online events preferred.

2022-09-06/08/13: Data Modeling Certification (DMC), TEDAMOH (online)
2022-09-19: Data Modeling Meetup on SAP, data vault and Coalesce (online)
2022-10-04: Data Modeling Meetup on data products and data marts (online)
2022-10-06/13: Data Modeling Masterclass, TEDAMOH (in German, online)
2022-10-10/12: Data Vault Training (CDVDM), Genesee Academy (online)
2023-05-24/25/26: Knowledge Gap data modeling & data architecture conference (online)

Relational vs. Ensemble

The relational model as such doesn’t support temporality very well. Temporality, however, is essential for data warehousing. So, you need to make some modifications.

Back in the 1990s, people naively added timestamps to 3NF tables (3NF+T, early Inmon architecture) but this had some issues with ripple effects of changes etc., leading to what some call ensemble modeling styles like head-version, data vault and anchor modeling who can handle relationships and structural changes better.

To be able to store historized data, the primary key of your relation has to include both an identifier for an instance and a timestamp for the primary timeline (see Time in Databases for the terminology). This leads to problems both with normalization (probably not all attributes are dependent on identifier + timestamp, some are most likely only dependent on the identifier itself) and foreign keys (usually, you’d want to reference the instance, not a version of the instance with identifier + timestamp).

The best way to remedy this is to separate the things that change from the things that don’t change. So, you will split the relation into a non-temporal part (the identifier plus maybe some immutable attributes) and a temporal part (identifier, timestamp and mutable attributes). And voila, you are using an ensemble modeling pattern (head + version, hub + satellite, anchor + attribute(s), ...).

If then you decide to always use intersection tables for relationships (as suggested by Fabian Pascal and David McGoveran for clean 5NF relations), you have re-invented Data Vault (hubs, satellites and links).

If on top of that, you group your attributes into distinct information areas like name, address and geographical area, you have re-invented Focal.

And if you decide to go full 6NF and put each attribute in its own table, you have re-invented Anchor Modeling.

After all this, I hope the benefits of an ensemble modeling approach are obvious: it helps you handle change over time (both changes of attribute values and structural changes like adding attributes or adding relationships).

This issue is based on a LinkedIn post the contents of which will hopefully be easier to find now that it has become a Substack issue.