Welcome to the fourth issue of Model Your Reality, a newsletter with musings about data modeling, data warehousing and the like.
Until further notice, each issue will contain of two parts:
a list of upcoming data events that might be of interest for you (they definitely are of interest for me) and
some thoughts about a certain data topic (like data vault modeling patterns).
Let’s get started!
Upcoming Data Events
If you know about other relevant events in the near future, please mention them in the comments or send an email to admin@obaysch.net.
2022-02-08: Data Modeling Meetup with Michael Müller (online)
2022-02-09: UK Data Vault User Group with myself (online)
2022-02-16: Snowflake Data Vault User Group on dbtvault (online)
2022-02-22: Fifty Years of Relational with Chris Date (online)
2022-03-08: Data Modeling Meetup with Andrew Foad (online)
2022-03-24/25: Temporal Data in a Fast-Changing World, TEDAMOH (online)
2022-04-06/08: Data Vault Training (CDVDM), Genesee Academy (online)
2022-06-01/03: Knowledge Gap data modeling & data architecture conference (online)
Presentation Modes for Data Marts
Introduction
While data vault is a great choice for the core layer of your data warehouse, a data vault model with its large number of objects isn’t necessarily the most accessible way of presenting data to human users (and other downstream consumers) in the presentation layer. Instead, you should consider the presentation modes outlined in this little series.
For each use case, try to pick the simplest presentation mode that meets your requirements. If current snapshots are all your consumers need, don’t give them bitemporal historization.
Categorizations & Considerations
Presentation modes can be categorized in two ways:
which modeling paradigm they follow (flat tables, Kimball-style dimensional, temporal dimensions, …) and
which historization approach they use (current snapshot, history of snapshots, current history of changes, …).
From this rough categorization, you can already see that there is a greater variety of presentation modes than many people are aware of. Armed with this knowledge, you are able to make informed choices instead of investing a lot of effort in building fact tables and SCD2 dimensions for everything just because that’s the only presentation mode you know.
Finally, you should not only consider how to present your data but also what data to present. While many consumers value all the integration and enrichment work you have done in the core layer of your data warehouse, some don’t and just want quick access to raw source data. Don’t waste your time cleansing and refining data for the latter group.
Modeling Paradigm 1: Flat
Putting all the data needed by a consumer or group of consumers in one big flat table is definitely the simplest approach for modeling presentation layer objects. It has some limitations and might feel strange to data warehouse developers used to dimensional models but more often than not, it just works. Many consumers like it because they don’t have to understand which tables to join and how to join them.
However, before you give up on structuring your presentation layer and just put everything you expose to the outside world into flat tables, you should consider the following points:
If your only presentation layer object is a flat table based on transactions, make sure that your consumers aren’t interested in e. g. products that haven’t been sold because they won’t see them in your flat table. If they are interested in master data like that, it might be wise to use one of the other modeling paradigms.
Some of the historization approaches outlined here don’t really work with flat tables. Make sure you are still able to fulfill your historization requirements.
Maybe your consumers want some room for creativity and value the greater flexibility offered by more sophisticated modeling paradigms.
Maybe your consumers have been working with dimensional models for many years and just don’t want to trade them for big unruly flat tables.
Maybe your consumers use BI tools that work better (or only) with dimensional models.
Maybe you want to build a relationship with your consumers and get a feeling for their needs. Building a dimensional model together could be a great opportunity for that.
Maybe you want to sustain a certain level of data modeling literacy in your organization. If people aren’t even exposed to dimensional models anymore, this might be challenging.
If none of these points are an issue for at least an important subset of your consumers, you should seriously consider flat tables as a delivery mechanism.
Modeling Paradigm 2: Dimensional
While flat tables are gaining popularity, dimensional modeling is still the classic modeling paradigm for the presentation layer. It occupies a sweet spot between paradigms like data vault on the one hand (more flexible but also more complicated) and flat tables on the other hand (even simpler but not flexible at all).
Fact Tables
Dimensional modeling is based on dividing your data into two types of tables (or views): fact tables and dimension tables. Fact tables store all kinds of numerical measurements (revenue, cost, number of visits, …). These measurements are accompanied by foreign keys to dimension tables that store their descriptive context (customer, employee, store, …). If your try to compare these table types to data vault constructs, a fact table is similar to a link with measurements added to it and a dimension table is similar to a hub-satellite hybrid with measurements removed.
For different use cases, there are different types of fact tables:
A transaction fact table stores measurements at the lowest level of granularity, usually a single transaction or its line items. The classic example is the beep of a scanner cash register at a supermarket that corresponds to a product sold to a customer by the employee operating the cash register at a sales event in a specific store at a certain date.
A periodic snapshot fact table stores stores measurements that have been aggregated over a certain time period (like a day, week or month). The periodic snapshot counterpart to the classic cash register beep example would be supermarket sales aggregated by month, product and store.
An accumulating snapshot fact table follows a business process through its standard milestones and stores measurements along the way. The classic example would be the processing of an order. A record in the fact table is created when a customer places an order and updated e. g. when the ordered item is ready for shipment, has been shipped, has been received by the customer and has been paid by the customer.
A factless fact table is a fact table without measurements (and therefore looks very much like a data vault link). An example would be a fact table monitoring student attendance. You want to store which student was present at a certain class on a certain day but you don’t want to record the exact number of minutes the student was present (or the number of times the student fell asleep during the class).
Dimension Tables
Dimension tables are usually based on core business concepts at the lowest level of granularity. There might be a customer dimension that stores customer data including all categorizations and groupings of customers that might be relevant for analysis. Where necessary, you can create dimensions at a higher level of granularity (e. g. for use in a periodic snapshot fact table that records sales by product category and customer type).
In addition to these concept-based dimensions, most dimensional models include a special time-based dimension as well: the date dimension. When analyzing data, it is used for all kinds of temporal groupings (to compare different months, different days of the week and so on). Contrary to other dimensions, it is usually generated instead of loaded from source systems.
In dimensional modeling literature, a distinction is made between several different types of dimension tables depending on the historization approach used and other structural factors. Somewhat opaquely, these types are called slowly changing dimensions (or SCD for short) type 0, 1, 2, 3 and so on.
While most SCD types have lost their relevance in times of cheap storage, you might sometimes encounter SCD1 (corresponding to a current snapshot) and SCD2 (corresponding to a current history of changes). For an overview of relevant historization approaches, see here.
Each dimension table contains a surrogate identifier for its records. Traditionally, these surrogate identifiers have been sequential integers but in recent years, hash keys have become more popular because they can be readily taken over from an underlying hash-based data vault model.
Modeling Paradigm 3: Temporal Dimensions
There is a recent variation on dimensional modeling called temporal dimensional modeling that gets rid of the zoo of slowly changing dimension types by splitting up dimension tables into an immutable part (similar to a hub) and a mutable part (similar to a satellite).
While not yet widely applied, temporal dimensional models are an interesting new option for the presentation layer of a data warehouse based on data vault. They can easily be derived from a data vault model and are almost as easy to use as a regular dimensional model. If necessary, flat-table or current-snapshot dimensional views can be put on a temporal dimensional model for further simplification.
If you want to learn more about temporal dimensional modeling, you can read the original paper here.
Outlook
As mentioned earlier in this article, there are different historization approaches of various usefulness for the presentation layer. We’ll look at them in one of the next issues of Model Your Reality.