Data warehouse history types
This is the fifth post in a series of twelve in Tim Ford's Entry-Level Content Challenge.
In this post, I will describe the most common data warehouse history types, how they are implemented, and some typical use cases for them.
Why establish data change history in the data warehouse?
One of the arguments for building a data warehouse is to establish data history, even when it isn't implemented in the source systems. There can be many reasons for not implementing data audit trails in the data models of OLTP systems: degraded write performance, added complexity in data models and code (some of which could be handled using SQL Server 2016 temporal tables), and storage cost, to name a few.
Since we take copies of the OLTP data before we load it into the data warehouse, we can also load data as it looked at that time. In this way, we can answer questions like “how much did sales person X account for in 2015, when she was part of team A (when we also know that she moved to team B in 2016)?”
When designing a data warehouse, we typically have to decide which history type we need for each attribute in every entity (dimension). This is an important example of metadata (data about data).
In the following discussion, I will refer to things like staging areas, EDW and datamart layers. If that is unknown (no pun intended) to you, you might want to take a look at my post on data warehouse architecture here: http://www.pontop.dk/#!Data-warehousing-101-what-are-data-warehouses-why-do-we-build-them-and-how/whpr3/56b78f2a0cf2062bd41b6280
Types of history and when to use them
There are many patterns for data history within data warehousing. Here are some of the most commonly used history types (the types are named Ti, where i is a non-negative integer. I guess that T stands for Type):
T0 – never update
The value for an attribute with T0 history is never changed once it has been inserted.
An attribute containing code values could be a candidate for T0 history (if these never change).
T1 – update in place
The value for an attribute with T1 history is simply overwritten whenever data arrives with a new value. T1 is not really history, since we only keep the latest version of the data for the attribute.
If you don’t need data history for an attribute, then use T1 history for it.
T2 – make a new row
The table for an entity with at least one T2 attribute will typically have two date attributes called something like ValidFromDate and ValidToDate (and sometimes a computed column IsCurrent to denote which row holds the current version of the data for an instance of the entity). When we load data into the data warehouse, we check whether attributes of T2 type have values different from the current row. If so, we insert a new row and set its ValidFromDate to the date (and possibly time of day) from which the data row should be valid. The insert of a new row is sometimes called a T2 split. We also set the ValidToDate on the former current row to this date (and update its IsCurrent attribute to False).
T2 history is a way to track how attributes for an instance of an entity looked at a particular point in time. We can reference this version of the instance in fact tables and/or other dimension tables (if our EDW is snow-flaked).
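To make the T2 split concrete, here is a minimal in-memory sketch in Python. Real implementations run set-based SQL during the ETL load, and the business key value and the Team attribute below are my own illustrations, not part of any standard:

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel ValidToDate for the current row

def t2_split(dimension, business_key, new_values, valid_from):
    """Close the current row for business_key and insert a new version
    if any T2 attribute changed (the "T2 split")."""
    current = next(
        (row for row in dimension
         if row["BusinessKey"] == business_key and row["IsCurrent"]),
        None,
    )
    if current is not None:
        if all(current[attr] == value for attr, value in new_values.items()):
            return  # no T2 attribute changed; nothing to do
        current["ValidToDate"] = valid_from   # close the former current row
        current["IsCurrent"] = False
    dimension.append({                        # insert the new current version
        "BusinessKey": business_key,
        **new_values,
        "ValidFromDate": valid_from,
        "ValidToDate": OPEN_END,
        "IsCurrent": True,
    })

# The sales person from the earlier example moving from team A to team B:
dim = []
t2_split(dim, "SP-001", {"Team": "A"}, date(2015, 1, 1))
t2_split(dim, "SP-001", {"Team": "B"}, date(2016, 1, 1))
# dim now holds two rows: the closed team-A version and the current team-B one
```

Fact rows loaded for 2015 would reference the first (team A) version, which is exactly what lets us answer the "how much did she sell while on team A" question.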
T3 – make a new column
If an attribute A has T3 history, we add a new column A_prev to the dimension table (the names have been changed to protect the innocent…). When new data arrives, we overwrite the value of the A attribute and update A_prev to hold the former value of A.
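The T3 update can be sketched in a couple of lines of Python (the SalesTerritory attribute and the _prev naming convention are my illustrations):

```python
def t3_update(row, attr, new_value):
    """T3: keep only the current value and the immediately previous one."""
    if row[attr] != new_value:
        row[attr + "_prev"] = row[attr]  # shift current value into the history column
        row[attr] = new_value

row = {"SalesTerritory": "North", "SalesTerritory_prev": None}
t3_update(row, "SalesTerritory", "South")
# row["SalesTerritory"] is now "South", row["SalesTerritory_prev"] is "North"
```

Note that a second change would overwrite SalesTerritory_prev: T3 only remembers one step back.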
The standard Kimball use case for T3 history is keeping track of sales territory for a sales person. To calculate sales bonuses, you want to be able to see not only how well the person is doing now, but also how well she did in her former position.
Notice that if you already have T2 history for an attribute, you can easily convert this to T3 history. One pattern is to combine the two and have T2 history in the EDW layer, and T3 in datamarts populated from the EDW.
Comparing DSA and EDW data - Business keys to the rescue
In order to implement history for one or more attributes of an entity, we first need to be able to compare a row in the staging area with the rows in an EDW table and deduce whether this is data for a new instance of the EDW entity, or whether we already have data for it and might need to update one or more attribute values. In data warehousing, we use business keys for this. We add a business key attribute (for simplicity, we just operate with one) to every EDW table, and the value of this attribute is our uniquifier (in case we have T2 attributes, it is the combination of BusinessKey and ValidFromDate that must be unique).
Using the business key, we can now join the staging and EDW tables and deduce which staging rows should be inserted as new instances of the EDW entity, and which rows should be used for history updates.
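A sketch of that classification step, assuming staging and EDW rows carry a BusinessKey attribute (in real ETL this would be a single set-based join, not row-by-row Python):

```python
def classify_staging_rows(staging, edw):
    """Join staging and EDW rows on the business key: rows with unknown
    keys become inserts of new instances, rows with known keys are
    candidates for history updates (T1 overwrite, T2 split, ...)."""
    known_keys = {row["BusinessKey"] for row in edw}
    inserts = [row for row in staging if row["BusinessKey"] not in known_keys]
    updates = [row for row in staging if row["BusinessKey"] in known_keys]
    return inserts, updates

edw = [{"BusinessKey": "SP-001", "Team": "A"}]
staging = [{"BusinessKey": "SP-001", "Team": "B"},   # known key -> update path
           {"BusinessKey": "SP-002", "Team": "A"}]   # new key   -> insert path
inserts, updates = classify_staging_rows(staging, edw)
```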
System time or business time?
In T2 history we have the two attributes ValidFromDate and ValidToDate. We can choose between two different strategies for updating their values: using system time (load time) or business time. If we use system time for the T2 splits, the data warehouse history depends on when we load data. This makes it impossible to reload data in the data warehouse without messing up the data history. If we instead let our ETL procedures use timestamps for business time (when the data was really valid) for T2 history, we gain the ability to reload data. But the cost of this flexibility is a much more complicated design for T2 splits, and we also need to keep track of this metadata on the source system attributes.
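The difference boils down to where the ValidFromDate for a T2 split comes from. A minimal sketch (the ChangedDate column name is hypothetical):

```python
from datetime import date

def valid_from(source_row, use_business_time, load_date):
    """Pick the ValidFromDate for a T2 split. Business time takes the
    timestamp carried by the source data itself, so a reload reproduces
    the same history; system time stamps the load date, so the history
    depends on when the load happened to run."""
    return source_row["ChangedDate"] if use_business_time else load_date

row = {"ChangedDate": date(2016, 3, 1)}
valid_from(row, True, date(2016, 6, 15))   # business time: date(2016, 3, 1)
valid_from(row, False, date(2016, 6, 15))  # system time:   date(2016, 6, 15)
```

With business time, running the same load twice on different days yields identical ValidFromDate values; with system time it does not, which is why reloads corrupt the history.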
Where should I store attribute history metadata?
Since SQL Server doesn't support data warehouse history types, you need to store the metadata (Entity, Attribute, History type) somewhere your ETL generator can access it (you are being naughty if you code ETL by hand…). I have seen people use Excel spreadsheets, I have seen metadata saved in database tables, and I have seen it stored as extended properties on the physical entity tables. All three approaches have pros and cons. Excel makes it easy to type metadata in (at least for one person), but is difficult to parse for the generator code. Database tables are friendly both towards multiple people manipulating data and towards the generator code, but have no out-of-the-box UI for data entry. Extended properties have neither a good data entry interface (except maybe SSMS) nor an easy programmatic interface, but have the nice property that they live directly together with the physical representation of the data warehouse tables.
Why not use SQL Server 2016 temporal?
Short answer: because SQL Server 2016 temporal tables only support system time history. Long answer: I wrote a full blog post on this topic back in November 2015. Read more about it here:
More history types
There are many more patterns for data warehouse history out there. This blog post is for the curious beginner, and I only cover the basic patterns described above. Read more about other data warehouse history change types (and hybrid patterns) here:
Or get a copy of Kimball's ETL design handbook (a real book made of paper) if you want to dive deeper into the topic:
Well, that’s it folks… My day job is working to build data warehouse generation tools and teaching our consultants data warehouse concepts. Please ping me if you want more of this content.
These are the other posts that I wrote in the 2016 entry level series:
Data warehousing 101 - what are data warehouses, why do we build them and how? http://www.pontop.dk/#!Data-warehousing-101-what-are-data-warehouses-why-do-we-build-them-and-how/whpr3/56b78f2a0cf2062bd41b6280
Hyper-V 101 - Your own local data center http://www.pontop.dk/#!HyperV-101-Your-own-local-data-center/whpr3/56c78ff30cf25df9372037c6
Database joins 101 - introduction to join algorithms http://www.pontop.dk/#!Database-joins-101-introduction-to-join-algorithms/whpr3/56db1e330cf2bc6add1a719d
Doing arithmetic with nully data http://www.pontop.dk/#!Doing-arithmetic-with-nully-data/whpr3/5710a46d0cf21d11795af0bc