Data warehousing 101 - what are data warehouses, why do we build them and how?
This is the first post in a series of twelve in Tim Fords Entry-Level Content Challenge.
In this post I will describe what traditional data warehouse is, why we build them and some typical architectural choices in a data warehouse
What is a data warehouse?
A data warehouse is a database, where data from various line-of-business systems is copied, cleaned, enriched, integrated and transformed into information in a form, which makes it easy for a business user to answer questions she need answers to.
Why build a data warehouse?
When we first learn about databases and design theory, a lot of emphasis is put on normalization in order to remove redundancy. So why on earth do we replicate data to a data warehouse and not just report directly on the databases underneath our line-of-business systems? There are several arguments for this:
Reporting queries can hurt performance on the line-of-business system(s)
Databases underneath line-of-business system might not implement data history
The datamodels underneath line-of-business systems are designed in a way, which supports On Line Transaction Processing (OLTP) types of interactions (this is where we talk about normalization, third or higher normal forms, BCNF normal form and no redundancy). These models are hard for business users to understand
To illustrate why we typically have different layers in a data warehouse architecture, let’s consider a health care organization “Health, Inc.” with a central administration and a number of hospitals. Each hospital has a number of clinics. Health Inc. use a central Electronic Patient Records it-system to store data about patients (EPR), as well as a number of line-of-business systems for radiology (RADIOLOGY), blood sample tests (BLOOD) and tissue sample tests (TISSUE).
Data warehouse architecture version 1 - Naïve attempt
Central administration wants to do reporting on the EPR to support an initiative for better usage of resources. They get access to the EPR database and use Excel/Crystal reports to create reports.
During the project, they experience
It was very hard to understand the EPR datamodel, and even the IT staff didn’t have documentation on in which tables data was stored and how it was related
Making new reports required a lot of understanding of the EPR data model
When they run the reports during working hours, clinical staff (the users of the EPR) complain about very long response times
Data warehouse architecture version 2 – adding a staging layer
The problem with EPR performance for the clinical staff during report generation comes because the two uses of data have two very different query behaviors
clinical staff use of EPR is typically about looking up a patient and editing the record for that patient. This is a typical OLTP query pattern – many, small, independent transactions
Central administration reports use of EPR data is typically about reading all of patient records and do calculations on them. This is a typical data warehouse query pattern – few, big transactions on full table content
And these two query behaviors compete for resources on the EPR database. Therefore, it is very common in a data warehouse architecture to introduce a second database (server) for the data warehouse, and copy the source system data to a so-called data staging layer (DSA) on this new database. We can now copy data from EPR to this server during low peek time, and do our reporting queries here.
Data warehouse architecture version 3 – make a dimensional model
In phase one of the BI project, Health, Inc learned that it was very hard to understand the EPR data model and it required a lot of understanding of the EPR data model to create new reports. Having a DSA layer only helped solving the performance problems, it didn’t help us out on the data model problems. Matters get worse when we start phase two of the BI project – combine EPR data with blood sample test data from the BLOOD system. We now both need to understand the new data model from the BLOOD system, and also make integration queries between data from EPR and BLOOD.
One solution to this mess is to create a completely new data model in a new data warehouse layer. We call this the Enterprise Data Warehouse (EDW). By introducing the EDW layer, we can isolate the data integration problems in the border between the DSA and EDW layers.
When we introduce a new data model, we also choose to make this model more understandable to our business users. A commonly recognized modelling technique is called dimensional modelling. A dimensional model lends itself nicely towards a pivot-table kind of reporting, which is much easier to work with in Online Analytical Processing (OLAP) scenarios.
We sometimes also call the EDW, the single version of the truth (this can be very difficult to achieve across an organisation).
Data warehouse architecture version 4 – data marts for different reporting purposes
In phase three of the BI projects, Health Inc. wants to roll out the BI solution to the different hospitals and also to selected clinics within hospitals. As part of the roll out, some types of reporting needs to be changed going down in the organizational hierarchy, and some consideration must be taken in order for controlling the right level of data granularity for organizational units. Maybe the central administration needs only to see data aggregated to the hospital level, whereas clinics need to see their own data detailed, but aggregated for their peers. We could put all this business logic in the presentation tool. But then we also limit ourselves to one presentation type of visualization tool, maybe we would like to pick different visualization tools for different roles. An architectural solution to this is to establish a data mart layer (DM) in the data warehouse, and encapsulate all this logic in the mappings between the EDW and the DM layers:
One size fits all?
So, this is one possible architecture to use for your data warehouse project. Is it needed for every type of BI project out there? Probably not. If you just need stove-pipe reporting on your ERP system, go with a simpler architecture. Do we need to persist data in all these layers? No, not necessarily. You might use database views for objects in some layers, especially data marts can be implemented using views. It seems complicated, doesn’t it make my data warehouse/BI project overly complicated and expensive to use an architecture like this? No, not necessarily. You could use data warehouse automation tools/frameworks like Effektor or Biml to establish this architecture. More on data warehouse automation some other time.