Introduction
This article is a primer for anyone who has not yet set up a data warehouse to read data directly from a relational source. Seasoned data warehouse experts can safely move on to the next article.
It would never have occurred to me to write an article on this topic until I worked at a company whose data warehouse was loaded entirely from COBOL files extracted from an old mainframe system. The project introduced a new relational source, and the in-house team found it a novel idea to pull directly from a source that offers more options than a flat file. In this situation, the opportunity arises to read only the changes since the last data pull rather than taking an entire dump in files. This reduces both network traffic and load on the source system, sometimes quite significantly. Both of these bonuses appeal to their respective owners.
How to go about this may not be immediately obvious to those not yet jaded by data warehouse work. This article walks through a basic example; enough to get the wheels spinning, so to speak. A lot can be added to the example for additional auditing and metrics-trending purposes.
The Situation
The idea is to read from a source that accepts a relational query of the general form "select from source where the record's modified (or created) date is between the prior run date and now." The data warehouse then keeps track of the date and time of each read; the next pull uses the former "now" date as the "prior run date" and the current time as the new "now" date.
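In SQL terms, every incremental pull has roughly this shape (a sketch only; SOME_SOURCE_TABLE and the variable values are placeholders, not part of the example built below):

-- General shape only; SOME_SOURCE_TABLE is a placeholder name.
DECLARE @PRIOR_RUN DATETIME = '2024-01-01';  -- the last pull's "now"
DECLARE @NOW       DATETIME = GETDATE();     -- this pull's cutoff

SELECT *
FROM SOME_SOURCE_TABLE
WHERE MODIFY_DATE >  @PRIOR_RUN   -- everything changed since the last pull...
  AND MODIFY_DATE <= @NOW;        -- ...up to and including the cutoff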
Obviously, this requires that the source provide a date or timestamp per record, created and updated along with the contents of each record. When a record is inserted, it gains a field indicating the date and time of insertion; when the same record is modified, that field holds the date and time of the most recent update. A single field can be dual-purposed this way: upon insertion it is the create date, and each subsequent update overwrites it with the update time. Fortunately, most relational systems these days either already have such a field or it is not prohibitive to have one added. If this is not the situation with your data warehouse's source systems, sorry, you are outside this article's target demographic.
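Where one must be added, a common approach is a default to stamp inserts plus an update trigger. A minimal sketch, assuming a SQL Server source; SRC_TABLE and its ID key are hypothetical names:

-- Sketch only: SRC_TABLE and its ID key are hypothetical.
-- Stamp new rows with the insertion time.
ALTER TABLE SRC_TABLE
    ADD CONSTRAINT DF_SRC_TABLE_MODIFY_DATE DEFAULT GETDATE() FOR MODIFY_DATE;
GO
-- Re-stamp rows whenever they are updated.
CREATE TRIGGER TR_SRC_TABLE_MODIFY_DATE ON SRC_TABLE
AFTER UPDATE
AS
BEGIN
    UPDATE T
    SET MODIFY_DATE = GETDATE()
    FROM SRC_TABLE T
    INNER JOIN inserted I ON T.ID = I.ID;
END;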
The Example
Now let's get to the actual example. We will need some test tables set up. First, we need a source from which our data warehouse pulls data. Let's imagine a simple sales system, with offices and sales per office:
CREATE TABLE SALES_OFFICE(
     ID INT PRIMARY KEY
    ,OFFICE_NAME NVARCHAR(100)
    ,MODIFY_DATE DATETIME
);

CREATE TABLE INVOICE(
     INVOICE_ID INT PRIMARY KEY
    ,OFFICE_ID INT REFERENCES SALES_OFFICE
    ,TOTAL_SALE DECIMAL(6,2)
    ,MODIFY_DATE DATETIME
);
The foreign key reference may seem excessive for such a small example but it will illustrate an important feature later.
Let's populate these with a few rows of data to get started:
INSERT INTO SALES_OFFICE(ID, OFFICE_NAME, MODIFY_DATE)
VALUES (1,'SOUTH',GETDATE()-10);

INSERT INTO INVOICE(INVOICE_ID, OFFICE_ID, TOTAL_SALE, MODIFY_DATE)
VALUES (1001,1,25.00,GETDATE()-5);

INSERT INTO INVOICE(INVOICE_ID, OFFICE_ID, TOTAL_SALE, MODIFY_DATE)
VALUES (1002,1,35.00,GETDATE()-1);
Now we have one office whose salespersons have made two sales. The modify dates in this example are overloaded so as to be both create date and modify date. The sample records are set some days back in time just to emphasize they exist well before the data warehouse comes along for the first time.
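As a quick sanity check, you can eyeball what we just seeded:

SELECT * FROM SALES_OFFICE;
SELECT * FROM INVOICE;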
We need a data warehouse into which the source records are loaded. Here is a simple fact and dimension based around our sales:
CREATE TABLE DW_DIM_OFFICES(
     OFFICE_ID INT PRIMARY KEY
    ,OFFICE_NAME NVARCHAR(100)
);

CREATE TABLE DW_FACT_SALES(
     OFFICE_ID INT REFERENCES DW_DIM_OFFICES
    ,SALES_AMT DECIMAL(6,2)
    ,SALE_DATE DATETIME
);
Our simple system will assume the modify date is the sale date to avoid cluttering up the example with extra fields. Either that or the data modeller of our source has some explaining to do! Also, we have a foreign key reference whose importance will be explained later. Bonus points if you can guess what part it plays in our example of using modify dates when reading from a relational source.
We have our source and our data warehouse; all that is lacking is a place to track the modify dates. Here is a simple date tracking table for our project:
CREATE TABLE LOAD_RANGE_TRACKING(
     SOURCE_SYSTEM_NAME NVARCHAR(50)
    ,BEGIN_RANGE DATETIME
    ,END_RANGE DATETIME
);
This table serves all the basics. It need contain nothing when we use it the first time, provided we script carefully. More elaborate versions often make their way into production systems: the table can be expanded with runtimes, row counts, and other metrics for a variety of auditing and job-runtime trending purposes.
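An expanded version might look something like the sketch below; these column choices are one illustrative possibility, and the example that follows uses only the three original columns:

-- One illustrative expansion; not used in the example below.
CREATE TABLE LOAD_RANGE_TRACKING_EXPANDED(
     SOURCE_SYSTEM_NAME NVARCHAR(50)
    ,BEGIN_RANGE DATETIME
    ,END_RANGE DATETIME
    ,LOAD_START_TIME DATETIME   -- when the ETL run began
    ,LOAD_END_TIME DATETIME     -- when it finished
    ,ROWS_READ INT              -- rows pulled from the source
    ,ROWS_LOADED INT            -- rows landed in the warehouse
);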
Queries to our source will use the values from BEGIN_RANGE and END_RANGE in the WHERE clause to bracket the source's MODIFY_DATE field. The following script does the work in this example, but in the real world you hopefully have a proper ETL tool such as SSIS, Informatica, Ab Initio, etc., to perform it instead of a script.
The System
An explanation of the script's workings; please follow along in the script below.
Rows 1 - 3: The first part simply sets up a couple of variables so we can hold and re-use the BEGIN_RANGE and END_RANGE values.
Rows 4 - 12: The script checks to see if a record for our sales system, called "OUR POS", exists in the tracking table. If it does not, an initial record is entered. Bonus points if you know why I've chosen the initial read range start as 1/1/1754. If a record already exists, the script moves the current END_RANGE date to BEGIN_RANGE and uses the current system time as the new END_RANGE.
Row 13: The BEGIN_RANGE and END_RANGE dates thus determined, the script loads the sales office data into the data warehouse dimension table.
Rows 14-16: Now, a surprise! A user at the source system inserts data in the midst of our data warehouse load process. In the old days with a flat file, this would not be a concern, but now we are only halfway through reading the source tables and will find ourselves with an orphaned fact table record if we proceed! Or will we? No; this is where a powerful hidden feature of this system is revealed. By "freezing" our load range per source system (did you notice the load range table stores the dates in a single row for the source, not per table?) we will get neither the orphaned records from the new sales office nor the new sale for the pre-existing office.
Rows 17-18: The fact table loads the rows as of the pre-selected date range.
The Script
--Rows 1-3
BEGIN
DECLARE @BEGIN_RANGE DATETIME;
DECLARE @END_RANGE DATETIME;

--Rows 4-12
IF NOT EXISTS (SELECT SOURCE_SYSTEM_NAME FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS')
BEGIN
    INSERT INTO LOAD_RANGE_TRACKING(SOURCE_SYSTEM_NAME,BEGIN_RANGE,END_RANGE)
    VALUES ('OUR POS','1/1/1754',GETDATE());
    SET @BEGIN_RANGE = (SELECT BEGIN_RANGE FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS');
    SET @END_RANGE = (SELECT END_RANGE FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS');
END
ELSE
BEGIN
    UPDATE LOAD_RANGE_TRACKING
    SET BEGIN_RANGE = END_RANGE
       ,END_RANGE = GETDATE()
    WHERE SOURCE_SYSTEM_NAME = 'OUR POS';
    SET @BEGIN_RANGE = (SELECT BEGIN_RANGE FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS');
    SET @END_RANGE = (SELECT END_RANGE FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS');
END

--Row 13
INSERT INTO DW_DIM_OFFICES(OFFICE_ID,OFFICE_NAME)
SELECT ID,OFFICE_NAME
FROM SALES_OFFICE
WHERE MODIFY_DATE > @BEGIN_RANGE AND MODIFY_DATE <= @END_RANGE;

--Rows 14-16
INSERT INTO INVOICE(INVOICE_ID,OFFICE_ID,TOTAL_SALE,MODIFY_DATE) VALUES (1003,1,45.00,GETDATE());
INSERT INTO SALES_OFFICE(ID,OFFICE_NAME,MODIFY_DATE) VALUES (2,'NORTH',GETDATE());
INSERT INTO INVOICE(INVOICE_ID,OFFICE_ID,TOTAL_SALE,MODIFY_DATE) VALUES (1004,2,55.00,GETDATE());

--Rows 17-18
INSERT INTO DW_FACT_SALES(OFFICE_ID,SALES_AMT,SALE_DATE)
SELECT OFFICE_ID, TOTAL_SALE, MODIFY_DATE
FROM INVOICE
WHERE MODIFY_DATE > @BEGIN_RANGE AND MODIFY_DATE <= @END_RANGE;
END;
The Results
Run the script all at once, then let's see what ended up in the various tables involved. The load range tracking table holds the dates that formed the predicate used when reading the source:
SELECT * FROM LOAD_RANGE_TRACKING;
The sales system still has all of its data: both what was read into the warehouse and the records inserted midway through our read process:
SELECT * FROM SALES_OFFICE A INNER JOIN INVOICE B ON A.ID = B.OFFICE_ID;
The data warehouse is able to present a nice report with no orphaned data, complete with the time as of which the data came from the source. Business users appreciate this not-so-minor detail!
SELECT B.OFFICE_NAME
    ,SUM(A.SALES_AMT) AS TOTAL_SALES
    ,'Report data current as of ' + CONVERT(NVARCHAR(19),
        (SELECT END_RANGE
         FROM LOAD_RANGE_TRACKING AS DATE_OF_DATA
         WHERE SOURCE_SYSTEM_NAME = 'OUR POS')) AS DATA_DATE
FROM DW_FACT_SALES A
INNER JOIN DW_DIM_OFFICES B ON A.OFFICE_ID = B.OFFICE_ID
GROUP BY B.OFFICE_NAME;
The Followup
Now if you run the script again without the midway source insertions, the inserts to the data warehouse tables will be only the records inserted to the source since our very start. Obviously, with this handful of rows the difference doesn't matter, but imagine a busy source, perhaps one connected to a web front end, that contains hundreds of millions of records in its tables and collects hundreds of thousands more between data warehouse loads. By reading only the modified records, this simple scheme saves both the source RDBMS and our internal network a lot of overhead.
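One practical note: for the source to see that benefit, the bracketing column must be cheap to search. Assuming the source owner permits it, an index on MODIFY_DATE keeps each incremental read from scanning the whole table; a sketch against our sample tables:

-- Illustrative: index the bracketing column on each incrementally read table.
CREATE INDEX IX_SALES_OFFICE_MODIFY_DATE ON SALES_OFFICE(MODIFY_DATE);
CREATE INDEX IX_INVOICE_MODIFY_DATE ON INVOICE(MODIFY_DATE);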
Here is the script without the midway source insertions.
BEGIN
DECLARE @BEGIN_RANGE DATETIME;
DECLARE @END_RANGE DATETIME;

IF NOT EXISTS (SELECT SOURCE_SYSTEM_NAME FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS')
BEGIN
    INSERT INTO LOAD_RANGE_TRACKING(SOURCE_SYSTEM_NAME,BEGIN_RANGE,END_RANGE)
    VALUES ('OUR POS','1/1/1754',GETDATE());
    SET @BEGIN_RANGE = (SELECT BEGIN_RANGE FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS');
    SET @END_RANGE = (SELECT END_RANGE FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS');
END
ELSE
BEGIN
    UPDATE LOAD_RANGE_TRACKING
    SET BEGIN_RANGE = END_RANGE
       ,END_RANGE = GETDATE()
    WHERE SOURCE_SYSTEM_NAME = 'OUR POS';
    SET @BEGIN_RANGE = (SELECT BEGIN_RANGE FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS');
    SET @END_RANGE = (SELECT END_RANGE FROM LOAD_RANGE_TRACKING WHERE SOURCE_SYSTEM_NAME = 'OUR POS');
END

INSERT INTO DW_DIM_OFFICES(OFFICE_ID,OFFICE_NAME)
SELECT ID,OFFICE_NAME
FROM SALES_OFFICE
WHERE MODIFY_DATE > @BEGIN_RANGE AND MODIFY_DATE <= @END_RANGE;

INSERT INTO DW_FACT_SALES(OFFICE_ID,SALES_AMT,SALE_DATE)
SELECT OFFICE_ID, TOTAL_SALE, MODIFY_DATE
FROM INVOICE
WHERE MODIFY_DATE > @BEGIN_RANGE AND MODIFY_DATE <= @END_RANGE;
END;
Re-run the selects for the load range tracking table and note the new values. Re-run the data warehouse report script and see both sales offices and their new totals and data as-of dates.
The Conclusion and Additional Thoughts
This is a simple example of how to use a relational source's ability to provide record modify dates, and a way to predicate reading on those dates. There are several problems to be aware of. We avoided reading into the data warehouse a child-level record inserted along with its parent on the source. However, what if an existing sale record were modified during our load, pushing its MODIFY_DATE past the END_RANGE we had already captured? Our data warehouse would then have a parent record with no corresponding child for this run. From a foreign key point of view, the data would load successfully, but this may cause problems for our report! A good ETL tool can submit parallel read requests for all the needed tables at almost the same instant, cutting this risk down quite a bit.
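A post-load sanity check can at least detect the childless-parent condition; a minimal sketch against our example tables:

-- Offices in the dimension with no fact rows: legal under the foreign key,
-- but a potential report problem.
SELECT D.OFFICE_ID, D.OFFICE_NAME
FROM DW_DIM_OFFICES D
LEFT JOIN DW_FACT_SALES F ON D.OFFICE_ID = F.OFFICE_ID
WHERE F.OFFICE_ID IS NULL;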
Other problems arise around updates and deletes; too many to go into here. The best way to avoid all these concerns is to have the source use a change data capture (CDC) system. SSC already has some good articles about CDC that you may want to read if this is an option.
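For reference, enabling CDC on a SQL Server source (assuming an edition that supports it) is only a couple of calls; those articles cover the details:

-- Illustrative: enable CDC for the current database, then for a table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo'
    ,@source_name   = N'INVOICE'
    ,@role_name     = NULL;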
If CDC is not an option, tread carefully and keep insert/update/delete timing for parent/child tables along with your ETL system's capabilities in mind. As always, test carefully and make sure downstream users are aware of the risks.
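One concrete example of what "tread carefully" means for updates: the plain INSERT used above will eventually re-read a modified row the warehouse already holds and collide with the dimension's primary key. A sketch of one way to tolerate re-reads, using MERGE in place of the dimension INSERT (the @BEGIN_RANGE and @END_RANGE variables come from the range logic shown earlier; this is one approach among several, not the only one):

-- Sketch: a dimension load that updates re-read rows instead of duplicating them.
MERGE DW_DIM_OFFICES AS TGT
USING (
    SELECT ID, OFFICE_NAME
    FROM SALES_OFFICE
    WHERE MODIFY_DATE > @BEGIN_RANGE AND MODIFY_DATE <= @END_RANGE
) AS SRC
    ON TGT.OFFICE_ID = SRC.ID
WHEN MATCHED THEN
    UPDATE SET OFFICE_NAME = SRC.OFFICE_NAME
WHEN NOT MATCHED THEN
    INSERT (OFFICE_ID, OFFICE_NAME) VALUES (SRC.ID, SRC.OFFICE_NAME);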