To populate the data warehouse we need to get data from the source systems.
The source system could be an RDBMS, a file-based system (including XML), a
spreadsheet, or something else (a web service, a web page, a document, etc.).
In this article we discuss RDBMS source systems, i.e. the source data is located
in tables on SQL Server, Oracle, Informix, DB2, Access, etc. File-based source
systems and spreadsheets will be discussed in a separate article.
There are 3 main methods to extract a table out of an RDBMS source system into
our data warehouse: get the whole table every time, get it incrementally (only
the changes), or get a fixed period. Each of these 3 basic approaches will be
explained below, one by one.
There are several considerations when choosing which method to use. The first
and foremost is normally the time window, i.e. the data extraction process should
be completed within a certain amount of time. The less time the data extraction
requires, the better it is for the data warehouse system, because it means we can
update the data warehouse more frequently, and it generally means we put less
load on the source system.
The second consideration is practicality, i.e. what is possible and what is not.
Ideally we would like to extract incrementally, but if that is not possible then
we will have to use another method. Or the other way around: ideally we would
like to extract the whole table (just to be sure), but if there are 100 million
records in the table, then we may have to do it incrementally, otherwise the
extract could take a very long time.
You can read Part 2 here.
1. Whole Table Every Time
The simplest way to download a table from the source system into the data warehouse
is to extract all records every time, e.g. "SELECT * FROM table1". Take a store
table, for example. Say a retail group has 500 stores. Each store has a record
on this table containing the store name, address and various attributes such as
store grade or classification, last inspection date and the ID of the store manager.
For 500 records, even if the store table has timestamps for both create and update,
it is probably quicker to get the entire table every time.
If we just do a select without a where clause, the extraction begins instantly.
If we put something on the where clause, such as a timestamp, the extraction process
begins a few moments later (it could be a second later, it could be a few minutes
later, depending on how many records are on the table and whether the columns
used on the where clause are indexed or not). If you are connected to the source
system over a LAN, say 100 Mbps, 500 records should take less than a second.
So, if the delay caused by the where clause is a few seconds, it is quicker
to download the whole table every time.
A quick test using an ETL tool can easily measure how long the delay is.
But as a guideline, for a table that has fewer than 1000 rows, we should definitely
consider getting the whole table every time. If the table has 1000 to 10,000
rows, we should still probably consider getting the whole table every time.
If the table has more than 1 million rows, we should definitely consider downloading
it incrementally (unless the record length is very small, such as 10 bytes).
One sure way of determining whether downloading the whole table every time is
suitable is to do it and measure the time. If that time cannot be accommodated,
then we need to extract the data incrementally.
2. Incremental Extract
If the table contains 2 million records, say it is the customer table, it could
take 15-20 minutes to download the whole table. The numbers given here are only
for illustration, so we can get a feel for it; they are not exact and not scientific,
and could vary significantly, for example if you are using RAID 10 or a Gigabit
network. OK, back to the customer table. If there are timestamp columns for
creation and updates, then it could be a sensible idea to try to download this
table incrementally, which may decrease the download time to, say, 3-5 minutes.
The number of records decreases from 2 million to 2000 (daily changes only) and
the amount of data transferred decreases from 2 GB to 2 MB (assuming the record
length is 1000 bytes).
In OLTP systems, customer and product are probably the biggest master tables.
But transaction tables are normally a lot larger. It does depend on the size
of the company (multinational or local) and the industry (telecommunications
and retail normally have huge tables). A transaction table with 100 million
rows is not uncommon. In the early stages of building the data warehouse, one
of the key questions to answer is whether it is possible to extract the transaction
tables incrementally, because it is often not practical (in terms of the time
window) to extract the whole table every time. This question is so important
that for many projects, if the answer is no, the project cannot proceed.
There are 6 main ways to perform incremental extract:
a. Using timestamp columns
b. Using identity column
c. Using triggers
d. Using transaction date
e. Combination of the above
f. Using off-the-shelf tools
a. Incremental Extract Using Timestamp Columns
If the table is large (more than 100,000 rows), it is always worth finding out
whether the table has creation and update timestamp or datestamp columns, whether
the timestamps are indexed, and, more importantly, whether the timestamps are
updated every time a record is created or updated. The index question is simple
and straightforward: we can easily prove whether a column is indexed or not, and
your source system DBA would know how. Different RDBMS have different implementations
of indexing (some platforms call it a logical file), but they all speed up the
query. Adding the timestamp on the where clause of the select statement can greatly
affect the extraction time, especially for tables with millions of rows. Sadly,
more often than not, the timestamp columns are not indexed. In this case it may
be worth finding out if they could index them for you. But before you put forward
the request, consider the impact on the source system - as you know, the more
indexes there are on a table, the slower the DML operations (insert, update and
delete) become.
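As an illustration, on SQL Server 2005 you could check whether a timestamp column
is indexed with a query against the catalog views, as in the sketch below (the
table and column names are hypothetical; other RDBMS have their own catalog tables):

SELECT i.name AS index_name, c.name AS column_name
FROM sys.indexes i
JOIN sys.index_columns ic
  ON i.object_id = ic.object_id AND i.index_id = ic.index_id
JOIN sys.columns c
  ON ic.object_id = c.object_id AND ic.column_id = c.column_id
WHERE i.object_id = OBJECT_ID('table1')    -- the source table
  AND c.name = 'last_updated_date'         -- the timestamp column
-- if this returns no rows, the column is not part of any index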
The second thing is a little bit more difficult to find out: whether the timestamps
are updated every time a new record is created and every time an existing record
is updated. The supplier of the ERP system may confidently say that the timestamps
are always updated on insert and update throughout the application, but your
internal IT colleagues may issue direct DML statements against the database. "It
is just a normal annual exercise at the beginning of every fiscal year", you may
hear from them. Even if the timestamps are maintained by triggers, we cannot be
100% sure that the timestamps are always kept up to date. Step 4 of the 'normal
annual exercise' could be disabling the triggers on that table!
To be 100% certain, put the table on daily extract based on the timestamps and
leave it running for a few weeks. This period depends on how frequently the table
is updated; if the table is updated daily then a few weeks is a sensible evaluation
period. After this period, compare the table on the data warehouse with the table
on the source system using the primary keys. Find out if there is a "leak",
i.e. records which exist on the source system but cannot be found in your DW.
If you find a leak, check your logic on the date movement, and also check whether
the date window moved when the download process failed. If there is no leak, we
need to check that the updates are working, i.e. that all columns on all rows of
the DW table match those on the source system; we do this using a CRC comparison.
The third thing we need to check is whether there are any deletions on the source
system. This is done using the same means as checking for the "leak" above - in
fact they are both done in the same process: a deletion is identified by finding
records which exist on the data warehouse but not on the source system. If these
3 checks are satisfied, then we can use the timestamp columns for incremental loading.
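A minimal sketch of the leak and deletion checks, assuming the source system's
primary keys have been staged into a table called src_keys next to the warehouse
table (the table and column names are illustrative):

-- "leak": records on the source system that are missing from the DW
SELECT s.pk_col
FROM src_keys s
LEFT JOIN dw_table d ON s.pk_col = d.pk_col
WHERE d.pk_col IS NULL;

-- deletion: records on the DW that no longer exist on the source system
SELECT d.pk_col
FROM dw_table d
LEFT JOIN src_keys s ON d.pk_col = s.pk_col
WHERE s.pk_col IS NULL;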
Once the timestamp columns are tested to our satisfaction, we can use them for
incremental extract. The basic syntax for the select statement is: "SELECT * FROM
table1 WHERE date_col > LSET", where LSET = Last Successful Extract Time. Before
the extract begins, we store the data warehouse system time in a variable. If the
extract is successful, we store this variable in the data warehouse control database
as the LSET. If the extract fails, it is important that we don't store this variable
in the database. The next time the extract runs it will use the stored LSET, so
it will only get the records since the last successful extract.
It is important to set the LSET to the time before the extraction begins, not
when the extraction finishes. And it is important to limit the maximum time we
want to extract to. For example: if the data extraction routine runs daily at
1 am for 10 minutes, then we want to limit the time window for today's data
extraction from 09/03/06 01:00:00 to 10/03/06 01:00:00. We do not want records
created at 10/03/06 01:00:01, because those records are for tomorrow's extract.
Thanks to Steve Wright from C&C group for making me aware of the maximum limit.
So the where clause becomes: "SELECT * FROM table1 WHERE date_col > LSET AND
date_col <= CET", where CET = Current Extract Time, i.e. the time just before
the extraction began.
The reason why we don't want records created or updated after the data extraction
began is that we cannot guarantee whether such a record will be extracted or not.
Say for instance we have a record with a timestamp of 10/03/06 01:05:00. The data
extract runs for 10 minutes, from 10/03/06 01:00:00 to 10/03/06 01:10:00. If our
select statement is "SELECT * FROM table1 WHERE date_col > '10/03/06 01:00:00'"
(without the CET), will we get that record? It depends on the ISOLATION LEVEL, or
CONCURRENCY CONTROL, of the source system. Different RDBMS have different
implementations of isolation level, for example IBM DB2 on z/OS, IBM DB2 UDB on
iSeries, Microsoft SQL Server 2005, Oracle 10g and IBM Informix DS10.
So in the data warehouse control system, for each table we are extracting we have
2 timestamps defining the range of data extraction: Last Successful Extract Time
and Current Extract Time. In the previous example the LSET is 09/03/06 01:00:00
and the CET is 10/03/06 01:00:00. If the data extraction is successful (for this
particular table), then we set LSET = CET, so both of them become 10/03/06 01:00:00.
Tomorrow when the extract begins, the CET will be 11/03/06 01:00:00 and it will
extract from 10/03/06 01:00:00 to 11/03/06 01:00:00. On the other hand, if the
extract fails, we do not set LSET = CET, so the LSET stays at 09/03/06 01:00:00.
Tomorrow when the extract begins, the CET will be 11/03/06 01:00:00 and it will
extract from 09/03/06 01:00:00 to 11/03/06 01:00:00.
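As an illustration, a minimal sketch of this mechanism in SQL Server syntax,
assuming a hypothetical control table etl_control (table_name, LSET, CET) in the
data warehouse control database:

-- read the extraction window for this table into variables
DECLARE @LSET datetime, @CET datetime
SET @CET = GETDATE()   -- current extract time, taken before the extraction begins
SELECT @LSET = LSET FROM etl_control WHERE table_name = 'table1'

-- extract only the records inside the window (this query runs against the source system)
SELECT * FROM table1
WHERE date_col > @LSET AND date_col <= @CET

-- only if the extract succeeded: move the window forward
UPDATE etl_control SET LSET = @CET, CET = @CET WHERE table_name = 'table1'
-- if the extract failed, this update is skipped, so the LSET stays where it was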
If the date column is separated from the time column, for example the date column
contains '28/02/2006' or '28-02-2006' or '2006-02-28' and the time column contains
'11.03.21' or '11:03:21', the algorithm we could use is:
WHERE (date_col > min_date AND date_col < max_date) -- first segment
OR (date_col = min_date AND time_col > min_time) -- second segment
OR (date_col = max_date AND time_col <= max_time) -- third segment
min_date and min_time are determined from LSET whilst max_date and max_time
are determined from CET.
For example: we want to extract records from 07/03/06 01:00:00 to 10/03/06 01:00:00
(because the data extraction routine failed to connect to the source system for
3 days). The first segment above gets records created/updated on 08/03/06 and
09/03/06. The second segment gets records created/updated on 07/03/06 after 1 am.
The third segment gets records created/updated on 10/03/06 before 1 am.
It is very, very important to ensure that we do not update the LSET if the extraction
fails, otherwise we will lose data. I cannot stress this enough. If there is only
one thing you take from this article, I hope it is this: ensure and test that the
LSET is not updated when the extraction fails.
One apparent weakness of incremental extract using timestamps is that we cannot
identify deletions. If the source system uses soft deletion (records to be deleted
are marked), we have nothing to worry about. But if it uses hard deletion (records
to be deleted are physically removed), then we need to use another technique to
identify deletions, for example a trigger or an identity column. Fortunately, most
ERP / business systems do not delete from the main transaction tables, particularly
when they are in header-detail format. Instead, they normally use a status column,
e.g. order status, call status or transaction status.
This is because of the PK-FK relationship between the tables, together with cascade
delete / cascade update rules, more widely known as "cascading referential integrity
constraints". Basically this means: if you want to delete rows from the primary key
table (the parent table, if you like), you must also delete all corresponding rows
from the foreign key table (the child table), either explicitly or by cascading the
delete; otherwise you will break the referential integrity constraint. So deleting
from a header-detail structure means deleting from both tables, which is why
applications tend to think twice before doing it.
When the timestamp columns are tested for reliability (see the paragraph above that
begins with "To be 100% certain ..."), we will know whether there are deletions
involved. It is good to ask the client / system provider whether or not there are
deletions on the transaction table, but we still have to test it.
One last point on using timestamps for incremental extract: sometimes we need to
get the timestamp from other table(s), or from a combination of this table (the
one we want to extract) and another table. For example: order header and order
detail, linked on order number. In the early stages we may think that it is enough
to extract the header table based on the timestamp columns of the header table,
and the detail table based on the timestamp columns of the detail table. But
sometimes we will find that this is not enough. To extract the header table, we
may need to use the timestamp columns from both the header table and the detail
table, and the same applies when extracting the detail table. The code looks like
this:
SELECT DISTINCT h.* FROM order_header h
LEFT JOIN order_detail d
ON h.order_number = d.order_number
WHERE (h.timestamp > LSET AND h.timestamp <= CET)
OR (d.timestamp > LSET AND d.timestamp <= CET)
b. Incremental Extract Using Identity Column
If there is no timestamp column in the transaction table (this is unlikely),
or if the timestamp column is not reliable (this is more likely), then we could
use an identity column to extract the data incrementally. An identity column is
unique. We can test the data to make sure there are no duplicates as follows
(id_col being the identity column):
SELECT id_col, count(*) FROM table1
GROUP BY id_col HAVING count(*) > 1
An identity column is also sequential, which enables us to easily identify new
records. The basic syntax is WHERE id_col > LSEI, where LSEI = Last Successful
Extract Identity, i.e. the value of the identity column of the last record
successfully extracted. Again, just like the time window, we want to limit the
maximum, so it becomes WHERE id_col > LSEI AND id_col <= CEI, where CEI = Current
Extract Identity.
To identify deletions we compare the identity column between the source system
and the data warehouse. This is done by downloading the identity column (all
records) from the source system. In many cases, especially if the table is a
transactional table, the identity column is the primary key, but not always.
Basically we identify the missing ones, i.e. rows that exist in the data warehouse
but no longer exist on the source system, and we then delete (or move/archive)
these rows from our data warehouse.
OK, that's new records and deletions. How about updates? To identify updates
we need to download the columns that we want to load into our data warehouse
(yes, for all records with id_col <= LSEI, unfortunately), and compare these
columns with the records on the data warehouse. If they are different, we update
the records in the data warehouse.
After the source data is downloaded, the basic statement for updating incrementally
is as follows:
UPDATE dw
SET dw.col1 = src.col1, dw.col2 = src.col2
FROM dw_table dw
JOIN stage_table src ON dw.id_col = src.id_col
WHERE dw.col1 <> src.col1 OR dw.col2 <> src.col2
In most cases an archiving or purging system is implemented on the ERP system,
so there are not too many rows in the active transaction table - for example,
keeping only the last 12 months of data on the order table, resulting in say
5 million rows, with a download time of say 30 minutes. Sometimes, however, many
years of records are kept without ever being purged to an archive database,
resulting in (say) 30 million records on the active order detail table. If this
is the case, try to identify whether it is possible to impose a time range for
identifying updates, say the last 6 months of records. For the purpose of
identifying this 6 month date range, if there are no timestamp / datestamp columns
(created_date, last_updated_date, etc), then a transaction date column (order
date, delivery date, etc) will do. This limits the amount of records downloaded
for the update comparison.
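For example, a sketch in SQL Server syntax (the column names are illustrative)
that limits the comparison set to the last 6 months using the transaction date:

-- download only recent records for the update comparison,
-- using the transaction date because there is no reliable timestamp column
SELECT order_id, line_no, product_id, qty, price, order_date
FROM order_detail
WHERE order_date >= DATEADD(month, -6, GETDATE());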
We can also use other mechanisms to identify updates, such as an update trigger,
an update timestamp or log files. The update timestamp does not always need to
be from the same table; it could be from another table, for example in a header-
detail relationship.
If the source system is Oracle, the identity column is probably implemented on
the primary key column using a trigger and CREATE SEQUENCE, something like this:
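(A minimal sketch; the sequence, trigger, table and column names below are illustrative.)

CREATE SEQUENCE order_header_seq START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE TRIGGER order_header_bi
BEFORE INSERT ON order_header
FOR EACH ROW
BEGIN
  -- assign the next sequence value as the primary key
  SELECT order_header_seq.NEXTVAL INTO :NEW.order_id FROM dual;
END;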
In DB2 and SQL Server, identity columns are built in. Note that DB2 also supports
CREATE SEQUENCE. In Informix it is the SERIAL data type.
c. Incremental Extract Using Triggers
Triggers are the most convenient way of doing data extraction; it is the belt
and braces approach. There are 3 types of triggers: for insert, for update and
for delete. Most RDBMS also differentiate between before and after triggers, i.e.
whether the trigger is executed before or after the insert/update/delete. By
creating triggers on the source system tables, we can program the source system
to provide us with the records every time a new record is created, an existing
record is updated, or a record is deleted. We need to be careful when implementing
triggers because they slow down the source system's performance.
One way of implementing triggers in the source system is to keep the primary keys
of the changed rows in a specially created table (let's call this the delta table).
In the delta table we only keep the primary keys, not the whole record, because
different tables have different columns, so it is not possible to create one delta
table that can store rows from all tables. In the delta table, the primary key
values can be stored in one column, using separators, and the names of the primary
key columns are stored in another column, also using separators. The delta table
also contains 2 important columns: the creation timestamp column (there is no need
for an updated timestamp column) and the table name column. These enable us to
get to the right table and extract it incrementally. The delta table looks like
this:
ID | table_name   | PK_columns          | PK_values | time_stamp          |
1  | inventory    | store_id|product_id | 23|5643_G | 17/03/2006 10:10:46 |
2  | order_detail | order_id|line_no    | 454AS1|2  | 17/03/2006 10:11:01 |
3  | customer     | customer_id         | 343       | 17/03/2006 10:11:58 |
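As an illustration, here is a minimal sketch (SQL Server syntax; the table and
column names are hypothetical) of an insert trigger on order_detail that writes
the primary keys into the delta table:

CREATE TRIGGER trg_order_detail_insert ON order_detail
AFTER INSERT
AS
  -- write one delta row per inserted record, storing the PK values with '|' separators
  INSERT INTO delta_table (table_name, PK_columns, PK_values, time_stamp)
  SELECT 'order_detail', 'order_id|line_no',
         CAST(i.order_id AS varchar(20)) + '|' + CAST(i.line_no AS varchar(10)),
         GETDATE()
  FROM inserted i;

Similar triggers would be needed for update and delete (the delete trigger reading
from the deleted pseudo-table instead).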
The delta table needs to be cleared out once it has been processed. When processing
the delta table, it is important to impose a CET (Current Extract Time), i.e. the
system time just before the data extraction begins. So the process is: get all
records from the delta table where timestamp <= CET, then delete all records from
the delta table where timestamp <= CET. Alternatively (and preferably, for
trace/history reasons and also for performance - see the note on having a good
primary key below), we can choose not to clear the delta table after we have
extracted from it, but simply store the CET as the LSET (Last Successful Extract
Time); see the section on timestamps above for details.
If we don't clear the delta table, we need to implement 1) a good integer
identity(1,1) primary key, such as the ID column in the example above - and don't
forget to put a clustered index on it for performance - and 2) a good purging
mechanism (say keeping only the last 3 months of data), otherwise the system will
get slower and slower over time; in this case a non-clustered index on the
time_stamp column is useful. One note on the time_stamp column: if your source
system is SQL Server, I would prefer to make this column a datetime data type
rather than the timestamp data type, for compatibility reasons both with future
versions and with other RDBMS (portability of implementation).
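A minimal sketch of processing the delta table with the LSET/CET approach,
continuing the hypothetical @LSET / @CET variables from the earlier control
table sketch:

-- get the changes recorded since the last successful run
SELECT table_name, PK_columns, PK_values
FROM delta_table
WHERE time_stamp > @LSET AND time_stamp <= @CET;

-- only if we are clearing the delta table instead of keeping history:
-- DELETE FROM delta_table WHERE time_stamp <= @CET;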
Depending on the structure and complexity of the source system, sometimes it is
better to have a separate delta table for each table on the source system, rather
than a single delta table. It is also common to have the delta tables extracted
to files at certain times of the day, ready for the data warehouse to pick up.
The main difference between doing data extraction using triggers and the 2 methods
mentioned previously is that if we use triggers, we "touch" the source system. The
word touch may be a little too soft; the word "change" is probably more appropriate.
Yes, we change the source system. This is not a luxury that everyone has. In many
cases the source system is an off-the-shelf ERP system, e.g. SAP, Oracle 11i, Sage,
MFG, JD Edwards, PeopleSoft, Axapta, Siebel CRM, SalesForce, etc. In these cases,
most probably, we do not have the luxury of modifying the source system as we like.
If we ask the supplier to modify it for us, we risk ending up in a position where
our ERP is no longer a standard version, and therefore a) not upgradable to the
next version, and b) not easily supported by other software when interfacing.
Luckily, the big players in the ERP market such as SAP and Oracle have already
prepared themselves and have their own data warehousing tools. Tools such as Oracle
Change Data Capture can extract data incrementally, implemented either using
triggers (synchronous) or using the redo log file (asynchronous).
This is the end of part 1. In part 2 we will discuss:
2d. Incremental Extract Using Transaction Date
2e. Incremental Extract Using a Combination of the Above
2f. Incremental Extract Using Off-The-Shelf Tools
3. Fixed Period Extract
Vincent Rainardi
23/03/06
You can read Part 2 here.