Loading data into a data warehouse should not be that difficult, right? A data
warehouse consists of database tables, so, to put it simply, all we need to do
is put some records into these tables, right? Well, that is generally the
idea, but there are several things that we need to consider as well.
Some of the points we may want to consider are listed below, and we will
discuss them one by one.
- When And How To Load
  - When not used, restrict access
  - Flipping
  - After disconnect from source
  - Many ways and source types
  - Bulk upload
  - Incremental
- The Stage
  - Database or files
  - Timestamp column
  - Source system column
- Data Firewall
  - Data Firewall / Data Quality / Information Quality
  - Bad/good data
  - Reject/correct/allow
  - Allowing bad data
  - Reporting and correction
  - Static values and statistical comparison
  - No data goes into DW unless …
  - If the data is crap the DW is useless.
- ODS
  - Operational data store or source
  - Key management
  - ODS firewall
- Loading Dimension Tables
  - Unknown records
  - Date and time of day dimension
  - Fiscal dates
  - Natural keys
  - Surrogate key management
  - Referential integrity
  - Static dimensions: are they really?
  - Nobody touches the DW data directly
  - Mr. DWA not DBA
  - Upsert operation: by example.
  - Deletion
  - DDS dimension
  - Overlay: more than 1 source
  - SCD 2: validity range, active flag, insert and update
  - SCD 3: shifting the columns
- Loading fact tables
  - Key mgt: slim dim table – customer
  - Natural key changes: examples
  - Unknown records
  - Upsert
  - Deletion
  - Source PK: if source allows dup? Dedup
  - Index: take them off (10% rule)
  - Internal (really?) DW entity: targets
  - Transactional and snapshot fact table
  - Purging/pruning: parameterize
  - Reloading history
- Logging: audit
  - Number of rows, timestamp: for performance analysis
  - Log the unknown
  - DQ log/audit
- Control: failure and restart
  - Invocation and cycle dependency
  - Stage ETL invokes DW ETL: sync – minimise time lost
  - Two independent cycles – flexibility
  - Restart modes: from point of failure or beginning, dates
  - Simulation: power failure, source connectivity
  - Scenario: ERP down, backup overrun
- Restoring DW
  - Real life: Disk failure – RAID
  - Save stage
  - Backup to disk for 3 days
  - Tape > 3 days: test restore!
  - Offsite
- Multidimensional Database
  - Reprocess dimensions
  - Reprocess cubes - incremental
- EAI and EII
Looking at the items above, this is going to be a very long article, so I have split
it into several parts. Please bear with me as we go through it. If you have any comments
or questions, or want to discuss anything with me, just click the link above
and put them in the forum; I will be happy to reply or discuss them.
1. When And How To Load
Ideally we want to load the data warehouse when it is not being used, i.e. when
the users are not running queries or reports against it. This minimises the
performance degradation experienced by the users: their queries may run
significantly slower than normal because we are loading the warehouse. Sometimes
it also prevents anomalies or errors, i.e. users may see data inconsistencies
during the loading time. Because of this it is sometimes necessary to prevent
users from accessing the data warehouse while we are populating it.
There is a technique that enables a data warehouse to be used when we are populating
it. Basically we have 2 copies of the data warehouse tables. When we are populating
copy A, all user queries are directed to copy B. Then when we finish populating
copy A, all user queries are redirected to copy A and we populate copy B. This
technique is quite popular in custom data warehouse development and it is
used in some off-the-shelf data warehouse applications such as E.piphany CRM
(now acquired by SSA). There is one caveat when using this technique, apart
from the obvious requirement of extra disk space: because the table names vary
we need to use dynamic SQL, which has a small impact on user query performance
as well as on the loading performance.
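Just to illustrate the flip itself, here is a minimal sketch in SQL Server, assuming
that user queries read the fact table through a synonym and that FactSales_A and
FactSales_B are the two copies (all the names here are made up):

    -- Load the inactive copy (B) while user queries read the active copy (A) via the synonym
    TRUNCATE TABLE dbo.FactSales_B;
    INSERT INTO dbo.FactSales_B (date_key, customer_key, amount)
    SELECT date_key, customer_key, amount
    FROM stage.FactSales;

    -- Flip: repoint the synonym so that user queries now read the freshly loaded copy
    DROP SYNONYM dbo.FactSales;
    CREATE SYNONYM dbo.FactSales FOR dbo.FactSales_B;

The next load does the same thing against copy A and flips the synonym back.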
To minimise the load on the source systems, we want to load our data warehouse
after we have disconnected from the source. So we get the data out from the
source systems as quickly as we possibly can into stage, then disconnect from
the source systems, then start loading into the warehouse. If we do the loading
at the same time as doing the extracting, the overall elapsed time could be
shorter but the load on the source systems could be higher.
There are many ways to load a data warehouse. We can load from a message queue,
Changed Data Capture (CDC) files, web services or Enterprise Application Integration
(EAI) server events. If we use ETL, to get good loading performance we ideally
want to populate the data warehouse using bulk upload. In SQL Server 2005
we want to use the raw file format, which will give us the best performance. Also,
ideally we want to populate the data warehouse incrementally, i.e. load only the
data which has changed since the last extraction.
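As a rough sketch of incremental loading, assuming the source table has a
last_updated column and we keep the high-water mark of the previous extraction in
a control table (the control table, the column names and the linked-server style
access to the source are all assumptions):

    -- Get the timestamp of the last successful extraction from a hypothetical control table
    DECLARE @last_extract datetime;
    SELECT @last_extract = last_extract_time
    FROM   etl.extract_control
    WHERE  table_name = 'order_header';

    -- Pull only the rows changed since then into the stage
    INSERT INTO stage.order_header (order_id, customer_id, order_date, amount, last_updated)
    SELECT order_id, customer_id, order_date, amount, last_updated
    FROM   source.order_header
    WHERE  last_updated > @last_extract;

    -- Move the high-water mark forward for the next extraction
    UPDATE etl.extract_control
    SET    last_extract_time = (SELECT MAX(last_updated) FROM stage.order_header)
    WHERE  table_name = 'order_header';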
2. The Stage
If your stage is a database rather than a file system, you would probably want
to add a few columns to each of your stage tables. Columns such as a time stamp
(the time when the record was loaded into stage) and a source system ID (which
source system this data came from) could be quite useful when loading
the stage data into the data warehouse.
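For example, a stage table might look like this (the table and columns are just
illustrative):

    -- Hypothetical stage table with the two extra housekeeping columns
    CREATE TABLE stage.order_header
    ( order_id          int      NOT NULL,
      customer_id       int      NOT NULL,
      order_date        datetime NULL,
      amount            money    NULL,
      last_updated      datetime NULL,                       -- as delivered by the source
      load_timestamp    datetime NOT NULL DEFAULT GETDATE(), -- when the record was loaded into stage
      source_system_id  int      NULL                        -- which source system the record came from
    );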
General considerations about the benefits and the disadvantages of having a
stage are discussed in this article.
3. Data Firewall
Also known as data quality or information quality. Some people would argue
about the differences between these 3 terms but that discussion is for another
article. For now all we need to know is that when loading data into the data warehouse,
we need to have some kind of data filter: bad data goes into the quarantine area,
good data goes into the warehouse. That is why this filter is called a data
firewall, because of its similarity to IP filtering.
There are 3 possible things we can do to bad data: reject it, correct it, or
allow it through without correction. Rejecting it means we don’t put it
into the warehouse. Correction means modifying the data based on certain predefined
rules, and then putting it into the warehouse. The third one is to allow the bad data
into the warehouse. In all three cases we need to log the bad data.
Allowing bad data into the warehouse applies when the bad data is within
the tolerance limit; it is treated as a warning rather than an error. For example, the
normal value is between 5000 and 6000. If the value is between 4000 and 5000
or between 6000 and 7000, it is still acceptable so we put it in the warehouse.
If the value is less than 4000 or more than 7000 we reject it. If the value
is 5500 (good data) we allow it through and we don’t log it but if the
value is 6500 (bad data but within tolerance) we allow it through and we log
it.
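A minimal sketch of such a rule in SQL, using the thresholds above (the table and
column names are made up):

    -- Classify an incoming value as good (allow), warning (allow but log) or reject (quarantine and log)
    SELECT order_id,
           amount,
           CASE
             WHEN amount >= 5000 AND amount <= 6000 THEN 'good'
             WHEN amount >= 4000 AND amount <= 7000 THEN 'warning'
             ELSE 'reject'
           END AS dq_status
    FROM stage.order_header;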
An example of correction that we do automatically to the data before allowing
it into the warehouse is expanding abbreviations (such as replacing “Herts”
with “Hertfordshire” and “Adv.” or “Adv”
with “Advanced”) and upper/lower case conversion (such as replacing
“JOHN CUNNINGHAM” or “john Cunningham” or “John CUnningham”
with “John Cunningham”). Another example is changing “MR”
and “mr” to “Mr”. You may have heard people using the
term scrubbing or data scrubbing. Well, this is it; the above are real examples of
data scrubbing. It may be boring and not sexy, but keeping the data tidy and
neat is at the core of data quality.
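A simple sketch of this kind of scrubbing, assuming a hypothetical abbreviation
mapping table and a hypothetical stage.customer table:

    -- Expand known abbreviations using a mapping table, e.g. 'Herts' -> 'Hertfordshire'
    UPDATE s
    SET    s.county = m.full_value
    FROM   stage.customer s
    JOIN   dq.abbreviation_map m
      ON   UPPER(s.county) = UPPER(m.abbreviation);

    -- Standardise the case of the title, e.g. 'MR' or 'mr' -> 'Mr'
    UPDATE stage.customer
    SET    title = UPPER(LEFT(title, 1)) + LOWER(SUBSTRING(title, 2, LEN(title)))
    WHERE  title IS NOT NULL;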
The quarantine area contains rejected data, which is then processed (say
daily) and reported to the individual data owners, who are responsible
for making sure that these errors are corrected in the source system. Next
time the data is extracted from the source system, the ETL will pull the correct
data into the warehouse.
The easiest type of data firewall rule is a comparison to a static number
or a range, such as the example above. But in the data firewall we can also
do a statistical comparison, i.e. if the value is within the last 90 days’ average
plus or minus 10%, allow it through, otherwise reject the data. We can also
compare the incoming data with the data that is already in the warehouse. We
could also compare one part of the incoming data with another part of the incoming
data, for example in the case of a header and a detail table. But we have to be
careful here: the other part that we are referencing may not be correct
or complete, especially if the two are in different extraction cycles.
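As an illustration, the 90-day average check could be sketched like this (the fact
table, the stage table and the columns are assumptions):

    -- Allow the incoming value only if it is within 10% of the last 90 days' average
    DECLARE @avg_amount money;
    SELECT @avg_amount = AVG(amount)
    FROM   dw.fact_sales
    WHERE  order_date >= DATEADD(day, -90, GETDATE());

    SELECT order_id,
           amount,
           CASE WHEN amount BETWEEN @avg_amount * 0.9 AND @avg_amount * 1.1
                THEN 'allow' ELSE 'reject'
           END AS dq_status
    FROM   stage.order_header;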
By far the most common type of firewall rule is the database integrity rule, such
as null/not null, foreign key / orphan checking, and character length. Why
would we want to check database rules if they are enforced at the database level anyway?
Because if it is done in the DQ system, it will be reported and corrected. And
we can prevent the bad data from entering the warehouse, or correct the data
and allow it through. A DQ system with its data firewall rules is much more flexible
than the database’s referential integrity mechanisms. And it is all logged, so
it is auditable later on at any time.
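For example, an orphan check done in the DQ system rather than left to the database
could look like this sketch (the names are illustrative):

    -- Find incoming rows whose customer does not exist in the customer dimension;
    -- these rows go to the quarantine area and are logged
    SELECT s.order_id, s.customer_id
    FROM   stage.order_header s
    LEFT JOIN dw.dim_customer d
      ON   s.customer_id = d.customer_natural_key
    WHERE  d.customer_key IS NULL;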
It really pays to spend the time to build good, solid firewall rules, because
at the end of the day, if the quality of the data in the warehouse is crap, then
the data warehouse is useless. It is very important to ensure that the only
way to load data into the warehouse is through the data firewall. No data
should go into the warehouse without going through the data firewall. I can’t
stress this enough. If there is only one thing that we learn today from
this article, I hope it is data quality: never allow data into the warehouse
unless it goes through the firewall.
4. Operational Data Store (ODS)
Some people mix up the operational data store (ODS) with the operational data source.
The former is part of the data warehouse and the latter is the source system(s).
The ODS is the normalised database between the stage and the dimensional store.
The general advantages and disadvantages of having an ODS have been discussed in
this article.
With regard to data warehouse loading, one advantage of having an ODS is that it makes
the key management in the dimensional store easier. The second advantage is
that it enables us to have an ODS firewall, i.e. a set of data quality rules which
check the incoming data from stage against the data in the ODS. It is easier
and faster to do these checks against the ODS than against the dimensional store,
because the ODS is in (at least) 3rd normal form whilst the dimensional store
is in denormalised form.
5. Loading Dimension Tables
In each of the dimension tables there should be a record with a key value of
0 (or –1 if you want) and the attributes are set to either 0 (for numeric
columns), “UNKNOWN” (for character columns such as Description),
“” or a blank string (for other character columns), and a low-value date
such as 1/1/1900 (for date columns). These records are called the unknown records.
They are not loaded in the normal daily load, but are created as part
of the data warehouse installation / setup.
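A sketch of creating such an unknown record at installation time, using a
hypothetical customer dimension:

    -- Insert the unknown record into the customer dimension at install time
    SET IDENTITY_INSERT dw.dim_customer ON;
    INSERT INTO dw.dim_customer
      (customer_key, customer_natural_key, customer_name, city, created_date)
    VALUES
      (0, 0, 'UNKNOWN', '', '1900-01-01');   -- 0 for numeric, 'UNKNOWN'/blank for character, low-value date
    SET IDENTITY_INSERT dw.dim_customer OFF;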
There are also other data that are preloaded as part of data warehouse installation,
such as the date dimension and the time of day dimension. Some people think
it is good to populate the date dimension (one row for a day) for 100 or 200
years, so we don’t have to extend it ever. This is generally not a good
practice as the date dimension will be referenced by (almost) every single fact
table in the warehouse and a larger table would slow down query performance. Some people
take the opposite approach: populate the date dimension only for 1 or 2 years
(to maximise query performance) and extend it every year or two. This is also
not a good approach because of the administration overhead and because of the
risk: the date dimension is used everywhere, and if we mess around with it every year,
chances are one mistake could break the whole warehouse. A better
approach is to agree with the business users how far back we want to load the
history. Normally this is 2-3 years and in extreme cases 10 years. Then we populate
the date dimension from that year up to 5 years into the future, and every 5 years we
extend it by another 5 years.
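A minimal sketch of populating the date dimension for such a range (the table
structure and the dates are just examples):

    -- Populate dw.dim_date with one row per day between the agreed start and end dates
    DECLARE @d datetime, @end_date datetime;
    SET @d = '2004-01-01';          -- how far back the business wants the history
    SET @end_date = '2011-12-31';   -- 5 years into the future at installation time
    WHILE @d <= @end_date
    BEGIN
        INSERT INTO dw.dim_date
          (date_key, full_date, calendar_year, calendar_month, day_of_month, day_name)
        VALUES
          (CONVERT(int, CONVERT(char(8), @d, 112)),   -- e.g. 20040101
           @d, YEAR(@d), MONTH(@d), DAY(@d), DATENAME(weekday, @d));
        SET @d = DATEADD(day, 1, @d);
    END;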
Fiscal dates are generally known for the next 5 years because they follow a constant
rule. For example, the fiscal year could be defined as 13 equal fiscal periods
consisting of 4 fiscal weeks each, with the fiscal year running from 1st August
to 31st July. Or the fiscal year may be divided into fiscal periods using
a pattern such as 445445445445, which means the first and second fiscal periods
consist of 4 fiscal weeks, the third period consists of 5 fiscal weeks, the fourth
and fifth periods consist of 4 fiscal weeks, the sixth period consists of 5 fiscal
weeks, and so on.
If we know the fiscal calendar for the next 5 years, we load the fiscal calendar
into the date dimension when preloading the warehouse at installation time.
If the fiscal calendar changes every year and we only know it one year in advance,
we load the fiscal calendar into the date dimension every year but we should
still populate the date dimension for the next 5 years at installation time,
leaving the fiscal date columns at their default or assumed values.
Each source table should ideally have a natural key, i.e. the column that we
use to identify duplicates when we update the data warehouse dimension table.
But, in reality, they don’t always have natural keys. In these cases, when
we upsert (update and insert) into the dimension table, we compare all columns.
Surrogate key management has always been at the heart of loading dimension tables.
The simplest way (and probably the best way) is to use an identity column (or a
sequence for those who use Oracle) and let the RDBMS manage the surrogate key.
I’ve seen cases where the key is not surrogate/meaningless and where it is
managed manually (using a subroutine), which is not a good practice.
In most cases a normal 4-byte integer (2 billion, ish) is enough for surrogate
keys, but there are cases where we need to use bigint (if you use SQL Server),
especially when it is a large dimension and SCD 2 is in place. In any case
the surrogate key should not be varchar (or varchar2 in Oracle), and it should
not be decimal (or number in Oracle or Decimal in Teradata). But time and time
again I have found intelligent varchars used as surrogate keys. Not only does this
impact both ETL and end user performance, it also makes things more complicated.
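A sketch of letting the RDBMS manage the surrogate key with an identity column (the
dimension structure is only an example):

    -- Customer dimension with an RDBMS-managed surrogate key
    CREATE TABLE dw.dim_customer
    ( customer_key          int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
      customer_natural_key  int               NOT NULL,             -- natural/business key from the source
      customer_name         varchar(100)      NULL,
      city                  varchar(50)       NULL,
      created_date          datetime          NULL
    );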
Referential Integrity (RI) has always been a long-standing debate in key management.
One group says that we do need to impose RI because of data integrity.
The other group says that we should not impose RI because it slows down
performance a lot and RI is managed within the ETL anyway. RI is normally implemented
using foreign keys, but sometimes also in the form of cascade delete / cascade
update, (after) triggers and, in SQL Server, also check constraints.
My view is that we need to implement RI, because in many cases the performance
degradation is not that much and because data integrity is paramount in data
warehousing.
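For example, implementing RI between a fact table and the customer dimension with a
foreign key might look like this (the names are illustrative):

    -- Ensure every fact row points to an existing customer dimension row
    ALTER TABLE dw.fact_sales
    ADD CONSTRAINT fk_fact_sales_dim_customer
        FOREIGN KEY (customer_key) REFERENCES dw.dim_customer (customer_key);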
Static dimensions are dimensions that we never update. They are set up at DW
installation and stay constant throughout the year. Examples of these dimensions
are some of the decode tables, for example: order type codes (O for order, Q
for quote, etc) and delivery codes (S for scheduled, OD for on demand, etc). These
dimensions normally contain only very few rows, e.g. less than 10 rows. Larger
decode tables which contain 25 rows or more are not really that static; they
tend to change a few times a year. In these cases I’d advise setting up
an ad hoc ETL to bring the updates into the warehouse when required, on an ad hoc
basis. Remember to put a DQ check on the ETL import. Remember also not to put these
ad hoc ETL jobs in the main batch, but to separate them and not schedule them.
They should be ad hoc (running on demand), not scheduled!
It is a matter of principle that nobody touches the data warehouse directly.
Nobody updates the data inside the warehouse directly! This includes Mr. DWA
(Data Warehouse Administrator). Hmmm, that’s a new role that has developed
in recent years – attention Mr. DBAs! All updates must be done through
ETL, which is data firewalled (and therefore quality assured), logged (and
therefore auditable), and under a control system (and therefore recoverable on
failure). Even those ‘nearly static’ dimensions absolutely need
to be put in the ETL build. Yes, without a doubt. Even if the changes only happen
once a year on these dimensions. It’s worth repeating again: no changes
go into the warehouse without going through the data firewall. If the data is crap,
the data warehouse is useless.
In part 2 we will examine the basic steps of the update and insert operation
(upsert) into a dimension table.
Vincent Rainardi
17th October 2006