SQLServerCentral Article

Data Warehouse Loading


Loading data into a data warehouse should not be that difficult, right? A data warehouse consists of database tables, so, to put it simply, all we need to do is put some records into these tables, right? Well, that is generally the idea, but there are several things that we need to consider as well.

Some of the points we may want to consider are listed below, which we will

discuss one by one.

  1. When And How To Load
    • When not used, restrict access
    • Flipping
    • After disconnect from source
    • Many ways and source types
    • Bulk upload
    • Incremental
  2. The Stage
    • Database or files
    • Timestamp column
    • Source system column
  3. Data Firewall
    • Data Firewall / Data Quality / Information Quality
    • Bad/good data
    • Reject/correct/allow
    • Allowing bad data
    • Reporting and correction
    • Static values and statistical comparison
    • No data goes into DW unless …
    • If the data is crap the DW is useless.
  4. ODS
    • Operational data store or source
    • Key management
    • ODS firewall
  5. Loading Dimension Tables
    • Unknown records
    • Date and time of day dimension
    • Fiscal dates
    • Natural keys
    • Surrogate key management
    • Referential integrity
    • Static dimensions: are they really?
    • Nobody touches the DW data directly
    • Mr. DWA not DBA
    • Upsert operation: by example.
    • Deletion
    • DDS dimension
    • Overlay: more than 1 source
    • SCD 2: validity range, active flag, insert and update
    • SCD 3: shifting the columns
  6. Loading fact tables:
    • Key mgt: slim dim table – customer
    • Natural key changes: examples
    • Unknown records
    • Upsert
    • Deletion
    • Source PK: if source allows dup? Dedup
    • Index: take them off (10% rule)
    • Internal (really?) DW entity: targets
    • Transactional and snapshot fact table
    • Purging/pruning: parameterize
    • Reloading history
  7. Logging: audit
    • Number of rows, timestamp: for performance analysis
    • Log the unknown
    • DQ log/audit
  8. Control: failure and restart.
    • Invocation and cycle dependency
    • Stage ETL invokes DW ETL: sync – minimise time lost
    • Two independent cycles – flexibility
    • Restart modes: from point of failure or beginning, dates
    • Simulation: power failure, source connectivity
    • Scenario: ERP down, backup overrun
  9. Restoring DW
    • Real life: Disk failure – RAID
    • Save stage
    • Backup to disk for 3 days
    • Tape > 3 days: test restore!
    • Offsite
  10. Multidimensional Database
    • Reprocess dimensions
    • Reprocess cubes - incremental
  11. EAI and EII

Looking at the items above, it is going to be a very long article, so I have split it into several parts. Please bear with me as we go through it. If you have any comments or questions, or want to discuss anything with me, just click the link above and put them in the forum; I will be happy to reply or discuss them.

1. When And How To Load

Ideally we want to load the data warehouse when it is not used by the users, i.e. when the users are not running queries or reports against the data warehouse. This is necessary to minimise the performance degradation experienced by the users: user queries may run significantly slower than normal because we are loading the warehouse. Sometimes this is also necessary to prevent anomalies or errors: users may experience data inconsistencies during the loading window. Because of this it is sometimes necessary to prevent users from accessing the data warehouse while we are populating it.

There is a technique that enables a data warehouse to be used while we are populating it. Basically we have two copies of the data warehouse tables. While we are populating copy A, all user queries are directed to copy B. Then, when we have finished populating copy A, all user queries are redirected to copy A and we populate copy B. This technique is quite popular in custom data warehouse development and it is used in some off-the-shelf data warehouse applications such as E.piphany CRM (now acquired by SSA). There is one caveat when using this technique, apart from the obvious requirement of extra disk space: because the table names are not fixed we need to use dynamic SQL, which affects user query performance a little, as well as the loading performance.
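
Below is a minimal sketch of this flipping technique, using hypothetical table names FactSales_A and FactSales_B. From SQL Server 2005 a synonym can be used to redirect user queries, which is one way to implement the flip without writing dynamic SQL yourself.

  -- One-off setup: point user queries at copy A
  CREATE SYNONYM dbo.FactSales FOR dbo.FactSales_A;

  -- Nightly load: populate the copy that is NOT currently behind the synonym
  TRUNCATE TABLE dbo.FactSales_B;
  INSERT INTO dbo.FactSales_B (date_key, customer_key, amount)
  SELECT date_key, customer_key, amount
  FROM stage.Sales;

  -- Flip: redirect user queries to the freshly loaded copy
  DROP SYNONYM dbo.FactSales;
  CREATE SYNONYM dbo.FactSales FOR dbo.FactSales_B;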

To minimise the load on the source systems, we want to load our data warehouse after we have disconnected from the source. So we get the data out of the source systems as quickly as we possibly can into the stage, then disconnect from the source systems, then start loading into the warehouse. If we do the loading at the same time as the extracting, the overall elapsed time could be shorter, but the load on the source systems could be higher.

There are many ways to load a data warehouse. We can load from a message queue, Changed Data Capture (CDC) files, web services or Enterprise Application Integration (EAI) server events. If we use ETL, then to get good loading performance we ideally want to populate the data warehouse using bulk upload. In SQL Server 2005 we want to use the raw file format, which gives us the best performance. Also, ideally we want to populate the data warehouse incrementally, i.e. load only the data which has changed since the last extraction, as sketched below.
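
As an illustration only, here is a minimal T-SQL sketch of an incremental extract into the stage, assuming the source table has a last_updated column and that we keep the previous extraction time as a watermark in a hypothetical control table.

  DECLARE @last_extract datetime, @this_extract datetime;

  -- hypothetical control table holding one watermark per source table
  SELECT @last_extract = last_extract_time
  FROM etl.etl_control
  WHERE table_name = 'Sales';

  SET @this_extract = GETDATE();

  -- pull only the rows changed since the last extraction into the stage table
  INSERT INTO stage.Sales (order_id, customer_id, amount, last_updated)
  SELECT order_id, customer_id, amount, last_updated
  FROM source.Sales
  WHERE last_updated > @last_extract
    AND last_updated <= @this_extract;

  -- move the watermark forward only after the extract has succeeded
  UPDATE etl.etl_control
  SET last_extract_time = @this_extract
  WHERE table_name = 'Sales';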

2. The Stage

If your stage is a database rather than a file system, you would probably want to add a few columns to each of your stage tables. Columns such as a timestamp (the time when the record was loaded into the stage) and a source system ID (which source system the data came from) can be quite useful when loading the stage data into the data warehouse.
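
A minimal sketch of such a stage table, using hypothetical table and column names:

  CREATE TABLE stage.Sales
  ( order_id         int,
    customer_id      int,
    amount           money,
    load_timestamp   datetime NOT NULL DEFAULT GETDATE(), -- when the row arrived in the stage
    source_system_id int NOT NULL                         -- which source system it came from
  );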

General considerations about the benefits and the disadvantages of having a

stage are discussed in this article.

3. Data Firewall

Also known as data quality or information quality. Some people would argue about the differences between these three terms, but that discussion is for another article. For now all we need to know is that when loading data into the data warehouse, we need to have some kind of data filter: bad data goes into a quarantine area, good data goes into the warehouse. That is why this filter is called a data firewall, because of its similarity to IP filtering.

There are three possible things we can do with bad data: reject it, correct it, or allow it through without correction. Rejecting it means we don't put it into the warehouse. Correcting it means modifying the data based on a predefined rule and then putting it into the warehouse. The third option is to allow the bad data into the warehouse. In all three cases we need to log the bad data.

Allowing bad data into the warehouse is appropriate when the bad data is within a tolerance limit; it is a warning rather than an error. For example, say the normal value is between 5000 and 6000. If the value is between 4000 and 5000 or between 6000 and 7000, it is still acceptable, so we put it in the warehouse. If the value is less than 4000 or more than 7000 we reject it. If the value is 5500 (good data) we allow it through and we don't log it, but if the value is 6500 (bad data but within tolerance) we allow it through and we log it.
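
A minimal sketch of this tolerance-band rule as a firewall check, using the example ranges above and hypothetical table and column names:

  SELECT order_id,
         amount,
         CASE
           WHEN amount BETWEEN 5000 AND 6000 THEN 'good'    -- allow, no log
           WHEN amount BETWEEN 4000 AND 7000 THEN 'warning' -- allow and log
           ELSE 'reject'                                    -- quarantine and log
         END AS firewall_result
  FROM stage.Sales;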

Examples of corrections that we apply automatically to the data before allowing it into the warehouse are expanding abbreviations (such as changing “Herts” to “Hertfordshire” and “Adv.” or “Adv” to “Advanced”) and upper/lower case conversion (such as replacing “JOHN CUNNINGHAM”, “john Cunningham” or “John CUnningham” with “John Cunningham”). Another example is changing “MR” and “mr” to “Mr”. You may have heard people using the term scrubbing or data scrubbing. Well, this is it; the above are real examples of data scrubbing. It may be boring and not sexy, but keeping the data tidy and neat is core to data quality.
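
A minimal sketch of such scrubbing rules applied to a stage table, using hypothetical column names (per-word case conversion, e.g. turning “JOHN CUNNINGHAM” into “John Cunningham”, is usually done with a small user-defined function or in the ETL tool, so it is left out here):

  UPDATE stage.Customer
  SET county = CASE county
                 WHEN 'Herts' THEN 'Hertfordshire'
                 ELSE county
               END,
      title  = CASE
                 WHEN UPPER(REPLACE(title, '.', '')) = 'MR' THEN 'Mr'
                 ELSE title
               END;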

The quarantine area contains the rejected data, which is then processed (say daily) and reported to the individual data owners who are responsible for making sure that these errors are corrected in the source system. The next time the data is extracted from the source system, the ETL will pull the corrected data into the warehouse.

The easiest type of data firewall rule is a comparison with a static number or a range, such as the example above. But in the data firewall we can also do a statistical comparison, i.e. if the value is within plus or minus 10% of the last 90 days' average, allow it through, otherwise reject the data. We can also compare the incoming data with the data that is already in the warehouse. We could also compare one part of the incoming data with another part of the incoming data, for example in the case of a header and a detail table. But we have to be careful here: the other part that we are referencing may not be correct or complete, especially if the two parts are in different extraction cycles.
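
A minimal sketch of such a statistical rule, comparing today's total in the stage against the average of the last 90 days already in the warehouse (hypothetical table and column names):

  DECLARE @avg90 money, @today money;

  SELECT @avg90 = AVG(daily_total)
  FROM dw.FactSalesDaily
  WHERE snapshot_date >= DATEADD(day, -90, GETDATE());

  SELECT @today = SUM(amount)
  FROM stage.Sales;

  IF @today NOT BETWEEN @avg90 * 0.9 AND @avg90 * 1.1
      PRINT 'Reject: today''s total is outside the 90-day tolerance'; -- in practice, log to the DQ audit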

By far the most common type of firewall rule is the database integrity rule, such as null/not null checks, foreign key / orphan checks, and character length checks. Why would we want to check database rules if they are enforced at the database level anyway? Because if the check is done in the DQ system, it will be reported and corrected, and we can prevent the bad data from entering the warehouse, or correct the data and allow it through. A DQ system with its data firewall rules is much more flexible than the database's referential integrity mechanisms, and everything is logged, so it is auditable at any time.

It really pays to spend the time building good, solid firewall rules, because at the end of the day, if the quality of the data in the warehouse is crap, then the data warehouse is useless. It is very important to ensure that the only way to load data into the warehouse is through the data firewall. No data should go into the warehouse without going through the data firewall. I can't stress this enough. If there is only one thing we learn from this article today, I hope it is this data quality principle: never allow data into the warehouse unless it goes through the firewall.

4. Operational Data Store (ODS)

Some people confuse the operational data store (ODS) with the operational data source. The former is part of the data warehouse and the latter is the source system(s). The ODS is the normalised database between the stage and the dimensional store. The general advantages and disadvantages of having an ODS have been discussed in this article.

With regard to data warehouse loading, one advantage of having an ODS is that it makes key management in the dimensional store easier. The second advantage is that it enables us to have an ODS firewall, i.e. a set of data quality rules which checks the incoming data from the stage against the data in the ODS. It is easier and faster to do these checks against the ODS than against the dimensional store, because the ODS is in (at least) third normal form whilst the dimensional store is in denormalised form. A simple example of an ODS firewall check is sketched below.
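
A minimal sketch of one such check, flagging incoming stage rows whose customer does not exist in the ODS (hypothetical table names); the rows returned would be quarantined and logged rather than loaded:

  SELECT s.order_id, s.customer_id
  FROM stage.Sales s
  WHERE NOT EXISTS
        ( SELECT 1
          FROM ods.Customer c
          WHERE c.customer_id = s.customer_id );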

5. Loading Dimension Tables

In each of the dimension tables there should be a record with a key value of 0 (or -1 if you prefer) and the attributes set to 0 (for numeric columns), “UNKNOWN” (for character columns such as Description), “” or a blank string (for other character columns) and a low-value date such as 1/1/1900 (for date columns). These records are called the unknown records. They are not loaded on the normal daily load, but are created as part of the data warehouse installation / setup.
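
A minimal sketch of creating the unknown record at installation time, using a hypothetical customer dimension; IDENTITY_INSERT lets us force the key value 0 when the surrogate key is an identity column:

  SET IDENTITY_INSERT dw.DimCustomer ON;

  INSERT INTO dw.DimCustomer (customer_key, customer_id, customer_name, city, created_date)
  VALUES (0, 0, 'UNKNOWN', '', '19000101');

  SET IDENTITY_INSERT dw.DimCustomer OFF;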

There are also other data that are preloaded as part of the data warehouse installation, such as the date dimension and the time of day dimension. Some people think it is good to populate the date dimension (one row per day) for 100 or 200 years so that we never have to extend it. This is generally not good practice, as the date dimension will be referenced by (almost) every single fact table in the warehouse and the extra rows would slow down query performance. Some people take the opposite approach: populate the date dimension for only 1 or 2 years (to maximise query performance) and extend it every year or every other year. This is also not a good approach, because of the administration overhead and because of the risk: the date dimension is used everywhere, and if we mess around with it every year, chances are we will break the whole warehouse if we make a mistake. A better approach is to agree with the business users how far back we want to load the history. Normally this is 2-3 years, and in extreme cases 10 years. Then we populate the date dimension from that year up to 5 years into the future, and every 5 years we extend it by another 5 years.
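
A minimal sketch of populating the date dimension for a fixed range (hypothetical column names; a real date dimension carries many more attributes):

  DECLARE @d datetime;
  SET @d = '20040101';   -- agreed start of the history
  WHILE @d <= '20111231' -- five years beyond the current year
  BEGIN
      INSERT INTO dw.DimDate (date_key, calendar_date, calendar_year, calendar_month, day_name)
      VALUES ( CONVERT(int, CONVERT(char(8), @d, 112)), -- e.g. 20061017
               @d,
               YEAR(@d),
               MONTH(@d),
               DATENAME(weekday, @d) );
      SET @d = DATEADD(day, 1, @d);
  END;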

Fiscal dates are generally known for the next 5 years because they are constant. For example, a fiscal year could be defined as 13 equal fiscal periods of 4 fiscal weeks each, with the fiscal year running from 1st August to 31st July. Or the fiscal year may be divided into fiscal periods using a pattern such as 445445445445, which means the first and second fiscal periods consist of 4 fiscal weeks each, the third period consists of 5 fiscal weeks, the fourth and fifth periods consist of 4 fiscal weeks each, the sixth period consists of 5 fiscal weeks, and so on.

If we know the fiscal calendar for the next 5 years, we load the fiscal calendar into the date dimension when preloading the warehouse at installation time. If the fiscal calendar changes every year and we only know it one year in advance, we load the fiscal calendar into the date dimension every year, but we should still populate the date dimension for the next 5 years at installation time, leaving the fiscal date columns at their default or assumed values.

Each source table should ideally have a natural key, i.e. the column that we use to identify duplicates when we update the data warehouse dimension table. But in reality they don't always have natural keys. In those cases, when we upsert (update and insert) into the dimension table, we compare all columns.

Surrogate key management has always been at the heart of loading dimension tables. The simplest way (and probably the best way) is to use an identity column (or a sequence for those who use Oracle) and let the RDBMS manage the surrogate key. I've seen cases where the key is not surrogate/meaningless, and in those cases it is managed manually (using a subroutine), which is not good practice. In most cases a normal 4-byte integer (roughly 2 billion values) is enough for surrogate keys, but there are cases where we need to use bigint (if you use SQL Server), especially when it is a large dimension and SCD type 2 is in place. In any case the surrogate key should not be varchar (or varchar2 in Oracle), and it should not be decimal (or number in Oracle or Decimal in Teradata). But time and time again I have found intelligent varchars used as surrogate keys. Not only does this impact both ETL and end-user performance, it also makes things more complicated.
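
A minimal sketch of letting the RDBMS manage the surrogate key with an identity column, on a hypothetical customer dimension (use bigint instead of int for very large dimensions or heavy SCD type 2 growth):

  CREATE TABLE dw.DimCustomer
  ( customer_key  int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key, managed by the RDBMS
    customer_id   int NOT NULL,                           -- natural key from the source system
    customer_name varchar(100),
    city          varchar(50),
    created_date  datetime
  );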

Referential integrity (RI) has long been a subject of debate in key management. One group says that we need to impose RI for the sake of data integrity. The other group says that we should not impose RI because it slows down performance considerably and RI is managed within the ETL anyway. RI is normally implemented using foreign keys, but sometimes also in the form of cascading delete / cascading update, (after) trigger constraints and, in SQL Server, check constraints. My view is that we need to implement RI, because in many cases the performance degradation is not that great and because data integrity is paramount in data warehousing.
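
A minimal sketch of enforcing RI between a fact table and a dimension, using hypothetical table names:

  ALTER TABLE dw.FactSales
  ADD CONSTRAINT FK_FactSales_DimCustomer
      FOREIGN KEY (customer_key) REFERENCES dw.DimCustomer (customer_key);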

Static dimensions are dimensions that we never update. They are set up at DW installation and stay constant throughout the year. Examples of these dimensions are some of the decode tables, for example order type codes (O for order, Q for quote, etc.) and delivery codes (S for scheduled, OD for on demand, etc.). These dimensions normally contain very few rows, e.g. fewer than 10. Larger decode tables which contain 25 rows or more are not really that static; they tend to change a few times a year. In those cases I would advise setting up an ad hoc ETL package to bring the updates into the warehouse when required, on an ad hoc basis. Remember to put data quality checks on this ETL import. Remember also not to put these ad hoc ETL packages in the main batch or jobs, but to separate them and not schedule them. They should be ad hoc (running on demand), not scheduled!

It is a matter of principle that nobody touches the data warehouse directly. Nobody updates the data inside the warehouse directly! This includes Mr. DWA (Data Warehouse Administrator). Hmmm, that's a new role that has developed in recent years (attention, Mr. DBAs!). All updates must be done through ETL, which is data firewalled (and therefore quality assured), logged (and therefore auditable), and under a control system (and therefore recoverable on failure). Even those ‘nearly static’ dimensions absolutely need to be put on the ETL build. Yes, without a doubt. Even if the changes only happen once a year on these dimensions. It is worth repeating again: no changes go into the warehouse without going through the data firewall. If the data is crap, the data warehouse is useless.

In part 2 we will examine the basic steps of the update and insert (upsert) operation into a dimension table.

Vincent Rainardi

17th October 2006
