Upsert Dimension Table
(Data Warehouse Loading – Part 2)
In part 1 we went through when and how to load, staging, ODS and some considerations
for loading dimension tables. Now let's get down to the real work: the basic
steps to update and insert (upsert) into a dimension table. After that we will
finish the rest of dimension table loading.
The dimension table we want to load has 5 columns: surrogate_key, natural_key,
attribute1, attribute2 and load_time. The surrogate key column is the dimension's
primary key and it is an identity(1,1) column. The natural key is the primary key
in the source system. Attribute1 and attribute2 are the dimension's attributes.
Load_time is the timestamp when the row was last changed; this will be useful
later on for loading the changes into further dimensional stores or into data
marts.
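To make the structure concrete, here is a minimal sketch of what the dimension
table DDL could look like in SQL Server; the data types are assumptions, not
taken from a real system:
CREATE TABLE dw.dbo.dimension
( surrogate_key int IDENTITY(1,1) NOT NULL PRIMARY KEY
, natural_key int NOT NULL
, attribute1 varchar(20) NULL
, attribute2 varchar(20) NULL
, load_time datetime NULL -- ideally a datetime, not just a date
)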
Source table
natural_key | attribute1 | attribute2 | created | last_updated |
10 | A | AAA | 21/11/2005 | 02/10/2006 |
20 | B | BB | 02/10/2006 | 02/10/2006 |
Dimension table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AA | 21/11/2005 |
Note:
1. In this example load_time, created and last_updated are all dates.
In practice they should ideally be datetime values, e.g. 02/10/2006 15:17:02.2341.
2. attribute2 on the source table for natural_key 10 is 'AAA'.
Yes, it is different from the dimension record, where the attribute is 'AA'.
Step 1. Create The Temp Table
SELECT * INTO #dim FROM dw.dbo.dimension WHERE 1 = 0
The temp table after it is created:
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
Note:
1. By deliberately specifying a false condition, we create an empty temp table
that takes the structure of the target dimension table.
2. If the table is big, it may be better for performance to use a permanent
table rather than a temporary table (see the sketch after these notes).
3. If you use SQL Server you may want / need to increase the size of tempdb,
depending on the size of your load.
4. SELECT INTO is better than CREATE TABLE because it automatically adapts
to structural changes in the dimension table.
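For note 2, a minimal sketch of the permanent-table alternative (the work table
name is an assumption):
IF OBJECT_ID('dw.dbo.dimension_work') IS NOT NULL
  DROP TABLE dw.dbo.dimension_work -- drop and recreate instead of using #dim
SELECT * INTO dw.dbo.dimension_work FROM dw.dbo.dimension WHERE 1 = 0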
Step 2. Populate The Temp Table
SET IDENTITY_INSERT #dim ON -- SELECT INTO copied the identity property to #dim, so allow explicit values

INSERT INTO #dim (surrogate_key, natural_key, attribute1, attribute2, load_time)
SELECT ISNULL(dim.surrogate_key, 0), src.natural_key, src.attribute1, src.attribute2, @current_load_time
FROM stg.dbo.source src
LEFT JOIN dw.dbo.dimension dim ON src.natural_key = dim.natural_key

SET IDENTITY_INSERT #dim OFF
The temp table after it is populated:
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
0 | 20 | B | BB | 17/10/2006 |
Source table
natural_key | attribute1 | attribute2 | created | last_updated |
10 | A | AAA | 21/11/2005 | 02/10/2006 |
20 | B | BB | 02/10/2006 | 02/10/2006 |
Dimension table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AA | 21/11/2005 |
Note:
1. Here we populate the temp table from the source table in the staging area.
2. Notice that we take the surrogate key from the dimension table, by joining
with the source table on the natural key.
3. Because it is a left join, dim.surrogate_key will be NULL if the row doesn’t
exist on the dimension table.
4. The NULL surrogate keys are then replaced by 0, using the ISNULL function.
5. Notice that we don't populate the load_time with getdate(), but with
a variable named @current_load_time (see the sketch after these notes). This
variable is populated with getdate() at the beginning of the loading batch and
used by all processes in the batch. This is necessary so that in the event of
failure, we know the point in time we have to restart the process from.
6. If you use a snowflake schema rather than a star schema, on this step you
also need to look up the foreign keys on the child dimension table.
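For note 5, a minimal sketch of how @current_load_time could be set, once, at
the beginning of the batch:
DECLARE @current_load_time datetime
SET @current_load_time = GETDATE() -- set once, then reused by every step in the batch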
Step 3. Update The Dimension Table
UPDATE dim
SET dim.attribute1 = tmp.attribute1,
dim.attribute2 = tmp.attribute2,
dim.load_time = tmp.load_time
FROM dw.dbo.dimension dim
INNER JOIN #dim tmp
ON tmp.surrogate_key = dim.surrogate_key
WHERE tmp.surrogate_key <> 0
AND (dim.attribute1 <> tmp.attribute1 OR dim.attribute2 <> tmp.attribute2)
Temp table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
0 | 20 | B | BB | 17/10/2006 |
Dimension table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
Note:
1. Here we update the dimension table, based on the data on the temp table.
In this case attribute2 was updated from ‘AA’ to ‘AAA’.
2. We only update the rows where the temp table's surrogate key is not
0, i.e. the rows that already exist on the dimension table. The rows where the
surrogate key is 0 (which do not exist on the dimension table) will be inserted
into the dimension table later on.
3. Notice that when updating rows we update the load time column as well.
4. The last line (the attribute comparison) specifies which changes we want to
pick up, so that unchanged rows are not touched. If the attributes can be NULL,
wrap them in ISNULL before comparing, because NULL <> 'A' does not evaluate to true.
Step 4. Insert Into Dimension Table
INSERT INTO dw.dbo.dimension (natural_key, attribute1, attribute2, load_time)
SELECT natural_key, attribute1, attribute2, load_time
FROM #dim
WHERE surrogate_key = 0
Temp table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
0 | 20 | B | BB | 17/10/2006 |
Dimension table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
2 | 20 | B | BB | 17/10/2006 |
Note:
1. Notice that we don't specify the surrogate key column when inserting, i.e.
we let the RDBMS handle it.
2. When creating / setting up the data warehouse, we set the surrogate key column
on the dimension table to identity(1,1).
In every step above we need to do error handling and logging. Error handling
is important because if the loading fails on any step, we need to be able to
recover from the failure. Logging is important to know exactly what happened
at each step, i.e. how many records were processed, etc.
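As an illustration, here is a hedged sketch of error handling and logging
wrapped around step 4; the log table etl.dbo.load_log and its columns are
assumptions, not an existing structure:
DECLARE @rows int
BEGIN TRY
  INSERT INTO dw.dbo.dimension (natural_key, attribute1, attribute2, load_time)
  SELECT natural_key, attribute1, attribute2, load_time
  FROM #dim
  WHERE surrogate_key = 0
  SET @rows = @@ROWCOUNT -- capture immediately, before another statement resets it
  INSERT INTO etl.dbo.load_log (step_name, rows_processed, log_time)
  VALUES ('step 4 insert', @rows, GETDATE())
END TRY
BEGIN CATCH
  INSERT INTO etl.dbo.load_log (step_name, error_message, log_time)
  VALUES ('step 4 insert', ERROR_MESSAGE(), GETDATE())
END CATCH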
At the end of the program, we should not forget to clean everything up, i.e.
drop the temp table(s) and follow the control protocol, e.g. set the process to
a complete state, etc.
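A minimal cleanup sketch (the batch control table and the @batch_id variable
are assumptions):
DROP TABLE #dim
UPDATE etl.dbo.batch_control
SET status = 'complete'
WHERE batch_id = @batch_id -- @batch_id assumed to be set at the start of the batch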
The above code shows how to do the upsert with native SQL, which is very
fast. But it is worth noting here that good dedicated ETL tools such as Informatica
and Data Integrator have facilities to do in-memory lookups, which also have very
good performance. Disk-based lookup is definitely not the way to go here,
as it is slow, mainly because there are a lot of SQL statements to execute,
i.e. one for each lookup, and each of these statements 'touches'
the disk, which is a costly operation.
Deletion
If a row is deleted in the source table, then before we delete the row in the
dimension table, we must check that the surrogate key of that deleted row is not
used in any of the fact tables. And if we use a snowflake schema, we should also
check any dimension tables that refer to this dimension table. If the surrogate
key is used anywhere else, then we cannot delete the row.
The above causes an ambiguity: if a dimension row exists in the data warehouse,
how can we tell whether it still exists in the source system? Because of that we
don't usually delete rows from dimension tables. Instead, we mark the row as
'inactive'. Some people call this a "soft delete". In
each dimension table, we have several standard columns at the end: 3 timestamp
columns (last updated time, created time, and deleted time) and 1 'IsActive'
flag column. It is this IsActive column that we update from 1 to 0 when the
row is deleted in the source system, as sketched below.
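A hedged sketch of the soft delete, assuming the flag and timestamp columns are
named IsActive and deleted_time:
UPDATE dim
SET dim.IsActive = 0, -- flag the row instead of deleting it
    dim.deleted_time = @current_load_time
FROM dw.dbo.dimension dim
WHERE dim.IsActive = 1
AND NOT EXISTS -- the natural key no longer exists in the source
  (SELECT 1 FROM stg.dbo.source src WHERE src.natural_key = dim.natural_key)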
DDS Dimension
The Dimensional Data Store (DDS) is a star schema (or snowflake) relational database
which is populated from the ODS. The DDS is then used to populate multidimensional
databases such as Cognos PowerPlay, Hyperion Essbase, Business Objects universes
or SQL Server Analysis Services. The DDS is also used by report programs (such as
Reporting Services) and dashboard programs. Excel BI Add On and ProClarity Desktop
also connect to the DDS to get dimensional data.
When we populate a dimension table in the DDS based on the data in the ODS,
we have 2 ways of doing it: a) empty and reload, or b) incremental. Empty and
reload is literally that: empty the DDS dimension table and then reload all
records from the ODS. Incremental means that we load only the changed data from
the ODS. The changed data is identified using the 3 timestamp columns
mentioned earlier.
When emptying a DDS table we should not use "delete from table",
i.e. delete all records. Instead, we need to use the "truncate table"
command. This applies to both SQL Server 2005 and Oracle 10g. Truncate is almost
instantaneous, because it is only minimally logged. Delete takes a lot longer
because every row is logged. In Teradata we do not have truncate; your best bet
is to use DELETE ... ALL. A MultiLoad delete step can also be used, but it could
be a bit slower. When you use the incremental approach, try to get the 3 timestamp
columns indexed if possible (see the sketch below), as this will significantly
reduce the loading time, especially if it is a large dimension.
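A minimal sketch of the incremental change detection, assuming the 3 timestamp
columns are named created_time, last_updated_time and deleted_time, and
@last_dds_load_time holds the time of the previous DDS load:
SELECT natural_key, attribute1, attribute2
FROM ods.dbo.dimension
WHERE created_time > @last_dds_load_time -- new rows
   OR last_updated_time > @last_dds_load_time -- changed rows
   OR deleted_time > @last_dds_load_time -- soft-deleted rows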
Generally speaking, truncate and reload is faster than incremental. If the
number of rows is below 10,000 and the dimension width (total bytes in a row)
is less than 1,000, truncate and reload is faster. If your dimension has more
than 10,000 rows and the width is more than 1,000 bytes, then there is a
significant chance that the incremental approach can beat truncate and reload
on performance.
Overlay
Consider a case in the airline industry, where you have a dimension called
crew, which contains the pilots, co-pilots, navigators and cabin crew. There
are 2 source systems that can be used to populate this dimension: flight information
system (FIS) and human resources information system (HRIS). The FIS has an air_crew
table, which is keyed on crew_id. The HRIS has an employee table, which is keyed
on employee_id. Fortunately, the air_crew table also contains employee_id. The
employee table is richer than the air_crew table, i.e. it contains 36 attributes
including title, name, address, employment band, benefits, employment start date,
normal working hours, salary, normal position, secondary position, rank, employment
status, date of birth, bank details, date created and date last modified. The
air_crew table, on the other hand, only contains title, name and positions plus
2 timestamp columns. The employee table only contains permanent employees (700
records); it does not contain non-employees or temps, whereas the air_crew
table contains all crew on every flight (1000 records), both temp and permanent.
The crew dimension in the data warehouse contains title, name, address, working
hours, positions, date of birth, employment status and rank, plus the standard
dimensional control columns. In this case we need to load from 2 sources. This
is called an overlay. How do you load with an overlay? Which table is the base
table and which one is the top-up? The employee table is richer by columns but
not complete by rows; air_crew is not rich but is complete by rows.
As a principle we need to take the table that is complete by rows as the base
table (primary), and the table that is richer by columns as the top-up table
(secondary). Here is how it works: load the base first, by using the upsert,
setting the missing attributes to their standard values or blank. After we have
the complete number of rows in the dimension table, we then load the secondary
table, by using update only, as sketched below. We don't do insert
or delete when loading the secondary, unless we are using SCD 2 – see below.
Don't forget to use ISNULL (if you use SQL Server) to change null into
0, a low value date or a blank string.
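A hedged sketch of the top-up update from the secondary source; the dim_crew
table, the employee_id join column and the attribute names are assumptions:
UPDATE dim
SET dim.address = ISNULL(emp.address, ''), -- change null to a blank string
    dim.date_of_birth = ISNULL(emp.date_of_birth, '19000101'), -- or a low value date
    dim.employment_status = ISNULL(emp.employment_status, ''),
    dim.load_time = @current_load_time
FROM dw.dbo.dim_crew dim
INNER JOIN hris.dbo.employee emp ON emp.employee_id = dim.employee_id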
SCD 2
Slowly Changing Dimension (SCD) type 2 is basically a technique to keep the
history of a dimension. This is done by marking the existing dimension record
as 'outdated' or 'not valid anymore' and then creating
a new record containing the new attributes, stamped as 'valid',
as illustrated below:
Before the change
surrogate_key | natural_key | attribute | effective_date | expired_date | most_recent |
1 | 10 | A | 1/1/1900 00:00:00 | 31/12/2200 00:00:00 | 1 |
After the change
surrogate_key | natural_key | attribute | effective_date | expired_date | most_recent |
1 | 10 | A | 1/1/1900 00:00:00 | 6/10/2006 10:21:19 | 0 |
2 | 10 | B | 6/10/2006 10:21:20 | 31/12/2200 00:00:00 | 1 |
The concept of SCD was first described by Ralph Kimball and Margy Ross in
their book The Data Warehouse Toolkit, chapter 4. They later wrote a few
articles on SCD: SCD Are Not Always As Easy As 123 (Margy Ross and Ralph
Kimball, March 2005), Both Historical and Current Perspectives (Margy Ross,
May 2004), Combining SCD Techniques (Margy Ross, October 2000) and Processing
Slowly Changing Dimensions During Initial Data Load (Lawrence Corr, June
2000). If SCD is new to you, it would help to familiarize yourself with the
concept by reading these materials.
Now let's talk about loading data into an SCD type 2 dimension table.
Each SCD 2 table should have effective date and expiry date columns. Otherwise,
if we were asked what the condition of the dimension was at a certain date in
the past, we could not answer it. Using these 2 date columns we know the
condition (attribute values) at any point in time.
When a dimension record is created, the Effective Date column is set to a low
value date (such as 1/1/1900) and the Expired Date is set to a high value date,
such as 31/12/2200. This indicates that the record is valid at any time,
i.e. there is no time limit. When we have an update, we set the Expired Date
column of the existing row to the current time. Then we insert a new row, setting
its Effective Date column to the current time and its Expired Date to the high
value date. This way, we explicitly declare the validity period of both the old
row and the new row. Please see the table above (the one labelled "After the
change") for an illustration.
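Using the example rows above, the expire-and-insert could look like this sketch:
UPDATE dw.dbo.dimension
SET expired_date = @current_load_time, -- close the validity period of the old row
    most_recent = 0
WHERE natural_key = 10
AND most_recent = 1

INSERT INTO dw.dbo.dimension (natural_key, attribute, effective_date, expired_date, most_recent)
VALUES (10, 'B', @current_load_time, '22001231', 1) -- new row valid from now to the high value date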
The Effective Date and the Expired Date are very useful, but unfortunately they
are slow to query. The most frequent use of an SCD 2 dimension is to support
fact table roll up with the current attribute values. To do it using Effective
Date and Expired Date, our query would be something like where Effective_Date <
@todays_date and Expired_Date > @todays_date. Hence the idea of a most_recent
column, because most queries are about the active row, not the history. The
most_recent flag is a binary or bit column which indicates which row is the
active one. Now our query becomes where most_recent = 1, which is a lot quicker
to execute, especially if the most_recent column is indexed. Well, the
most_recent column should be indexed anyway.
In practice SCD 2 is used in conjunction with SCD 1. Please read the article
by Ralph Kimball and Margy Ross titled "SCD Are Not Always As Easy As 123".
SCD 3
In Slowly Changing Dimension type 3 we keep the old attribute as a column.
Before the change
surrogate_key | natural_key | attribute1 | effective_date1 | attribute2 | effective_date2 |
1 | 10 | A | 26/12/2005 00:00:00 | | |
After the change
surrogate_key | natural_key | attribute1 | effective_date1 | attribute2 | effective_date2 |
1 | 10 | B | 6/10/2006 00:00:00 | A | 26/12/2005 00:00:00 |
The effective date columns help us determine the value of the attribute at
a specific point in time. Yes, we need an effective date column for each
attribute column. Yes, we can keep as many previous values as we like: for
example, if we want to keep 5 previous values, then we need 6 pairs of columns.
In practice this is a painful and slow process, both for updating and for
querying. Assuming we keep 5 previous values, for every updated row we need
to move attribute4 to attribute5 (overriding it), attribute3 to attribute4,
2 to 3, 1 to 2 and the new attribute value into attribute1 – and the effective
date columns too, as sketched below. See why it's slow?
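A hedged sketch of the shift for the simple 2-pair table above. Note that in
SQL Server the right-hand sides of an UPDATE are all evaluated against the
pre-update values, so the shift works in a single statement:
UPDATE dw.dbo.dimension
SET attribute2 = attribute1, -- move the current value into the history column
    effective_date2 = effective_date1,
    attribute1 = 'B', -- store the new value
    effective_date1 = @current_load_time
WHERE natural_key = 10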
The query is fast only if we want to know the current values, because the
current values are kept in the attribute1 column. But if we want to know the
value of the attribute at a particular point of time in the past, it is
complicated and slow. In practical implementations this structure is not
effective for answering questions like "give me the value of the attributes
on a certain date", or for rolling up historical facts to current attribute
values, so in practice SCD 3 is used in conjunction with SCD 2, as illustrated
by Margy Ross in her article Combining SCD Techniques (October 2000),
particularly her last paragraph as quoted here:
“We've described a hybrid approach that combines the three fundamental
SCD techniques. We're creating new rows to capture change (Type 2), adding attributes
to reflect an alternative view of the world (Type 3), which are overwritten
for all earlier dimension rows for a given product (Type 1). As a student recently
suggested, perhaps we should refer to this as Type 6 (2+3+1)…”
In the next part we will continue with loading the fact tables.
Vincent Rainardi
20th October 2006