Upsert Dimension Table
(Data Warehouse Loading – Part 2)
In part 1 we went through when and how to load, staging, ODS and some considerations
for loading dimension tables. Now let's get down to the real work: the basic
steps to update and insert (upsert) into a dimension table. After that we will
finish the rest of dimension table loading.
The dimension table we want to load has 5 columns: surrogate_key, natural_key,
attribute1, attribute2 and load_time. The surrogate key column is the dimension's
primary key and it is an identity(1,1) column. The natural key is the primary key
in the source system. Attribute1 and attribute2 are the dimension's attributes.
Load_time is the timestamp when the row was last changed; this will be useful
later on for loading the changes into further dimensional stores or into data
marts.
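To make the structure concrete, here is a minimal sketch of what the dimension
table DDL could look like in SQL Server; the data types are assumptions, not
taken from a real system:
CREATE TABLE dw.dbo.dimension
( surrogate_key int IDENTITY(1,1) NOT NULL PRIMARY KEY
, natural_key int NOT NULL
, attribute1 varchar(20) NULL
, attribute2 varchar(20) NULL
, load_time datetime NULL -- ideally a datetime, not just a date
)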
Source table
natural_key | attribute1 | attribute2 | created | last_updated |
10 | A | AAA | 21/11/2005 | 02/10/2006 |
20 | B | BB | 02/10/2006 | 02/10/2006 |
Dimension table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AA | 21/11/2005 |
Note:
1. In this example load_time, created and last_updated are all dates.
In practice they should ideally be datetime values, e.g. 02/10/2006 15:17:02.2341.
2. attribute2 on the source table for natural_key 10 is 'AAA'.
Yes, it is different from the dimension record, where the attribute is 'AA'.
Step 1. Create The Temp Table
SELECT * INTO #dim FROM dw.dbo.dimension WHERE 1 = 0
The temp table after it is created:
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
Note:
1. By deliberately specifying a false condition, we create an empty temp table
that takes the structure of the target dimension table.
2. If the table is big, it may be better for performance to use a permanent
table rather than a temporary table (see the sketch after these notes).
3. If you use SQL Server you may want / need to increase the size of tempdb,
depending on the size of your load.
4. SELECT INTO is better than CREATE TABLE because it automatically adapts
to structural changes in the dimension table.
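For note 2, a minimal sketch of the permanent-table alternative (the work table
name is an assumption):
IF OBJECT_ID('dw.dbo.dimension_work') IS NOT NULL
  DROP TABLE dw.dbo.dimension_work -- drop and recreate instead of using #dim
SELECT * INTO dw.dbo.dimension_work FROM dw.dbo.dimension WHERE 1 = 0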
Step 2. Populate The Temp Table
SET IDENTITY_INSERT #dim ON -- SELECT INTO copied the identity property to #dim, so allow explicit values

INSERT INTO #dim (surrogate_key, natural_key, attribute1, attribute2, load_time)
SELECT ISNULL(dim.surrogate_key, 0), src.natural_key, src.attribute1, src.attribute2, @current_load_time
FROM stg.dbo.source src
LEFT JOIN dw.dbo.dimension dim ON src.natural_key = dim.natural_key

SET IDENTITY_INSERT #dim OFF
The temp table after it is populated:
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
0 | 20 | B | BB | 17/10/2006 |
Source table
natural_key | attribute1 | attribute2 | created | last_updated |
10 | A | AAA | 21/11/2005 | 02/10/2006 |
20 | B | BB | 02/10/2006 | 02/10/2006 |
Dimension table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AA | 21/11/2005 |
Note:
1. Here we populate the temp table from the source table in the staging area.
2. Notice that we take the surrogate key from the dimension table, by joining
with the source table on the natural key.
3. Because it is a left join, dim.surrogate_key will be NULL if the row doesn’t
exist on the dimension table.
4. The NULL surrogate keys are then replaced by 0, using the ISNULL function.
5. Notice that we don't populate the load_time with getdate(), but with
a variable named @current_load_time (see the sketch after these notes). This
variable is populated with getdate() at the beginning of the loading batch and
used by all processes in the batch. This is necessary so that in the event of
failure, we know the point in time we have to restart the process from.
6. If you use a snowflake schema rather than a star schema, on this step you
also need to look up the foreign keys on the child dimension table.
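For note 5, a minimal sketch of how @current_load_time could be set, once, at
the beginning of the batch:
DECLARE @current_load_time datetime
SET @current_load_time = GETDATE() -- set once, then reused by every step in the batch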
Step 3. Update The Dimension Table
UPDATE dim
SET dim.attribute1 = tmp.attribute1,
dim.attribute2 = tmp.attribute2,
dim.load_time = tmp.load_time
FROM dw.dbo.dimension dim
INNER JOIN #dim tmp
ON tmp.surrogate_key = dim.surrogate_key
WHERE tmp.surrogate_key <> 0
AND (dim.attribute1 <> tmp.attribute1 OR dim.attribute2 <> tmp.attribute2)
Temp table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
0 | 20 | B | BB | 17/10/2006 |
Dimension table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
Note:
1. Here we update the dimension table, based on the data on the temp table.
In this case attribute2 was updated from ‘AA’ to ‘AAA’.
2. We only update the rows where the temp table's surrogate key is not
0, i.e. the rows that already exist on the dimension table. The rows where the
surrogate key is 0 (which do not exist on the dimension table) will be inserted
into the dimension table later on.
3. Notice that when updating rows we update the load time column as well.
4. The last line (the attribute comparison) specifies which changes we want to
pick up, so that unchanged rows are not touched. If the attributes can be NULL,
wrap them in ISNULL before comparing, because NULL <> 'A' does not evaluate to true.
Step 4. Insert Into Dimension Table
INSERT INTO dw.dbo.dimension (natural_key, attribute1, attribute2, load_time)
SELECT natural_key, attribute1, attribute2, load_time
FROM #dim
WHERE surrogate_key = 0
Temp table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
0 | 20 | B | BB | 17/10/2006 |
Dimension table
surrogate_key | natural_key | attribute1 | attribute2 | load_time |
1 | 10 | A | AAA | 17/10/2006 |
2 | 20 | B | BB | 17/10/2006 |
Note:
1. Notice that we don't specify the surrogate key column when inserting, i.e.
we let the RDBMS handle it.
2. When creating / setting up the data warehouse, we set the surrogate key column
on the dimension table to identity(1,1).
In every step above we need to do error handling and logging. Error handling
is important because if the loading fails on any step, we need to be able to
recover from the failure. Logging is important to know exactly what happened
at each step, i.e. how many records were processed, etc.
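As an illustration, here is a hedged sketch of error handling and logging
wrapped around step 4; the log table etl.dbo.load_log and its columns are
assumptions, not an existing structure:
DECLARE @rows int
BEGIN TRY
  INSERT INTO dw.dbo.dimension (natural_key, attribute1, attribute2, load_time)
  SELECT natural_key, attribute1, attribute2, load_time
  FROM #dim
  WHERE surrogate_key = 0
  SET @rows = @@ROWCOUNT -- capture immediately, before another statement resets it
  INSERT INTO etl.dbo.load_log (step_name, rows_processed, log_time)
  VALUES ('step 4 insert', @rows, GETDATE())
END TRY
BEGIN CATCH
  INSERT INTO etl.dbo.load_log (step_name, error_message, log_time)
  VALUES ('step 4 insert', ERROR_MESSAGE(), GETDATE())
END CATCH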
At the end of the program, we should not forget to clean everything up, i.e.
drop the temp table(s) and follow the control protocol, e.g. set the process to
a complete state, etc.
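A minimal cleanup sketch (the batch control table and the @batch_id variable
are assumptions):
DROP TABLE #dim
UPDATE etl.dbo.batch_control
SET status = 'complete'
WHERE batch_id = @batch_id -- @batch_id assumed to be set at the start of the batch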
The above code shows how to do the upsert with native SQL, which is very
fast. But it is worth noting here that good dedicated ETL tools such as Informatica
and Data Integrator have facilities to do in-memory lookups, which also have very
good performance. Disk-based lookup is definitely not the way to go here,
as it is slow, mainly because there are a lot of SQL statements to execute,
i.e. one for each lookup, and each of these statements 'touches'
the disk, which is a costly operation.
Deletion
If a row is deleted in the source table, then before we delete the row in the
dimension table, we must check that the surrogate key of that deleted row is not
used in any of the fact tables. And if we use a snowflake schema, we should also
check any dimension tables that refer to this dimension table. If the surrogate
key is used anywhere else, then we cannot delete the row.
The above causes an ambiguity: if a dimension row exists in the data warehouse,
how can we tell whether it still exists in the source system? Because of that we
don't usually delete rows from dimension tables. Instead, we mark the row as
'inactive'. Some people call this a "soft delete". In
each dimension table, we have several standard columns at the end: 3 timestamp
columns (last updated time, created time, and deleted time) and 1 'IsActive'
flag column. It is this IsActive column that we update from 1 to 0 when the
row is deleted in the source system, as sketched below.
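A hedged sketch of the soft delete, assuming the flag and timestamp columns are
named IsActive and deleted_time:
UPDATE dim
SET dim.IsActive = 0, -- flag the row instead of deleting it
    dim.deleted_time = @current_load_time
FROM dw.dbo.dimension dim
WHERE dim.IsActive = 1
AND NOT EXISTS -- the natural key no longer exists in the source
  (SELECT 1 FROM stg.dbo.source src WHERE src.natural_key = dim.natural_key)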
DDS Dimension
The Dimensional Data Store (DDS) is a star schema (or snowflake) relational database
which is populated from the ODS. The DDS is then used to populate multidimensional
databases such as Cognos PowerPlay, Hyperion Essbase, Business Objects universes
or SQL Server Analysis Services. The DDS is also used by report programs (such as
Reporting Services) and dashboard programs. Excel BI Add On and ProClarity Desktop
also connect to the DDS to get dimensional data.
When we populate a dimension table in the DDS based on the data in the ODS,
we have 2 ways of doing it: a) empty and reload, or b) incremental. Empty and
reload is literally that: empty the DDS dimension table and then reload all
records from the ODS. Incremental means that we load only the changed data from
the ODS. The changed data is identified using the 3 timestamp columns
mentioned earlier.
When emptying a DDS table we should not use "delete from table",
i.e. delete all records. Instead, we need to use the "truncate table"
command. This applies to both SQL Server 2005 and Oracle 10g. Truncate is almost
instantaneous, because it is only minimally logged. Delete takes a lot longer
because every row is logged. In Teradata we do not have truncate; your best bet
is to use DELETE ... ALL. A MultiLoad delete step can also be used, but it could
be a bit slower. When you use the incremental approach, try to get the 3 timestamp
columns indexed if possible (see the sketch below), as this will significantly
reduce the loading time, especially if it is a large dimension.
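A minimal sketch of the incremental change detection, assuming the 3 timestamp
columns are named created_time, last_updated_time and deleted_time, and
@last_dds_load_time holds the time of the previous DDS load:
SELECT natural_key, attribute1, attribute2
FROM ods.dbo.dimension
WHERE created_time > @last_dds_load_time -- new rows
   OR last_updated_time > @last_dds_load_time -- changed rows
   OR deleted_time > @last_dds_load_time -- soft-deleted rows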
Generally speaking, truncate and reload is faster than incremental. If the
number of rows is below 10,000 and the dimension width (total bytes in a row)
is less than 1,000, truncate and reload is faster. If your dimension has more
than 10,000 rows and the width is more than 1,000 bytes, then there is a
significant chance that the incremental approach can beat truncate and reload
on performance.
Overlay
Consider a case in the airline industry, where you have a dimension called
crew, which contains the pilots, co-pilots, navigators and cabin crew. There
are 2 source systems that can be used to populate this dimension: flight information
system (FIS) and human resources information system (HRIS). The FIS has an air_crew
table, which is keyed on crew_id. The HRIS has an employee table, which is keyed
on employee_id. Fortunately, the air_crew table also contains employee_id. The
employee table is richer than the air_crew table, i.e. it contains 36 attributes
including title, name, address, employment band, benefits, employment start date,
normal working hours, salary, normal position, secondary position, rank, employment
status, date of birth, bank details, date created and date last modified. The
air_crew table, on the other hand, only contains title, name and positions plus
2 timestamp columns. The employee table only contains permanent employees (700
records); it does not contain non-employees or temps, whereas the air_crew
table contains all crew on every flight (1000 records), both temp and permanent.
The crew dimension in the data warehouse contains title, name, address, working
hours, positions, date of birth, employment status and rank, plus the standard
dimensional control columns. In this case we need to load from 2 sources. This
is called an overlay. How do you load with an overlay? Which table is the base
table and which one is the top-up? The employee table is richer by columns but
not complete by rows; air_crew is not rich but is complete by rows.
As a principle we need to take the table that is complete by rows as the base
table (primary), and the table that is richer by columns as the top-up table
(secondary). Here is how it works: load the base first, by using the upsert,
setting the missing attributes to their standard values or blank. After we have
the complete number of rows in the dimension table, we then load the secondary
table, by using update only, as sketched below. We don't do insert
or delete when loading the secondary, unless we are using SCD 2 – see below.
Don't forget to use ISNULL (if you use SQL Server) to change null into
0, a low value date or a blank string.
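A hedged sketch of the top-up update from the secondary source; the dim_crew
table, the employee_id join column and the attribute names are assumptions:
UPDATE dim
SET dim.address = ISNULL(emp.address, ''), -- change null to a blank string
    dim.date_of_birth = ISNULL(emp.date_of_birth, '19000101'), -- or a low value date
    dim.employment_status = ISNULL(emp.employment_status, ''),
    dim.load_time = @current_load_time
FROM dw.dbo.dim_crew dim
INNER JOIN hris.dbo.employee emp ON emp.employee_id = dim.employee_id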
SCD 2
Slowly Changing Dimension (SCD) type 2 is basically a technique to keep the
history of a dimension. This is done by marking the existing dimension record
as 'outdated' or 'not valid anymore' and then creating
a new record containing the new attributes, stamped as 'valid',
as illustrated below:
Before the change
surrogate_key | natural_key | attribute | effective_date | expired_date | most_recent |
1 | 10 | A | 1/1/1900 00:00:00 | 31/12/2200 00:00:00 | 1 |
After the change
surrogate_key | natural_key | attribute | effective_date | expired_date | most_recent |
1 | 10 | A | 1/1/1900 00:00:00 | 6/10/2006 10:21:19 | 0 |
2 | 10 | B | 6/10/2006 10:21:20 | 31/12/2200 00:00:00 | 1 |
The concept of SCD was first described by Ralph Kimball and Margy Ross in
their book The Data Warehouse Toolkit, chapter 4. They later wrote a few
articles on SCD: SCD Are Not Always As Easy As 123 (Margy Ross and Ralph
Kimball, March 2005), Both Historical and Current Perspectives (Margy Ross,
May 2004), Combining SCD Techniques (Margy Ross, October 2000) and Processing
Slowly Changing Dimensions During Initial Data Load (Lawrence Corr, June
2000). If SCD is new to you, it would help to familiarize yourself with the
concept by reading these materials.
Now let's talk about loading data into an SCD type 2 dimension table.
Each SCD 2 table should have effective date and expiry date columns. Otherwise,
if we were asked what the condition of the dimension was at a certain date in
the past, we could not answer it. Using these 2 date columns we know the
condition (attribute values) at any point in time.
When a dimension record is created, the Effective Date column is set to a low
value date (such as 1/1/1900) and the Expired Date is set to a high value date,
such as 31/12/2200. This indicates that the record is valid at any time,
i.e. there is no time limit. When we have an update, we set the Expired Date
column of the existing row to the current time. Then we insert a new row, setting
its Effective Date column to the current time and its Expired Date to the high
value date. This way, we explicitly declare the validity period of both the old
row and the new row. Please see the table above (the one labelled "After the
change") for an illustration.
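Using the example rows above, the expire-and-insert could look like this sketch:
UPDATE dw.dbo.dimension
SET expired_date = @current_load_time, -- close the validity period of the old row
    most_recent = 0
WHERE natural_key = 10
AND most_recent = 1

INSERT INTO dw.dbo.dimension (natural_key, attribute, effective_date, expired_date, most_recent)
VALUES (10, 'B', @current_load_time, '22001231', 1) -- new row valid from now to the high value date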
The Effective Date and the Expired Date are very useful, but unfortunately they
are slow to query. The most frequent use of an SCD 2 dimension is to support
fact table roll up with the current attribute values. To do it using Effective
Date and Expired Date, our query would be something like where Effective_Date <
@todays_date and Expired_Date > @todays_date. Hence the idea of a most_recent
column, because most queries are about the active row, not the history. The
most_recent flag is a binary or bit column which indicates which row is the
active one. Now our query becomes where most_recent = 1, which is a lot quicker
to execute, especially if the most_recent column is indexed. Well, the
most_recent column should be indexed anyway.
In practice SCD 2 is used in conjunction with SCD 1. Please read the article
by Ralph Kimball and Margy Ross titled "SCD Are Not Always As Easy As 123".
SCD 3
In Slowly Changing Dimension type 3 we keep the old attribute as a column.
Before the change
surrogate_key | natural_key | attribute1 | effective_date1 | attribute2 | effective_date2 |
1 | 10 | A | 26/12/2005 00:00:00 | | |
After the change
surrogate_key | natural_key | attribute1 | effective_date1 | attribute2 | effective_date2 |
1 | 10 | B | 6/10/2006 00:00:00 | A | 26/12/2005 00:00:00 |
The effective date columns help us determine the value of the attribute at
a specific point in time. Yes, we need an effective date column for each
attribute column. Yes, we can keep as many previous values as we like: for
example, if we want to keep 5 previous values, then we need 6 pairs of columns.
In practice this is a painful and slow process, both for updating and for
querying. Assuming we keep 5 previous values, for every updated row we need
to move attribute4 to attribute5 (overriding it), attribute3 to attribute4,
2 to 3, 1 to 2 and the new attribute value into attribute1 – and the effective
date columns too, as sketched below. See why it's slow?
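A hedged sketch of the shift for the simple 2-pair table above. Note that in
SQL Server the right-hand sides of an UPDATE are all evaluated against the
pre-update values, so the shift works in a single statement:
UPDATE dw.dbo.dimension
SET attribute2 = attribute1, -- move the current value into the history column
    effective_date2 = effective_date1,
    attribute1 = 'B', -- store the new value
    effective_date1 = @current_load_time
WHERE natural_key = 10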
The query is fast only if we want to know the current values, because the
current values are kept in the attribute1 column. But if we want to know the
value of the attribute at a particular point of time in the past, it is
complicated and slow. In practical implementations this structure is not
effective for answering questions like "give me the value of the attributes
on a certain date", or for rolling up historical facts to current attribute
values, so in practice SCD 3 is used in conjunction with SCD 2, as illustrated
by Margy Ross in her article Combining SCD Techniques (October 2000),
particularly her last paragraph as quoted here:
“We've described a hybrid approach that combines the three fundamental
SCD techniques. We're creating new rows to capture change (Type 2), adding attributes
to reflect an alternative view of the world (Type 3), which are overwritten
for all earlier dimension rows for a given product (Type 1). As a student recently
suggested, perhaps we should refer to this as Type 6 (2+3+1)…”
In the next part we will continue with loading the fact tables.
Vincent Rainardi
20th October 2006