Generating Surrogate-Keys for Type 1 and Type 2 dimensions using SSIS
Surrogate-keys are essential to data warehousing. Their performance advantages, along with their usefulness in deploying slowly changing dimensions, are well documented in many industry-standard books. Yet implementing them brings complexities, such as maintaining referential integrity without slowing the surrogate-key replacement pipeline and correctly sequencing surrogate-keys for data arriving at irregular intervals. These challenges can impede data warehouse availability timelines.
SQL Server Integration Services (SSIS) provides a framework for efficiently developing such essential data-warehousing ETL mechanisms. This article presents an approach for generating surrogate-keys for Type 1 and Type 2 slowly changing dimensions, and for implementing a scalable, high-throughput surrogate-key replacement pipeline for an enterprise data warehouse (EDW) that takes advantage of enterprise-class hardware for highly parallelized, large-scale loads.
Prerequisites
Introductory knowledge of fact tables, dimensions
and star schema is useful.
What is a Type 1 or Type 2 dimension?
As described in popular data warehousing literature, a Type 1 dimension is the most effective way of presenting the AS-IS view of the business in a data warehouse, but it does not maintain any history of dimensional changes. A Type 2 dimension, on the other hand, tracks history: it provides not only the AS-IS view but also the AS-WAS view of the business for any point in time. Additional information on slowly changing dimensions is available at: http://www.dbmsmag.com/9604d05.html
What are Surrogate-Keys?
A standard data warehouse implementation generates surrogate-keys for all members of its dimensions. Natural keys or codes coming from a source OLTP system can be quite large and/or inconsistent. Data warehouse architects like to replace these source keys with a compact four-byte integer to ensure consistency and reduce the size of the tables. These keys also help in the creation of Type 1 and Type 2 slowly changing dimensions.
Process Overview
The process involves generating unique surrogate-keys for every dimension member in the data warehouse, managing referential integrity, and then constructing a surrogate-key pipeline that replaces the natural keys in all incoming fact data with the newly generated surrogate-keys. Additional information on surrogate-keys is available at: http://www.dbmsmag.com/9806d05.html
The techniques in this article tie together industry-standard concepts and ideas from popular books, blogs, and articles with the technological features of SSIS 2005. The example used here has been simplified for easy understanding and presented in a granular way so that it can be adapted to complex, large-scale objectives.
The Slowly Changing Dimension task (wizard) that comes with SSIS does not generate surrogate-keys. Using an identity column is also not a scalable approach for generating surrogate-keys because it requires constant reading from and writing to the database. To build a scalable solution, we need to read from the database just once to get the highest existing surrogate-key value, and then programmatically generate new keys without having to read from the database again. The following link serves as a reference for this concept: http://sqljunkies.com/WebLog/sqlbi/archive/2005/05/30/15684.aspx
The tables below show example
columns for Type 1 and Type 2 dimension tables. These tables contain one
natural key column (CODE), one surrogate-key column (SK) and one attribute column
(DESC). The Type 2 dimension table additionally contains a creation timestamp
(TS) and a current record identifier (CURR).
Figure 1: Sample Type 1 dimension
Figure 2: Sample Type 2 dimension
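For reference, the sample tables could be created roughly as follows; the column data types and lengths are assumptions for illustration only:

-- Sketch of the sample dimension tables described above (data types are assumed)
CREATE TABLE dbo.DIM_dimensionname_Type1
(
      dimensionname_SK    INT          NOT NULL   -- surrogate-key
    , dimensionname_CODE  VARCHAR(20)  NOT NULL   -- natural key from the source system
    , dimensionname_DESC  VARCHAR(50)  NOT NULL   -- attribute
)

CREATE TABLE dbo.DIM_dimensionname_Type2
(
      dimensionname_SK    INT          NOT NULL   -- surrogate-key
    , dimensionname_CODE  VARCHAR(20)  NOT NULL   -- natural key from the source system
    , dimensionname_DESC  VARCHAR(50)  NOT NULL   -- attribute
    , dimensionname_TS    DATETIME     NOT NULL   -- creation timestamp
    , dimensionname_CURR  BIT          NOT NULL   -- current record identifier (1 = current)
)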
Note that the dimension member with code C has multiple records in the Type 2 dimension table. Every time a member incurs a change, a new record is created to track the history of changes. The Type 1 dimension, in contrast, keeps only the latest change. Also note that the dimension member with code Z in Figure 2 was at one point unknown (see SK = 6). This means the code was first encountered in the fact data before it showed up in the definition lists feeding the dimension tables. Code Z was therefore treated as an Early Fact and its attributes were assigned an unknown status. The correct attribute definitions were applied in a later ETL run once they became available (see SK = 8).
To implement the surrogate-key mechanism, three Data Flow tasks are created in the Control Flow: one for updating attribute changes, one for inserting new dimension members, and one for inserting early facts. For a rapidly changing dimension, the updates should precede the inserts; for a relatively slowly changing dimension, running the inserts before the updates will prove quicker.
Figure 3: Data Flow Task 1: Update Attrib Changes
The attribute updating process starts by querying the source (staging) data for the dimensions, filtering out the existing dimension members through a Lookup task based on the natural keys, and then writing the list of new (non-existing) dimension members to a Raw File. These Raw Files should be placed on high-performance storage subsystems since they require fast I/O.
From the existing dimension members, a subsequent Lookup task picks only the dimension records where attribute changes have occurred and feeds them to an OLE DB Command task that updates the dimension table with the changes (see Figure 3).
The lookups for attribute changes can be further optimized by comparing Checksum values instead of performing column-to-column comparisons, as sketched below.
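For instance, a single checksum over the attribute columns could be exposed in the lookup query using T-SQL's CHECKSUM function and compared against a checksum computed on the staging side. This is only a sketch; with the single DESC attribute in this example the gain is small, but with many attributes one integer comparison replaces many column comparisons.

-- Sketch: expose one CHECKSUM value per member instead of the individual attribute columns
SELECT
      dimensionname_CODE
    , CHECKSUM(dimensionname_DESC) AS dimensionname_CHKSUM
FROM dbo.DIM_dimensionname_Type1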
The OLE DB Command transform uses a Native OLE DB\SQL Native Client connection to update the attributes. The following SQL command updates the dimension data, taking the changed attributes as parameters.
UPDATE [dbo].[DIM_dimensionname_Type1]
SET [dimensionname_DESC] = ?
WHERE [dimensionname_CODE] = ?
A clustered index on the dimension tables can help such update tasks if used tactically and dropped during the insert tasks, as sketched below.
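For example (the index name is illustrative), the index could be built on the natural key before the update task runs and dropped ahead of the bulk inserts in the later data flow tasks:

-- Hypothetical index name; a clustered index on the natural key speeds the parameterized updates
CREATE CLUSTERED INDEX IX_DIM_dimensionname_Type1_CODE
    ON dbo.DIM_dimensionname_Type1 (dimensionname_CODE)

-- Drop it before the bulk inserts and recreate it afterwards
DROP INDEX IX_DIM_dimensionname_Type1_CODE
    ON dbo.DIM_dimensionname_Type1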
Figure 4: Data Flow Task 2: Insert New Dims
The previous data flow task (Task 1) has already detected the new dimension members and written them to a Raw File, eliminating the need to repeat the lookup.
These new dimension members stored in the Raw File (see Figure 4) are now passed through a Script task that generates the surrogate-keys, and are then inserted into the dimension table via an OLE DB Destination task. The insert is sped up by not checking constraints on the destination.
A sample script for generating the surrogate-keys for new dimension members is displayed below. Note that an internal connection is formed from within the script, through the .NET Provider\SqlClient Data Provider, to retrieve the highest existing surrogate-key value (see Sub AcquireConnections and Sub PreExecute in the script).
Script Transform Script

Imports System
Imports System.Data
Imports System.Data.SqlClient
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

Public Class ScriptMain
    Inherits UserComponent

    ' Declare the variables
    Dim Counter As Integer = 0
    Dim connMgr As IDTSConnectionManager90
    Dim sqlConn As SqlConnection

    ' Form the .NET database connection (PLAYDWNET is the package's ADO.NET connection manager)
    Public Overrides Sub AcquireConnections(ByVal Transaction As Object)
        connMgr = Me.Connections.PLAYDWNET
        sqlConn = CType(connMgr.AcquireConnection(Nothing), SqlConnection)
    End Sub

    ' PreExecute: fetch the highest existing surrogate-key and use it to seed the counter
    Public Overrides Sub PreExecute()
        MyBase.PreExecute()
        Dim sqlComm As New SqlCommand("select max(dimensionname_SK) as LAST_SK from DIM_dimensionname_Type1", sqlConn)
        Dim r As SqlDataReader = sqlComm.ExecuteReader()
        While r.Read()
            ' Guard against an empty dimension table, where MAX() returns NULL
            If Not IsDBNull(r("LAST_SK")) Then
                Counter = CInt(r("LAST_SK"))
            End If
        End While
        r.Close()
    End Sub

    ' For every incoming row, increment the counter and assign it as the new surrogate-key
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Counter = Counter + 1
        Row.dimensionnameSK = Counter
    End Sub

End Class
Variations for the Type 2 Dimension
In contrast to a Type 1 dimension, a Type 2 dimension inserts a new record for each attribute change, retaining the history records instead of updating them. The current record identifier of each existing dimension record where a change took place is set to false (0), and the same dimension member is inserted again with a new surrogate-key, the changed attributes, the creation timestamp, and a current record identifier of true (1).
Prior to the lookup, the existing dimension members are filtered down to only the current records using the query below, which economizes the operation on a multi-million-member Type 2 dimension.
SELECT
dimensionname_CODE
, dimensionname_DESC
, dimensionname_CURR
FROM DIM_dimensionname_Type2
WHERE dimensionname_CURR = 1
The resulting list is used to
identify the new and the changed members, and accordingly, the current record identifiers
for the expiring records are updated to 0 via the following statement:
UPDATE [dbo].[DIM_dimensionname_Type2]
SET [dimensionname_CURR] = 0 WHERE [dimensionname_CODE] = ?
New surrogate-keys are generated, the creation timestamps are set to the current time, and the current record identifiers are set to 1 using the following snippet:
Script Transform Script

' Inside Input0_ProcessInputRow: assign the next surrogate-key and mark the new record as current
Counter = Counter + 1
Row.dimensionnameSK = Counter
Row.dimensionnameCURR = 1
Row.dimensionnameTS = Now()
Figure 5: Data Flow Task 3: Insert
Unknown for Early Fact
The incoming fact records typically land in staging tables or in flat files, depending on the architecture. A GROUP BY operation on the fact records yields a unified list of the incoming dimension members. This list can then be used in a lookup to identify and handle any dimension members that are still unknown to the dimension tables and could otherwise cause referential integrity problems.
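A minimal sketch of such a query follows, assuming the staged fact data sits in a table named dbo.STG_fact (an illustrative name):

-- Distinct natural keys present in the incoming fact data; this list feeds the lookup
-- that detects members not yet present in the dimension table
SELECT dimensionname_CODE
FROM dbo.STG_fact
GROUP BY dimensionname_CODE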
Script Transform Script

' Inside Input0_ProcessInputRow: assign the next surrogate-key and default the unknown member's attributes
Counter = Counter + 1
Row.dimensionnameSK = Counter
Row.dimensionnameDESC = "Unknown"
Row.dimensionnameCURR = 1
Row.dimensionnameTS = Now()
There are three options for handling referential integrity:
- During extraction, using an outer join with the dimension tables (see the sketch below);
- Before insert, using the Aggregate transform to obtain a list of early facts;
- During insert, using a special lookup table updated in cache to track the unknowns.
This article follows the second approach, which incurs the overhead of an aggregation (GROUP BY) prior to the pipeline but has the advantage of simplicity later.
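For comparison, the first option could be handled during extraction with a query along these lines (the staging table name dbo.STG_fact is again illustrative):

-- Fact rows whose natural key is not yet in the dimension table (early facts)
SELECT DISTINCT f.dimensionname_CODE
FROM dbo.STG_fact AS f
LEFT OUTER JOIN dbo.DIM_dimensionname_Type2 AS d
    ON f.dimensionname_CODE = d.dimensionname_CODE
WHERE d.dimensionname_CODE IS NULL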
Even if there are multiple unknown natural keys at one time, a separate surrogate-key is assigned to each of them so that they can be differentiated later. Blanks and nulls must be explicitly handled in a dimensional design by assigning a valid description such as "Blank" and ensuring that it accounts for the business rules or anomalies in the transactional systems that generated the blank values in the first place.
How to Use the Type 2 Dimension
Subselect SQL Query
AS-IS view of the data:
The following query returns all the current (active) records.
select
dimensionname_SK
, dimensionname_CODE
, dimensionname_DESC
from dbo.DIM_dimensionname_Type2
where dimensionname_CURR = 1
order by dimensionname_CODE
AS-WAS view of the data:
The following query returns all records that were valid as of a particular date:
with Tempquery (WAS_SK, MAX_TS) as
(
    select
          max(dimensionname_SK) as WAS_SK
        , max(dimensionname_TS) as MAX_TS
    from dbo.DIM_dimensionname_Type2
    where dimensionname_TS <= '2005-11-19'
    group by dimensionname_CODE
)
select
      dimensionname_SK
    , dimensionname_CODE
    , dimensionname_DESC
from dbo.DIM_dimensionname_Type2
where dimensionname_SK in (select WAS_SK from Tempquery)
order by dimensionname_CODE
More information on using Type 2 dimensions is provided at the following link:
http://www.kimballgroup.com/html/designtipsPDF/DesignTips2000%20/KimballDT8Perfectly.pdf
Further Enriching the Type 2 Dimension
Other columns can be added to enrich the Type 2 dimension:
- Expiration Date: This article uses only the creation date as an example. Storing an expiration date as well can simplify AS-WAS queries (see the sketch after this list).
- Change Version Number: A version number can also be assigned to each dimension record to quickly track how many times a member has been versioned, starting from 0.
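As a sketch, with an added expiration timestamp column (here assumed to be named dimensionname_EXP_TS and left NULL for current records), the AS-WAS query shown earlier reduces to a simple range predicate:

-- AS-WAS view as of a given date, using creation and expiration timestamps
select
      dimensionname_SK
    , dimensionname_CODE
    , dimensionname_DESC
from dbo.DIM_dimensionname_Type2
where dimensionname_TS <= '2005-11-19'
  and (dimensionname_EXP_TS > '2005-11-19' or dimensionname_EXP_TS is null)
order by dimensionname_CODE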
Figure 6: Surrogate-Key Pipeline
In the pipeline shown in Figure 6, the surrogate-key replacements on each fact data row for every conformed dimension occur uninterruptedly. As soon as a row gets its replacement from one lookup and moves on to the next, the subsequent row follows suit. A cluster of rows is written to disk only after those rows go through their surrogate-key replacements in memory. The dimension lookup tables are cached for faster access. Moreover, both the incoming fact data and the lookup tables can be presorted for efficient replacements.
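Conceptually, each cached lookup in the pipeline performs the in-memory equivalent of a join like the one below for its dimension; the fact staging table and measure column names are illustrative:

-- In-memory equivalent of one lookup step: swap the natural key for the surrogate-key
select
      d.dimensionname_SK
    , f.SALES_AMT           -- illustrative measure column
from dbo.STG_fact as f
inner join dbo.DIM_dimensionname_Type1 as d
    on f.dimensionname_CODE = d.dimensionname_CODE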
We can even have multiple surrogate-key replacement pipelines running in parallel against the conformed dimensions to meet an aggressive service level agreement. Since referential integrity has already been resolved, the replacement process is very efficient.
Conclusion
This article covered an uncomplicated yet highly scalable method of generating surrogate-keys for Type 1 and Type 2 dimensions, and of replacing the natural keys in the fact data with the appropriate surrogate-keys through a high-performance surrogate-key pipeline. SSIS provides a gracefully scalable framework for implementing such data warehousing procedures, one that can take advantage of multi-processor servers to scale up.