If you’re a data warehouse developer, chances are you use T-SQL Merge statement to process slowly changing dimensions. If you’ve coded a couple of Merge statements, you might’ve noticed that most of that code is redundant. By saving source to target data map and other metadata before hand, you can automatically generate Merge statements to load slowly changing dimensions. The Merge that’s generated supports type0, type1, and type2 dimensions.
Let’s see it in action!
I’ve got a staging and a dimension table – a typical setup for implementing Merge pattern.
CREATE TABLE [stg].[Customer] ( ID NVARCHAR(15) NOT NULL , DOB [Date] NOT NULL , LName [nvarchar](50) NULL , Height INT NULL , Email NVARCHAR(255) NOT NULL , City VARCHAR(100) NOT NULL ) GO CREATE TABLE Customer.DimCustomer ( [CustomerKey] [int] IDENTITY(1, 1) NOT NULL , [CustomerIdNK] [nvarchar](25) NULL , [DateofBirth] DATE NOT NULL , [CustomerLastName] NVARCHAR(50) NOT NULL , [CustomerHeight] INT NULL , [EmailAddress] NVARCHAR(255) NULL , [CustomerCity] VARCHAR(100) NOT NULL , [IsRowActive] BIT NOT NULL , [RowStartDate] DATETIME NOT NULL , [RowEndDate] DATETIME NULL , CONSTRAINT [PK_DimCustomer_CustomerKey] PRIMARY KEY CLUSTERED ( [CustomerKey] ASC ) , CONSTRAINT [NK_DimCustomer_CustomerIdNK_RowStartDate] UNIQUE NONCLUSTERED ( [CustomerIdNK], [RowStartDate] ASC ) ) GO
Data Map
Now I create another table called DimDataMap to store source (staging table) to target (dimension table) mapping, in addition to other metadata that determines the join conditions and dimension type needed to build the Merge statement.
CREATE TABLE [devtools].[DimDataMap] ( [DimDataMapID] [int] IDENTITY(1, 1)NOT NULL , [DstSchema] [varchar](15) NULL , [DstTable] [varchar](50) NULL , [DstColumn] [varchar](50) NULL , [SrcSchema] [varchar](15) NULL , [SrcTable] [varchar](50) NULL , [SrcColumn] [varchar](50) NULL , [ColumnType] [varchar](10) NULL , [SortId] [int] NULL , [UpdatedDt] [datetime] NULL , [UpdatedBy] [varchar](50) NULL , CONSTRAINT [PK_DimDataMap_DimDataMapID] PRIMARY KEY ( [DimDataMapID] ) , CONSTRAINT [U_DimDataMap_DstSchema_DstTable_DstColumn] UNIQUE ( [DstSchema], [DstTable], [DstColumn] ) ); ALTER TABLE [devtools].[DimDataMap] ADD DEFAULT (GETDATE()) FOR [UpdatedDt]; ALTER TABLE [devtools].[DimDataMap] ADD DEFAULT (SUSER_SNAME()) FOR [UpdatedBy]; ALTER TABLE [devtools].[DimDataMap] WITH CHECK ADD CHECK (([ColumnType]='Type2' OR [ColumnType]='Type1' OR [ColumnType]='Type0' OR [ColumnType]='HK' OR [ColumnType]='SK' OR [ColumnType]='NK'));
The columns, DstSchema, DstTable, DstColumn and SrcSchema, SrcTable, SrcColumn simply represent mapping between the staging and the dimension table.
It’s important to discuss the ColumnType column. As the name indicates, it’s used to define the type of the dimension column. I, Like any other developer, love Where clauses and Check constraints. So, I defined a Check constraint on ColumnType to make sure you don’t go crazy and throw whatever you want. Only a certain values are allowed:
- Type2, if the dimension column is a type2 attribute
- Type1, if the dimension column is a type1 attribute
- Type0, if the dimension column is a type0 attribute
- HK, if the dimension column is a house keeping attribute (columns like IsRowActive, RowStartDate, RowEndDate that are used to maintain history
- SK, if the dimension column is a surrogate key
- NK, if the dimension column is a natural or business key
If you’re still unclear about what’s stored in the data map, take a look at the following table. It contains metadata for the staging and dimension table we created earlier.
You’re being warned
Let me make a few things clear here: there’s no such things as a free lunch! Inserting data to this data map table and maintaining it is hard. Your fingers will give you a death stare from all the typing. You’ll even let a few cuss words slip while writing the insert statements. If you’re are working with one or two dimensions, you’re better off writing the Merge statement directly instead of attempting to automate it, but we never have only one dimension in a project. Moreover existing dimensions often change and we often alter the Merge to account for the changes. In that case, automating gives best bang for yo buck.
usp_WriteMerge — a.k.a make magic happen
We’ve done a lot of hard work so far. We created sample staging and dimension tables, created a DimDataMap, inserted source to target mapping and other metadata about the type of the column. We’re ready to generate the Merge statement.
devtools.usp_WriteMerge is a stored procedure that takes the dimension table as input, reads all the information it needs from the data map and outputs the Merge statement.
EXEC devtools.usp_WriteMerge @DstSchema = 'customer', @DstTable = 'DimCustomer';
Here’s the result. You can copy and execute this to create the load stored procedure.
Download from CodePlex
Code used for this demo including the more important DimDataMap (data map table) and usp_WriteMerge (actual SP that generates Merge statement) is available for download from CodePlex. Go to http://tsqlmergedimension.codeplex.com/SourceControl/latest and click on the download button.
The download is a sql file that creates a brand new database, schemas and tables for this demo. To use this in your environment, you need to create the objects from devtools schema in the database of your choice. There are also a few limitations you need to be aware of that are included in the comments of usp_WriteMerge. Make sure to check that for more information.
I hope you’ll find this script useful. If you do, feel free to send a million bucks my way or leave a comment. Whichever is easier for you!