I wouldn’t class myself as an expert in SSIS but I certainly know my way around but came across something today which I thought I’d share. As with a lot of things there are “many ways to skin a cat”, none of which is something I’ll go into at the moment but what i will concentrate on is updating columns in a table where the data has changed in the source.
One of the projects I’m currently working on requires this very process and when i set about doing so I created the T-SQL Merge statement to do the business. However, the question was raised as to why I didn’t use SSIS’s built in component Slowly Changing Dimension (SCD)? I didn’t really have an answer other than personal preference but decided to delve into it a bit further and compare the performance of each method.
As a test, I created a source table with a Key and Name column:
USE TempDB; IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.iSource') AND type in (N'U')) DROP TABLE dbo.iSource; CREATE TABLE dbo.iSource ( ID INT, Name varchar(100) ); IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.iTarget') AND type in (N'U')) DROP TABLE dbo.iTarget; CREATE TABLE dbo.iTarget ( ID INT, Name varchar(100) );
and populated it with some dummy data:
INSERT INTO dbo.iSource (ID,Name) SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY t.object_id) AS rownumber ,'Name_'+convert(varchar(4),ROW_NUMBER() OVER (ORDER BY t.object_id)) FROM sys.tables t CROSS JOIN sys.stats s; INSERT INTO dbo.iTarget (ID,Name) SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY t.object_id DESC) AS rownumber --Done in descending order ,'Name_'+convert(varchar(4),ROW_NUMBER() OVER (ORDER BY t.object_id)) FROM sys.tables t CROSS JOIN sys.stats s; SELECT ID, Name FROM iSource; SELECT ID, Name FROM iTarget;
So we now have a source and target table with different Names and we’ll look to update the iTarget table with the information coming from iSource.
Method 1 – MERGE Statement
MERGE dbo.iTarget AS target USING ( SELECT ID, Name FROM dbo.iSource ) AS source (ID, Name) ON (target.ID = source.ID) WHEN MATCHED AND target.Name <> source.Name THEN UPDATE SET Name = source.Name WHEN NOT MATCHED THEN INSERT (ID, Name) VALUES (source.ID, source.Name);
Using this method simply in SSMS for simplicity, profiler output 2 rows for Batch Starting and Batch Completing, CPUTime of 125ms and Duration of 125ms and it updated 6678 records. Top stuff, as expected.
Method 2 – SSIS SCD Component
I rebuilt the tables to put them back to where we started and set about creating the same thing in SCD setting ID as the business key and Name as the changing attribute and not setting inferred members, below is a screen dump of the outcome of this:
I clear down the profiler and run the ssis package and the outcome is quite astounding.
The profiler output 13456 rows including 6678 rows of queries like this:
exec sp_executesql N'SELECT [ID], [Name] FROM [dbo].[iTarget] WHERE ([ID]=@P1)',N'@P1 int',8
as well as 6678 rows of queries similar to this:
exec sp_execute 1,'Name_3304',3304
Total Duration of 37 seconds (yes that’s seconds not ms!!)…….and this is on a table of only ~7k rows!
Well I’ll be damned, the SCD basically runs a cursor looping each record checking for a match on ID and updating that record if so. I can’t actually believe that MS have built a component which performs in this way.
So, to answer the question asked ” why I didn’t use SSIS’s built in component Slowly Changing Dimension (SCD)?”, I now have a definitive answer, it doesn’t perform!
I’m sure SCD has its place but for me, the requirements and the datasets I’m working on I think I’ll stick with MERGE for now….. 🙂
NOTE: This was done on SQL Server 2008R2 Developer Edition running on Windows 7 Ultimate, not sure if SQL Server 2012 has improved the SCD performance but I’ll leave that for another day.