Refresh data warehouse with production data

  • Hello SSC,

    I want to create a package that refreshes the tables in my data warehouse with production (live) data. I have an SSIS package that simply TRUNCATES the tables and then reloads them (I did not write this package). I am not a fan of this method because we are reloading archived data for no reason. CDC is not an option at this time, and neither is REPLICATION.

    So I was thinking of using a MERGE in SSIS, but I have over 200 tables, so this package will be massive. Is there a more efficient way of handling this, or should I just go with the MERGE in the SSIS package? And if so, should I break the package up into pieces, or keep it all in one package?

    As always, thank you SSC for all your guidance!

    Your friend,

    Lord Slaagh

    There are no problems, only solutions. --John Lennon

  • Do you have audit columns like LastModifiedTime or something similar that you could use to help the MERGE? I've done some data marts and smaller data warehouses using MERGE or UPSERT script methods like the one you describe, and compared LastModifiedTime in the WHEN MATCHED clause so that it only UPDATEs the rows that have actually changed. If you don't have a timestamp or version column to compare, you may have to do more rigorous column-by-column comparisons.
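
    Something along these lines, with made-up names (a dw.Customer warehouse table fed from a hypothetical production database called Prod):

    MERGE INTO dw.Customer AS t
    USING Prod.dbo.Customer AS s
       ON t.CustomerID = s.CustomerID
    WHEN MATCHED AND s.LastModifiedTime > t.LastModifiedTime THEN  -- only rows that have changed
       UPDATE SET CustomerName     = s.CustomerName,
                  LastModifiedTime = s.LastModifiedTime
    WHEN NOT MATCHED BY TARGET THEN                                -- rows new in production
       INSERT (CustomerID, CustomerName, LastModifiedTime)
       VALUES (s.CustomerID, s.CustomerName, s.LastModifiedTime)
    WHEN NOT MATCHED BY SOURCE THEN                                -- rows deleted in production
       DELETE;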


  • Hi Chris,

    Thank you for your quick response. Yes, we have all of that: LastModified, StartDate, EndDate, etc. It seems that this is the best method. I am in the process of migrating to a new DW, from SQL Server 2012 to 2016, so I want to do this right. I could easily just keep the packages and change the data sources, but I cannot condone reloading archived data for no logical business reason.

    Thank you again!

    There are no problems, only solutions. --John Lennon

  • Here's a script I use to generate a MERGE statement for such a case.  Run this in the source database, and just change the first 3 lines to indicate the schema, table, and source database for the merge.  In my case I kept the same table name and called the target schema ODS instead of dbo:

    DECLARE @SchemaName nvarchar(128) = N'dbo'
    DECLARE @TableName nvarchar(128) = N'TableName'
    DECLARE @SourceDatabase nvarchar(128) = N'[ServerName].[DatabaseName]'
    DECLARE @ObjectID int = OBJECT_ID(@SchemaName + '.' + @TableName)
    DECLARE @HasIdentity bit
    DECLARE @MaxColumnID int

    SET NOCOUNT ON

    -- if the table has an identity column, we'll want to use SET IDENTITY_INSERT
    SELECT @HasIdentity = CASE WHEN EXISTS(SELECT NULL FROM sys.columns WHERE object_id = @ObjectID AND is_identity = 1) THEN 1 ELSE 0 END

    -- get the list of primary key fields
    DECLARE @PKColumns TABLE (index_column_id int, column_name nvarchar(128), column_id int)

    INSERT INTO @PKColumns (index_column_id, column_name, column_id)
    SELECT ic.index_column_id, c.name, c.column_id
    FROM sys.columns c
    INNER JOIN sys.index_columns ic ON c.object_id = ic.object_id AND c.column_id = ic.column_id
    INNER JOIN sys.indexes i ON i.object_id = ic.object_id AND i.index_id = ic.index_id
    WHERE ic.object_id = @ObjectID
    AND i.is_primary_key = 1

    -- last column in table, to simplify script writing
    SELECT @MaxColumnID = MAX(column_id)
    FROM sys.columns
    WHERE object_id = @ObjectID

    -- last non-key column, so the UPDATE SET list doesn't end with a trailing comma
    DECLARE @MaxNonPKColumnID int

    SELECT @MaxNonPKColumnID = MAX(column_id)
    FROM sys.columns
    WHERE object_id = @ObjectID
    AND column_id NOT IN (SELECT column_id FROM @PKColumns)

    -- create the actual merge script
    SELECT '/* ' + @TableName + ' */', 0
    UNION ALL
    SELECT 'DECLARE' AS sql, 1
    UNION ALL
    SELECT ' @row_count int;', 2
    UNION ALL
    SELECT '', 3
    UNION ALL
    SELECT 'MERGE INTO [ODS].[' + @TableName + '] AS t', 5
    UNION ALL
    SELECT ' USING ' + @SourceDatabase + '.[' + @SchemaName + '].[' + @TableName + '] AS s', 6
    UNION ALL
    SELECT CASE WHEN index_column_id = 1
    THEN ' ON t.[' + column_name + '] = s.[' + column_name + ']'
    ELSE ' AND t.[' + column_name + '] = s.[' + column_name + ']'
    END, 6 + index_column_id
    FROM @PKColumns
    UNION ALL
    SELECT ' WHEN MATCHED AND s.[LastChangeDateTime] > t.[LastChangeDateTime] THEN', 99
    UNION ALL
    SELECT ' UPDATE SET', 100
    UNION ALL
    SELECT ' [' + c.name + '] = s.[' + c.name + CASE WHEN c.column_id = @MaxNonPKColumnID THEN ']' ELSE '],' END,
    100 + c.column_id AS sort
    FROM sys.columns c
    WHERE c.object_id = @ObjectID
    AND c.column_id NOT IN (SELECT column_id FROM @PKColumns)
    UNION ALL
    SELECT ' WHEN NOT MATCHED BY TARGET THEN', 999
    UNION ALL
    SELECT ' INSERT (', 1000
    UNION ALL
    SELECT ' [' + c.name + CASE WHEN c.column_id = @MaxColumnID THEN '])' ELSE '],' END,
    1000 + c.column_id AS sort
    FROM sys.columns c
    WHERE c.object_id = @ObjectID
    UNION ALL
    SELECT ' VALUES (', 2000
    UNION ALL
    SELECT ' s.[' + c.name + CASE WHEN c.column_id = @MaxColumnID THEN '])' ELSE '],' END,
    2000 + c.column_id AS sort
    FROM sys.columns c
    WHERE c.object_id = @ObjectID
    UNION ALL
    SELECT ' WHEN NOT MATCHED BY SOURCE THEN', 2999
    UNION ALL
    SELECT ' DELETE;', 3000
    UNION ALL
    SELECT '', 4000
    UNION ALL
    SELECT 'SET @row_count = @@rowcount;', 4001
    UNION ALL
    SELECT '', 4002
    UNION ALL
    SELECT 'INSERT INTO dbo.ODSLoadHistory', 4003
    UNION ALL
    SELECT ' (TableName, LoadTime, IsFullLoad, CountRows)', 4004
    UNION ALL
    SELECT ' VALUES', 4005
    UNION ALL
    SELECT ' (''' + @TableName + ''',GetDate(), 0, @row_count);', 4006
    UNION ALL
    SELECT 'GO', 4007
    ORDER BY 2
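
    For illustration, here's roughly the kind of script the generator produces for a hypothetical Customer table (CustomerID primary key, plus CustomerName and LastChangeDateTime columns); ServerName and DatabaseName are just the placeholders from the variables above, and dbo.ODSLoadHistory is the logging table the script writes to:

    /* Customer */
    DECLARE
     @row_count int;

    MERGE INTO [ODS].[Customer] AS t
     USING [ServerName].[DatabaseName].[dbo].[Customer] AS s
     ON t.[CustomerID] = s.[CustomerID]
     WHEN MATCHED AND s.[LastChangeDateTime] > t.[LastChangeDateTime] THEN
     UPDATE SET
     [CustomerName] = s.[CustomerName],
     [LastChangeDateTime] = s.[LastChangeDateTime]
     WHEN NOT MATCHED BY TARGET THEN
     INSERT (
     [CustomerID],
     [CustomerName],
     [LastChangeDateTime])
     VALUES (
     s.[CustomerID],
     s.[CustomerName],
     s.[LastChangeDateTime])
     WHEN NOT MATCHED BY SOURCE THEN
     DELETE;

    SET @row_count = @@rowcount;

    INSERT INTO dbo.ODSLoadHistory
     (TableName, LoadTime, IsFullLoad, CountRows)
     VALUES
     ('Customer', GetDate(), 0, @row_count);
    GO

    For a couple hundred tables you could run the generator once per table, or wrap it in a loop over sys.tables, and collect the output into a single deployment script.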
  • Awesome, thank you!

    There are no problems, only solutions. --John Lennon

  • I would suggest that you bring over all the records added or modified since the last run, and then load the changes/additions from there. I would also suggest you avoid a MERGE statement, and I'm not the only one to suggest this. I personally run an UPDATE for the changed records followed by an INSERT of the new ones. The method works well and has proven to be very fast.
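
    A minimal sketch of that approach, assuming a hypothetical dw.Customer target, a stg.Customer staging table holding only the rows added or changed since the last run, and a LastModifiedTime audit column:

    -- update existing rows whose source version is newer
    UPDATE t
    SET    t.CustomerName     = s.CustomerName,
           t.LastModifiedTime = s.LastModifiedTime
    FROM   dw.Customer AS t
           INNER JOIN stg.Customer AS s
                   ON s.CustomerID = t.CustomerID
    WHERE  s.LastModifiedTime > t.LastModifiedTime;

    -- insert rows that don't exist in the target yet
    INSERT INTO dw.Customer (CustomerID, CustomerName, LastModifiedTime)
    SELECT s.CustomerID, s.CustomerName, s.LastModifiedTime
    FROM   stg.Customer AS s
    WHERE  NOT EXISTS (SELECT 1 FROM dw.Customer AS t
                       WHERE t.CustomerID = s.CustomerID);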
