Refresh data warehouse with production data

  • Hello SSC,

    I want to create a package that refreshes the tables in my data warehouse with production (live) data. I have an SSIS package that simply TRUNCATES the tables and then reloads them (I did not write this package). I am not a fan of this method because we are reloading archived data for no reason. CDC is not an option at this time, and neither is REPLICATION.

    So I was thinking of using a MERGE in SSIS, but I have over 200 tables, so this package will be massive. Is there a more efficient way of handling this, or should I just go with the MERGE in the SSIS package? And if so, should I break the package up into pieces, or keep it all in one package?

    As always, thank you SSC for all your guidance!

    Your friend,

    Lord Slaagh

    There are no problems, only solutions. --John Lennon

  • Do you have audit columns like LastModifiedTime or something similar that you could use to help the MERGE? I've done some data marts and smaller data warehouses using MERGE or UPSERT script methods like the one you describe, and compared LastModifiedTime in the WHEN MATCHED clause so that it only UPDATEs the rows that have actually changed. If you don't have a timestamp or version column to compare, you may have to do more rigorous column-by-column comparisons.
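
    Something along these lines, with made-up names (a dw.Customer warehouse table fed from a hypothetical production database called Prod):

    MERGE INTO dw.Customer AS t
    USING Prod.dbo.Customer AS s
       ON t.CustomerID = s.CustomerID
    WHEN MATCHED AND s.LastModifiedTime > t.LastModifiedTime THEN  -- only rows that have changed
       UPDATE SET CustomerName     = s.CustomerName,
                  LastModifiedTime = s.LastModifiedTime
    WHEN NOT MATCHED BY TARGET THEN                                -- rows new in production
       INSERT (CustomerID, CustomerName, LastModifiedTime)
       VALUES (s.CustomerID, s.CustomerName, s.LastModifiedTime)
    WHEN NOT MATCHED BY SOURCE THEN                                -- rows deleted in production
       DELETE;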


  • Hi Chris,

    Thank you for your quick response. Yes, we have all of that: LastModified, StartDate, EndDate, etc. It seems that this is the best method. I am in the process of migrating to a new DW, from SQL Server 2012 to 2016, so I want to do this right. I could easily just keep the packages and change the data sources, but I cannot condone reloading archived data for no logical business reason.

    Thank you again!

    There are no problems, only solutions. --John Lennon

  • Here's a script I use to generate a MERGE statement for such a case.  Run this in the source database, and just change the first 3 lines to indicate the schema, table, and source database for the merge.  In my case I kept the same table name and called the target schema ODS instead of dbo:

    DECLARE @SchemaName nvarchar(128) = N'dbo'
    DECLARE @TableName nvarchar(128) = N'TableName'
    DECLARE @SourceDatabase nvarchar(128) = N'[ServerName].[DatabaseName]'
    DECLARE @ObjectID int = OBJECT_ID(@SchemaName + '.' + @TableName)
    DECLARE @HasIdentity bit
    DECLARE @MaxColumnID int

    SET NOCOUNT ON

    -- if the table has an identity column, we'll want to use SET IDENTITY_INSERT
    SELECT @HasIdentity = CASE WHEN EXISTS(SELECT NULL FROM sys.columns WHERE object_id = @ObjectID AND is_identity = 1) THEN 1 ELSE 0 END

    -- get the list of primary key fields
    DECLARE @PKColumns TABLE (index_column_id int, column_name nvarchar(128), column_id int)

    INSERT INTO @PKColumns (index_column_id, column_name, column_id)
    SELECT ic.index_column_id, c.name, c.column_id
    FROM sys.columns c
    INNER JOIN sys.index_columns ic ON c.object_id = ic.object_id AND c.column_id = ic.column_id
    INNER JOIN sys.indexes i ON i.object_id = ic.object_id AND i.index_id = ic.index_id
    WHERE ic.object_id = @ObjectID
    AND i.is_primary_key = 1

    -- last column in table, to simplify script writing
    SELECT @MaxColumnID = MAX(column_id)
    FROM sys.columns
    WHERE object_id = @ObjectID

    -- last non-key column, so the UPDATE SET list doesn't end with a trailing comma
    DECLARE @MaxNonPKColumnID int

    SELECT @MaxNonPKColumnID = MAX(column_id)
    FROM sys.columns
    WHERE object_id = @ObjectID
    AND column_id NOT IN (SELECT column_id FROM @PKColumns)

    -- create the actual merge script
    SELECT '/* ' + @TableName + ' */', 0
    UNION ALL
    SELECT 'DECLARE' AS sql, 1
    UNION ALL
    SELECT ' @row_count int;', 2
    UNION ALL
    SELECT '', 3
    UNION ALL
    SELECT 'MERGE INTO [ODS].[' + @TableName + '] AS t', 5
    UNION ALL
    SELECT ' USING ' + @SourceDatabase + '.[' + @SchemaName + '].[' + @TableName + '] AS s', 6
    UNION ALL
    SELECT CASE WHEN index_column_id = 1
    THEN ' ON t.[' + column_name + '] = s.[' + column_name + ']'
    ELSE ' AND t.[' + column_name + '] = s.[' + column_name + ']'
    END, 6 + index_column_id
    FROM @PKColumns
    UNION ALL
    SELECT ' WHEN MATCHED AND s.[LastChangeDateTime] > t.[LastChangeDateTime] THEN', 99
    UNION ALL
    SELECT ' UPDATE SET', 100
    UNION ALL
    SELECT ' [' + c.name + '] = s.[' + c.name + CASE WHEN c.column_id = @MaxNonPKColumnID THEN ']' ELSE '],' END,
    100 + c.column_id AS sort
    FROM sys.columns c
    WHERE c.object_id = @ObjectID
    AND c.column_id NOT IN (SELECT column_id FROM @PKColumns)
    UNION ALL
    SELECT ' WHEN NOT MATCHED BY TARGET THEN', 999
    UNION ALL
    SELECT ' INSERT (', 1000
    UNION ALL
    SELECT ' [' + c.name + CASE WHEN c.column_id = @MaxColumnID THEN '])' ELSE '],' END,
    1000 + c.column_id AS sort
    FROM sys.columns c
    WHERE c.object_id = @ObjectID
    UNION ALL
    SELECT ' VALUES (', 2000
    UNION ALL
    SELECT ' s.[' + c.name + CASE WHEN c.column_id = @MaxColumnID THEN '])' ELSE '],' END,
    2000 + c.column_id AS sort
    FROM sys.columns c
    WHERE c.object_id = @ObjectID
    UNION ALL
    SELECT ' WHEN NOT MATCHED BY SOURCE THEN', 2999
    UNION ALL
    SELECT ' DELETE;', 3000
    UNION ALL
    SELECT '', 4000
    UNION ALL
    SELECT 'SET @row_count = @@rowcount;', 4001
    UNION ALL
    SELECT '', 4002
    UNION ALL
    SELECT 'INSERT INTO dbo.ODSLoadHistory', 4003
    UNION ALL
    SELECT ' (TableName, LoadTime, IsFullLoad, CountRows)', 4004
    UNION ALL
    SELECT ' VALUES', 4005
    UNION ALL
    SELECT ' (''' + @TableName + ''',GetDate(), 0, @row_count);', 4006
    UNION ALL
    SELECT 'GO', 4007
    ORDER BY 2
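
    For illustration, here's roughly the kind of script the generator produces for a hypothetical Customer table (CustomerID primary key, plus CustomerName and LastChangeDateTime columns); ServerName and DatabaseName are just the placeholders from the variables above, and dbo.ODSLoadHistory is the logging table the script writes to:

    /* Customer */
    DECLARE
     @row_count int;

    MERGE INTO [ODS].[Customer] AS t
     USING [ServerName].[DatabaseName].[dbo].[Customer] AS s
     ON t.[CustomerID] = s.[CustomerID]
     WHEN MATCHED AND s.[LastChangeDateTime] > t.[LastChangeDateTime] THEN
     UPDATE SET
     [CustomerName] = s.[CustomerName],
     [LastChangeDateTime] = s.[LastChangeDateTime]
     WHEN NOT MATCHED BY TARGET THEN
     INSERT (
     [CustomerID],
     [CustomerName],
     [LastChangeDateTime])
     VALUES (
     s.[CustomerID],
     s.[CustomerName],
     s.[LastChangeDateTime])
     WHEN NOT MATCHED BY SOURCE THEN
     DELETE;

    SET @row_count = @@rowcount;

    INSERT INTO dbo.ODSLoadHistory
     (TableName, LoadTime, IsFullLoad, CountRows)
     VALUES
     ('Customer', GetDate(), 0, @row_count);
    GO

    For a couple hundred tables you could run the generator once per table, or wrap it in a loop over sys.tables, and collect the output into a single deployment script.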
  • Awesome, thank you!

    There are no problems, only solutions. --John Lennon

  • I would suggest that you bring over all the records added or modified since the last run, and then load the changes/additions from there. I would also suggest you avoid a MERGE statement, and I'm not the only one to suggest this. I personally run an UPDATE for the changed records followed by an INSERT of the new ones. The method works well and has proven to be very fast.
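
    A minimal sketch of that approach, assuming a hypothetical dw.Customer target, a stg.Customer staging table holding only the rows added or changed since the last run, and a LastModifiedTime audit column:

    -- update existing rows whose source version is newer
    UPDATE t
    SET    t.CustomerName     = s.CustomerName,
           t.LastModifiedTime = s.LastModifiedTime
    FROM   dw.Customer AS t
           INNER JOIN stg.Customer AS s
                   ON s.CustomerID = t.CustomerID
    WHERE  s.LastModifiedTime > t.LastModifiedTime;

    -- insert rows that don't exist in the target yet
    INSERT INTO dw.Customer (CustomerID, CustomerName, LastModifiedTime)
    SELECT s.CustomerID, s.CustomerName, s.LastModifiedTime
    FROM   stg.Customer AS s
    WHERE  NOT EXISTS (SELECT 1 FROM dw.Customer AS t
                       WHERE t.CustomerID = s.CustomerID);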
