Selective delta refresh

  • Hi, we have a set of packages that pull data from a source table and perform a delta refresh on a destination table. If source table A has 10 columns, col1 to col10, we pull the last 7 days of data into a staging table B and compare it against destination table C. If any rows in the source have been modified (assume col1 and col2 are the primary key; col4 and col5 are the columns we are interested in comparing, and we don't have to consider changes to the other columns), we update them in the destination, and if any new rows have been added, we insert them.

    However, we end up changing data items for a row that we may not necessarily care about…

    The Slowly Changing Dimension SSIS task goes some way towards helping decide which changes to apply and which to ignore (i.e. fixed attributes), but it is painfully slow.

    What we are planning is to introduce a binary checksum column computed from the attributes we do care about. Example:

    CREATE TABLE [dbo].[Live] (Id INT, Name VARCHAR(40), Grade VARCHAR(2), bCheckSum AS BINARY_CHECKSUM(Grade) );

    CREATE TABLE [dbo].[Staging] (Id INT, Name VARCHAR(40), Grade VARCHAR(2), bCheckSum AS BINARY_CHECKSUM(Grade) );

    INSERT INTO [dbo].[Live] (Id, Name, Grade) VALUES (1, 'XYZ', 'G1')

    INSERT INTO [dbo].[Staging] (Id, Name, Grade) VALUES (1, 'XYZ', 'G2')

    SELECT * FROM [dbo].[Live]

    SELECT * FROM [dbo].[Staging]

    We can add more columns to the checksum as needed. This would enable us to identify rows where the columns we are interested in have changed, and hence drive the update on the destination table.
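
    A rough sketch only, using the Live/Staging tables above (the join here is on Id for brevity; in the real tables it would be the composite key col1, col2):

    -- Touch only rows where a column included in the checksum has changed
    UPDATE L
    SET    L.Name  = S.Name,
           L.Grade = S.Grade
    FROM   [dbo].[Live]    AS L
    JOIN   [dbo].[Staging] AS S ON S.Id = L.Id
    WHERE  S.bCheckSum <> L.bCheckSum

    -- Add rows that are in staging but not yet in the destination
    INSERT INTO [dbo].[Live] (Id, Name, Grade)
    SELECT S.Id, S.Name, S.Grade
    FROM   [dbo].[Staging] AS S
    WHERE  NOT EXISTS (SELECT 1 FROM [dbo].[Live] AS L WHERE L.Id = S.Id)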

    Another approach we are considering is to delete the last 7 days of data from the destination and pull it again.
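
    As a rough sketch (table names are illustrative, and ModifiedDate stands in for whatever column identifies the last 7 days):

    -- Remove the last 7 days from the destination...
    DELETE FROM [dbo].[Destination]
    WHERE  ModifiedDate >= DATEADD(DAY, -7, CAST(GETDATE() AS DATE))

    -- ...and re-pull that window from the source
    INSERT INTO [dbo].[Destination] (col1, col2, col3, col4, col5, col6, col7, col8, col9, col10)
    SELECT col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
    FROM   [dbo].[Source]
    WHERE  ModifiedDate >= DATEADD(DAY, -7, CAST(GETDATE() AS DATE))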

    Please let me know your thoughts on both. Apologies for the long post.

    Hussain

  • Instead of a checksum, why don't you implement a LastChgDt column to keep track of the fact that a row has changed, and then only feed those rows into the dataflow? Your dataflow could then simply pull the records from the source into the destination staging table and then call a sproc.

    That sproc could easily detect insert vs. update and then do it.
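
    As a rough sketch, using the names from your first post (col1/col2 as the key, col4/col5 as the columns you care about; adjust names to suit):

    CREATE PROCEDURE dbo.usp_ApplyDelta
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Update rows that already exist, but only when a column we care about has changed.
        -- Note: <> treats NULLs as unknown, so add IS NULL checks if these columns are nullable.
        UPDATE C
        SET    C.col4 = B.col4,
               C.col5 = B.col5
        FROM   dbo.Destination AS C
        JOIN   dbo.Staging     AS B ON B.col1 = C.col1 AND B.col2 = C.col2
        WHERE  (C.col4 <> B.col4 OR C.col5 <> B.col5);

        -- Insert rows the destination has never seen.
        INSERT INTO dbo.Destination (col1, col2, col3, col4, col5)   -- list all ten columns in practice
        SELECT B.col1, B.col2, B.col3, B.col4, B.col5
        FROM   dbo.Staging AS B
        WHERE  NOT EXISTS (SELECT 1
                           FROM   dbo.Destination AS C
                           WHERE  C.col1 = B.col1
                             AND  C.col2 = B.col2);
    END;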

    Depending on the mix, I might do it differently. If inserts are the vast majority of the transactions, I would implement it this way. I would still use a LastChgDt to limit the number of qualifying records; the process could even be made smart enough to look only for records that have changed since the last time it ran successfully.

    Inside the dataflow I would use a Lookup component to match primary keys against the records I already have in my warehouse. Records that match go out the matched (success) path; these are potential update records, and we'll handle that path in a second. We configure the fail path to redirect the records that don't match. Because this is SSIS 2005, the Lookup task only has Success and Fail outputs; SSIS 2008 has Found, Not Found, and Error paths. Using the error (fail) path we dump those records directly into the warehouse. We can do that because we KNOW they aren't there: we didn't find their primary keys. This would be pretty fast.

    For the success path we have more work to do. We have some choices, but because I want to complete the whole process right here, this is what I am going to suggest: use the OLE DB Command transformation. I would build an UPDATE statement that updates the row only if the fields we are concerned about have changed, and otherwise leaves it alone. There is a trick here. Before the OLE DB Command we need a Derived Column transform that duplicates the fields we are concerned about changing; the OLE DB Command needs them available twice, once for the SET clause and once for the WHERE clause, and the GUI only allows us to map each input column once. The duplicates make them available that second time. An example of the OLE DB Command statement would look like:

    UPDATE dbo.table
    SET Field1 = ?, Field2 = ?
    WHERE PKeyField1 = ?
      AND PKeyField2 = ?
      AND ( Field1 != ? OR Field2 != ? )
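
    To map it, the six ? markers show up in the OLE DB Command as parameters in order, so you would bind the derived-column copies to the first two (the SET values), the primary key columns to the next two, and the original Field1/Field2 values to the last two for the changed-value comparison (the copies and the originals are interchangeable here since they carry the same values).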

    Does this make any sense?

    CEWII

  • Hmm, I will try what you have suggested and see how the performance compares.

    For mostly-insert situations it works pretty well; if you have substantial updates it is not as fast. Working with smaller windows of data always helps. In my implementation I get the max LastChgDt from the destination and only look for data in the source that is after that. That way I am only looking at things that have changed since my last run, and it also handles failed runs automatically.
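
    In other words, the source query is roughly this (table names are just placeholders):

    -- If the destination is empty the MAX() returns NULL and nothing qualifies,
    -- so seed the destination (or COALESCE the MAX) for the very first load.
    SELECT s.*
    FROM   dbo.SourceTable AS s
    WHERE  s.LastChgDt > (SELECT MAX(d.LastChgDt) FROM dbo.DestinationTable AS d)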

    CEWII

  • Elliott W (4/21/2010)


    -- and it also handles failed runs automatically.

    CEWII

    As long as they are complete failures and not partial failures ... though I'm quite sure you've got it all handled.

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin

  • That is typically how I set it up, so that a dataflow either all goes in or nothing does.

    CEWII

  • Elliott W (4/22/2010)


    That is typically how I set it up, so that a dataflow either all goes in or nothing does.

    CEWII

    I knew I could rely on you ...

    The absence of evidence is not evidence of absence
    - Martin Rees
    The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
    - Phil Parkin
