Staging -> Archive: Serialized or Parallel?

  • I am doing an archiving process in the following manner:

    1) Execute SPs on ProdServer.ProdDb to extract data into ProdServer.StagingDb

    2) SSIS Transform from ProdServer.StagingDb to ArchiveServer.StagingDb

    3) Execute SPs on ArchiveServer.ArchiveDb to load data from ArchiveServer.StagingDb

    During Step 3, there are 3 tables that I need to populate, but they reference each other in the following manner:

    TableC references TableB

    TableB references TableA

    The references are done based on identity columns generated in the Archive database.

    My question is fairly general, but for those who have encountered a similar situation: have you loaded the tables one at a time (A, then B, then C), or have you predetermined the identity values and loaded them all simultaneously? Are there other things I may need to consider when deciding which approach to take? The load on the archive server is pretty minimal - the archive & purge processes are by far the largest usage it ever sees.
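
    To illustrate the one-at-a-time option, here is a rough sketch of how the identity mapping could be captured on the archive server (all table and column names below are made up for the example):

        -- Rough sketch only; StageTableA/TableA and their columns are invented names.
        -- 1) Load TableA, capturing staging key -> new identity in a mapping table.
        DECLARE @MapA TABLE (StageAId int, ArchiveAId int);

        MERGE ArchiveDb.dbo.TableA AS tgt
        USING StagingDb.dbo.StageTableA AS src
            ON 1 = 0                                -- never matches, so every row is inserted
        WHEN NOT MATCHED THEN
            INSERT (ColX, ColY)
            VALUES (src.ColX, src.ColY)
        OUTPUT src.StageAId, inserted.TableAId INTO @MapA (StageAId, ArchiveAId);

        -- 2) Load TableB, resolving its TableA reference through the mapping.
        INSERT ArchiveDb.dbo.TableB (TableAId, ColZ)
        SELECT m.ArchiveAId, sb.ColZ
        FROM StagingDb.dbo.StageTableB AS sb
        JOIN @MapA AS m ON m.StageAId = sb.StageAId;

        -- 3) TableC is loaded the same way against TableB.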

  • Assuming the keys will be there is a bad idea. There is too much risk that your package will break, and problems tend to compound quickly when it does.

    I have known people to disable constraints, load the data, fix referential integrity, and then re-enable the constraints. It is not a method I would recommend; when I have seen it done, the ETL has taken a very long time to run.
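
    For reference, that pattern usually looks something like the following (purely illustrative; substitute your own table names):

        -- Disable / load / re-validate pattern (not recommended, shown for completeness).
        ALTER TABLE dbo.TableB NOCHECK CONSTRAINT ALL;           -- stop enforcing FKs
        -- ... bulk load and fix referential integrity here ...
        ALTER TABLE dbo.TableB WITH CHECK CHECK CONSTRAINT ALL;  -- re-enable and re-validate every row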

    Personally, within SSIS I have an extract container that moves source data to stage in parallel.

    I then have a load container that transforms and loads stage data into the warehouse in dependent order.

    The use of checksums and effective indexes minimizes the time to make updates and inserts.
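
    As a rough sketch of what I mean (table and column names invented, and assuming a row checksum was calculated when the row was staged), the set-based update and insert look like this:

        -- Update only rows whose stored checksum differs from the staged checksum.
        UPDATE tgt
        SET    tgt.ColX = src.ColX,
               tgt.ColY = src.ColY,
               tgt.RowChecksum = src.RowChecksum
        FROM   dbo.DimCustomer AS tgt
        JOIN   stage.Customer AS src ON src.BusinessKey = tgt.BusinessKey
        WHERE  tgt.RowChecksum <> src.RowChecksum;

        -- Insert rows whose business key does not exist in the dimension yet.
        INSERT dbo.DimCustomer (BusinessKey, ColX, ColY, RowChecksum)
        SELECT src.BusinessKey, src.ColX, src.ColY, src.RowChecksum
        FROM   stage.Customer AS src
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS tgt
                           WHERE tgt.BusinessKey = src.BusinessKey);

        -- Indexes on BusinessKey (and ideally RowChecksum) on both sides keep this cheap.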

    You could also process in parallel everything whose keys already exist, and quarantine the rows that fail the foreign key lookups. Then you would only have to process the quarantined rows in dependent order.

    If you go down that path, you would be best off eliminating staging altogether.

    Each row would then go through the steps below:

    Calculate Checksum

    Lookup Foreign Dimension Keys using relevant business keys. Send lookup failures to Quarantine table.

    Lookup Dimension Key using business key. Lookup failures are inserted into Dimension table

    Lookup Dimension Key & Checksum. Lookup failures are updated in the Dimension table. Lookup successes are discarded.

    The Updates & Inserts would both need a step to perform any other transformations you require.

    The quarantined rows then go through the same process in dependent order, and any foreign keys that are still missing point to a dummy Dimension record.
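
    Expressed in SQL terms rather than as an SSIS data flow, the quarantine handling might look roughly like this (names are invented, and the dummy record is assumed to have a key of -1):

        -- Park rows whose foreign dimension key cannot be resolved yet.
        INSERT dbo.QuarantineDimProduct (ProductBusinessKey, CategoryBusinessKey, ColX)
        SELECT s.ProductBusinessKey, s.CategoryBusinessKey, s.ColX
        FROM   src.Product AS s
        LEFT JOIN dbo.DimCategory AS c ON c.CategoryBusinessKey = s.CategoryBusinessKey
        WHERE  c.CategoryKey IS NULL;

        -- Reprocess quarantined rows once DimCategory has loaded; anything still
        -- unresolved falls back to the dummy record (CategoryKey = -1 by convention here).
        INSERT dbo.DimProduct (CategoryKey, ProductBusinessKey, ColX)
        SELECT COALESCE(c.CategoryKey, -1), q.ProductBusinessKey, q.ColX
        FROM   dbo.QuarantineDimProduct AS q
        LEFT JOIN dbo.DimCategory AS c ON c.CategoryBusinessKey = q.CategoryBusinessKey;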

    Ideally I'd use this method myself, but SSIS does not have a native checksum transformation (though I'm sure the custom ones out there would work fine), it moves a lot more complexity into the SSIS package and makes it harder to manage (currently a lot of the complexity is encapsulated within the individual SPs), and it would simply take a large amount of time that I cannot justify until ETL performance becomes a serious issue.
