Ready Reckoner for top performing strategy for SCD Type II

  • Hi guys,

    I am writing a whole bunch of ETL packages to move data from operational systems to a data hub which is managing records on a type II basis (keep every copy of the record) Each record is tagged with its load date and a batch number to indicate the ETL process that was run to load the data.

    Each table has a supporting view which can be used to return the most recent version of the record

    WITH CTE AS

    (

    SELECE Max(loadDate) as 'LoadDate',PrimaryKey

    FROM someTable

    GROUP BY PrimaryKey

    )

    SELECT * from someTable ST

    JOIN CTE on CTE.PrimaryKey = ST.PrimaryKey and CTE.LoadDate = ST.LoadDate

    I know that there are three basic choices to manage SCD Type II records:

    SCD Wizard

    MergeJoin

    Lookup

    Is there a ready reckoner to calculate the likely performance of each solution without having to write and test all three.

    Most table extracts will be on a delta as I have no CDC indicators as to which records have changed. The delta could be as large as 7M records of which I would expect between 30K and 100K to change.

    If there is no ready reckoner, what factors should I be considering to estimate performance.

    MergeJoin and Lookup both require a conditional split as I have to validate that records that match on PK have the same values in all fields (yes I am getting a hash match component installed but IT have not authorized it yet)

  • SCD Wizard has terrible performance for the updates as it uses the OLE DB command.

    You can run the wizard and replace the OLE DB command with an OLE DB Destination, where you write the updates to a staging table. In a later step, you can do a regular TSQL UPDATE using an Execute SQL Task.

    However, every time you run the wizard again, to make additional changes perhaps, your adaptations are lost.

    Merge Join requires sorting, so that is usually also a no go for me.

    Usually I check with a Lookup to see if a row is an update or an insert and use a hash to determine if an update is an actual change or just the same row. I have written an article about this subject:

    Using hash values in SSIS to determine when to insert or update rows[/url]

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks Koen.

    On further consideration the default SCD wizard is not an option for me because the source is an ODBC DSN rather than a SQL server.

    MergeJoin is OK, because I can specify isSorted=True on both sources, however I still need to outer join up to 7M records so performance is not great (one source is Progress 10 and is coming across the network; the other is SQL and is probably going to be on the same physical server as the package, but storage may be on the SAN)

    lookup is slow (I am using no cache and advanced query to pinpoint the individual record. This means up to 7M individual transactional selects against a view and is a performance hog.

    I am thinking that the only way to get any performance is to load to a staging table, get the changeset using SQL and then update, but this feels like giving up 😉

  • Why are you not using the cache in the lookup component?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • PragmaticWorks have a SCD component that is a big improvement on the out-of-the-box SCD transformation...but of course you'll have to pay for it!

    You could also use a T-SQL MERGE statement.

    Regards

    Lempster

  • Lempster (3/21/2014)


    PragmaticWorks have a SCD component that is a big improvement on the out-of-the-box SCD transformation...but of course you'll have to pay for it!

    You could also use a T-SQL MERGE statement.

    Regards

    Lempster

    There's also a dimension merge component at codeplex[/url], which is also a substantial improvement over the SSIS SCD.

    I have an overview of various SCD2 methods here:

    Four Methods for Implementing a Slowly Changing Dimension in Integration Services – Part I[/url]

    Four Methods for Implementing a Slowly Changing Dimension in Integration Services – Part II[/url]

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I'm not using Cache because I wanted to use the advanced query function

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply