CDC

  • Hi All,

    I am a data warehouse developer.

I need to design and build an "INCREMENTAL LOAD" for my SQL Server 2005 source database.

I came to know that there is a built-in mechanism, CDC (Change Data Capture), available in SQL Server 2008.

Is the same kind of mechanism available in SQL Server 2005?

If not, what would be the best approach for capturing the changed data?

I have the same issue. I need to make use of CDC for incremental loads. If you have implemented it, please share with me.

    thanks

  • hi...

    Approach 1:

If your warehouse is not a scaled-out implementation, you might try the Lookup transformation,

thus only picking up the non-matched rows from the operational systems and inserting/updating those in your warehouse.

    Approach 2:

If it is a scaled-out implementation, you may try the Slowly Changing Dimension (SCD) approach.

    About CDC...

1. The user should have an administrative (sysadmin) role on the operational database.

If that criterion is met...CDC is the optimum method...
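For reference, enabling CDC in 2008 looks roughly like this (the database and table names below are just placeholders, and enabling it at the database level needs sysadmin rights):

USE OperationalDB;
GO
EXEC sys.sp_cdc_enable_db;
GO
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Emp',
     @role_name     = NULL;   -- NULL means no gating role is required to read the change data
GO
-- Changes are then read from the generated table-valued function, e.g.
-- SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Emp(@from_lsn, @to_lsn, N'all');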

    Please let me know if you wish to know more about CDC and implementation.

    Raunak J

  • Hi ...

Let me try to explain the way our team implemented this in one of my past projects:

    In a data flow task:

1) An OLE DB Source holds the source query (e.g. employee records) for the incoming recordset.

2) Create a Lookup transformation.

In the Connection tab, use the results of a SQL query.

    Create the query such that it returns the records having the latest insert dates.

    Example:

SELECT Ename, EmpId, DeptId FROM Emp E1 WHERE E1.InsertDate = (SELECT MAX(InsertDate) FROM Emp E2 WHERE E1.EmpId = E2.EmpId)

3) Map all the primary key columns of the incoming records to the available lookup columns.

4) In the General tab, for non-matching rows, select "Redirect rows to error output".

5) This way we get the non-matching rows in the error output. These are the new records coming in.

    Among the matching rows we can use a Conditional Split to find if the already existing records are modified or not

    The conditional split matches values for each column of the incoming records with the existing records.

If there is a mismatch in any of the columns, it means the record has been modified.

    Do a union all of the modified records and the new records only. Ignore the unmodified existing records.

Insert the records from the Union All transformation.
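For reference, the same new-vs-changed logic can also be sketched in set-based T-SQL (reusing the Emp example above; Stg is an assumed staging table holding the incoming records):

-- New records: business keys not present in the destination
SELECT s.EmpId, s.Ename, s.DeptId
FROM Stg s
LEFT JOIN Emp d ON d.EmpId = s.EmpId
WHERE d.EmpId IS NULL

-- Changed records: keys match but at least one column differs
SELECT s.EmpId, s.Ename, s.DeptId
FROM Stg s
JOIN Emp d ON d.EmpId = s.EmpId
WHERE ISNULL(s.Ename, '') <> ISNULL(d.Ename, '')
   OR ISNULL(s.DeptId, -1) <> ISNULL(d.DeptId, -1)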

    ------------------
    Why not ?

  • Hello Vineet,

That was simple... Did you try the SCD and CDC approaches?

Also... was my response conclusive?

    Please let me know.

    Raunak J

  • vineet1234 (6/18/2010)


5) This way we get the non-matching rows in the error output. These are the new records coming in.

    Among the matching rows we can use a Conditional Split to find if the already existing records are modified or not

    The conditional split matches values for each column of the incoming records with the existing records.

    Vineet,

    You gave some great advice, but I would like to add that this would not be the most efficient approach for dealing with large data sets. Instead of using a conditional split to analyze each column, assign the columns a hashbyte value to check for modifications to the record.
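Something along these lines (Stg is an assumed staging table, and the column list and delimiter are only illustrative; HASHBYTES wants string or binary input, so the columns need casting and NULL handling):

-- Compute an MD5 hash over the tracked columns of each incoming row
SELECT EmpId,
       HASHBYTES('MD5',
           ISNULL(Ename, '') + '|' +
           ISNULL(CAST(DeptId AS varchar(10)), '')) AS RowHash
FROM Stg

-- A row counts as modified when its stored hash differs from the newly computed one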

  • Instead of using a conditional split to analyze each column, assign the columns a hashbyte value to check for modifications to the record.

Even better... I remember there was an article posted on the SQL Server forum about using an MD5 hashbyte value to detect any updating of data.... :-D:-D:-D

    Raunak J

  • tmitchelar (6/21/2010)


    vineet1234 (6/18/2010)


5) This way we get the non-matching rows in the error output. These are the new records coming in.

    Among the matching rows we can use a Conditional Split to find if the already existing records are modified or not

    The conditional split matches values for each column of the incoming records with the existing records.

    Vineet,

    You gave some great advice, but I would like to add that this would not be the most efficient approach for dealing with large data sets. Instead of using a conditional split to analyze each column, assign the columns a hashbyte value to check for modifications to the record.

    If you have only SCD Type 1 (e.g. only inserts and updates), the approach of vineet1234 works wonderfully. Just join the input with your destination table on the business keys. Then, use a conditional split. If a surrogate key was found, it is an update. If not, it is an insert. Inserts go to OLE DB Destination with fast load and updates go to a temp table. Then use this temp table to perform a set based update on the destination table.
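The set-based update from the temp table would look something like this (table and column names are assumed from the Emp example earlier in the thread):

UPDATE d
SET d.Ename  = u.Ename,
    d.DeptId = u.DeptId
FROM dbo.Emp AS d
JOIN #Updates AS u ON u.EmpId = d.EmpId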

    This goes really really fast.

    However, in 2008, you can just use the T-SQL MERGE statement.
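A minimal MERGE sketch, again with assumed table and column names:

MERGE dbo.Emp AS tgt
USING dbo.Stg AS src
    ON tgt.EmpId = src.EmpId
WHEN MATCHED AND (tgt.Ename <> src.Ename OR tgt.DeptId <> src.DeptId)
    THEN UPDATE SET tgt.Ename = src.Ename, tgt.DeptId = src.DeptId
WHEN NOT MATCHED BY TARGET
    THEN INSERT (EmpId, Ename, DeptId) VALUES (src.EmpId, src.Ename, src.DeptId);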

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Raunak Jhawar (6/21/2010)


    Instead of using a conditional split to analyze each column, assign the columns a hashbyte value to check for modifications to the record.

Even better... I remember there was an article posted on the SQL Server forum about using an MD5 hashbyte value to detect any updating of data.... :-D:-D:-D

    Hi Raunak.. Do you have that article?? I just want to go through it...

    ------------------
    Why not ?

  • da-zero (6/22/2010)


    tmitchelar (6/21/2010)


    vineet1234 (6/18/2010)


5) This way we get the non-matching rows in the error output. These are the new records coming in.

    Among the matching rows we can use a Conditional Split to find if the already existing records are modified or not

    The conditional split matches values for each column of the incoming records with the existing records.

    Vineet,

    You gave some great advice, but I would like to add that this would not be the most efficient approach for dealing with large data sets. Instead of using a conditional split to analyze each column, assign the columns a hashbyte value to check for modifications to the record.

    If you have only SCD Type 1 (e.g. only inserts and updates), the approach of vineet1234 works wonderfully. Just join the input with your destination table on the business keys. Then, use a conditional split. If a surrogate key was found, it is an update. If not, it is an insert. Inserts go to OLE DB Destination with fast load and updates go to a temp table. Then use this temp table to perform a set based update on the destination table.

    This goes really really fast.

    However, in 2008, you can just use the T-SQL MERGE statement.

    I had Inserts Only. Overall Steps:

1) Look up the destination on the business keys.

2) Unmatched rows go to the error output (these are the new records) -- insert them directly.

3) Matched rows imply records that already exist in the destination table, arriving either with the same values or with changed values. Here we used a Conditional Split to determine whether the values have changed.

4) Insert the records with changed values with new insert dates. Ignore existing records that have no changes.

    ------------------
    Why not ?

  • vineet1234 (6/22/2010)


    Raunak Jhawar (6/21/2010)


    Instead of using a conditional split to analyze each column, assign the columns a hashbyte value to check for modifications to the record.

Even better... I remember there was an article posted on the SQL Server forum about using an MD5 hashbyte value to detect any updating of data.... :-D:-D:-D

    Hi Raunak.. Do you have that article?? I just want to go through it...

    Hi Vineet,

Please follow the link.

    Raunak J

  • Vineet,

    A word of caution though,

Considering the current scenario, a professional approach would not be to use the MD5 hashbyte... the reason being the overhead of converting every row of selected data to its hashed form. This can be handled really well using T-SQL statements, or for convenience the SSIS SCD component, which in my view gives a better return on the investment of your time.

Anyway, I hope the suggestions added by the esteemed members have solved the issue you and your team were facing.

    Raunak J

I've used the SCD component, lookups on Type 1/Type 2 columns to determine new rows/inserts/updates, and hashbytes. There is simply no comparison in performance when dealing with large data sets. Hashbytes win hands down.

No professional would ever suggest using the SCD component unless the data set is very small. The SCD component is notorious for poor performance; it uses row-at-a-time data calls, which is inefficient. The "overhead" of converting column values to their hashbyte form is small and is far outweighed by the performance lost doing a lookup on every Type 1 and/or Type 2 column. This becomes more true as the data set gets larger and/or wider and/or you have parallel processes running.

    Also, the hashbyte column in conjunction with a conditional split can ensure that only changed rows are updated which will further increase performance and give accurate updated row counts in your auditing. Yes, you can accomplish this by adding a bunch of conditions on your set based update, but with a hashbyte you kill two birds with one stone.

And as far as the investment of your time goes, granted I'm not the fastest at configuring the SCD component, but I can implement a package utilizing hashbytes far quicker than it takes me to set up the SCD component, and my solution will be more efficient to boot!

I would wholeheartedly agree that the SCD component is just fine for a small data set, but the performance of hashbytes shines bright when it comes to large data sets - and I ALWAYS code for scalability... which means I never code with the SCD component. I would go so far as to say that the SCD component and the OLE DB Command component are two components (due to their row-by-row calls) that should be completely left out of any professional solution.
