Incremental load - using timestamp reliable?

  • We have a transactional database with every table containing create_date and update_date columns.

    We designed a data warehouse and developed ETL process (initial and incremental).

    Is it a valid concern to use create_date/ update_date for incremental ETL process or a better approach to simply load all records new/ modified by comparing to existing DW?

    Are there any advantages/ downsides to either strategy?

    Thanks Alexander.

  • Shurkadze (8/19/2011)


    We have a transactional database with every table containing create_date and update_date columns.

    We designed a data warehouse and developed ETL process (initial and incremental).

    Is it a valid concern to use create_date/ update_date for incremental ETL process or a better approach to simply load all records new/ modified by comparing to existing DW?

    Are there any advantages/ downsides to either strategy?

    After initial load ETL should be incremental.

    By design DWH system stores a wider range of data than OLTP systems do therefore not all the data is available on the OLTP system.

    The secret is how to identify what has to be moved from OLTP to DWH staging tables, your row created/updated timestamp is a very valid and reliable strategy for extracting new data from OLTP system.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (8/22/2011)


    Shurkadze (8/19/2011)


    We have a transactional database with every table containing create_date and update_date columns.

    We designed a data warehouse and developed ETL process (initial and incremental).

    Is it a valid concern to use create_date/ update_date for incremental ETL process or a better approach to simply load all records new/ modified by comparing to existing DW?

    Are there any advantages/ downsides to either strategy?

    After initial load ETL should be incremental.

    By design DWH system stores a wider range of data than OLTP systems do therefore not all the data is available on the OLTP system.

    The secret is how to identify what has to be moved from OLTP to DWH staging tables, your row created/updated timestamp is a very valid and reliable strategy for extracting new data from OLTP system.

    What if you have a row modified in OLTP with a column value changed that is not mapped to DW. The data then should not be moved to Data Warehouse? Correct? What is the use of timestamp then?

  • Shurkadze (8/22/2011)


    What if you have a row modified in OLTP with a column value changed that is not mapped to DW. The data then should not be moved to Data Warehouse? Correct? What is the use of timestamp then?

    how about anticipated growth? it may not map to the data warehouse today, but a future upgrade or functionality may require it?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (8/22/2011)


    Shurkadze (8/22/2011)


    What if you have a row modified in OLTP with a column value changed that is not mapped to DW. The data then should not be moved to Data Warehouse? Correct? What is the use of timestamp then?

    how about anticipated growth? it may not map to the data warehouse today, but a future upgrade or functionality may require it?

    The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.

    Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?

  • Shurkadze (8/22/2011)


    The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.

    Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?

    Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?

    An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.

    Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.

    Use your timestamp so sort out DWH ETLable candidates - you will sleep better.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (8/22/2011)


    Shurkadze (8/22/2011)


    The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.

    Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?

    Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?

    An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.

    Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.

    Use your timestamp so sort out DWH ETLable candidates - you will sleep better.

    I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.

  • Shurkadze (8/22/2011)


    PaulB-TheOneAndOnly (8/22/2011)


    Shurkadze (8/22/2011)


    The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.

    Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?

    Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?

    An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.

    Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.

    Use your timestamp so sort out DWH ETLable candidates - you will sleep better.

    I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.

    do you mean... comparing row-by-row, column-by-column? :w00t:

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (8/22/2011)


    Shurkadze (8/22/2011)


    PaulB-TheOneAndOnly (8/22/2011)


    Shurkadze (8/22/2011)


    The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.

    Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?

    Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?

    An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.

    Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.

    Use your timestamp so sort out DWH ETLable candidates - you will sleep better.

    I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.

    do you mean... comparing row-by-row, column-by-column? :w00t:

    You can use

    SELECT *

    INTO #TEMP

    FROM

    (SELECT

    col1,

    col2,

    col3

    FROM

    OLTP_TABLE

    EXCEPT

    SELECT

    col1,

    col2,

    col3

    FROM

    DW_Dim_TABLE) a

    This will create temp table with results of differences.

  • Shurkadze (8/22/2011)


    PaulB-TheOneAndOnly (8/22/2011)


    Shurkadze (8/22/2011)


    PaulB-TheOneAndOnly (8/22/2011)


    Shurkadze (8/22/2011)


    The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.

    Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?

    Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?

    An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.

    Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.

    Use your timestamp so sort out DWH ETLable candidates - you will sleep better.

    I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.

    do you mean... comparing row-by-row, column-by-column? :w00t:

    You can use

    SELECT *

    INTO #TEMP

    FROM

    (SELECT

    col1,

    col2,

    col3

    FROM

    OLTP_TABLE

    EXCEPT

    SELECT

    col1,

    col2,

    col3

    FROM

    DW_Dim_TABLE) a

    This will create temp table with results of differences.

    I've several DIM tables over 500 millons rows in size - do you really think this has the ability to perform in a daily basis?

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (8/23/2011)


    Shurkadze (8/22/2011)


    PaulB-TheOneAndOnly (8/22/2011)


    Shurkadze (8/22/2011)


    PaulB-TheOneAndOnly (8/22/2011)


    Shurkadze (8/22/2011)


    The DW contains small subset of columns of OLTP. Relying on timestamp may not be meaningful since only a small subset of columns changed should trigger load into DW.

    Why not simply load all records from OLTP not in DW regardless of values of timestamp? What is the downside of that?

    Okay, let's say you have a "customers" table on OLTP side and "dim_customers" on DWH side - is that a fair scenario?

    An old customer, Acme Corp moves from White Plains, NY to Encino, CA so OLTP system alters some columns on that particular row.

    Acme Corp row is in both the OLTP system and the DWH system... so according to your theory changes are not moved! Even if this is clearly a SCD of the Type 2 change.

    Use your timestamp so sort out DWH ETLable candidates - you will sleep better.

    I will then find this record by comparing data in OLTP to one in DW and handle it accordingly in DW.

    do you mean... comparing row-by-row, column-by-column? :w00t:

    You can use

    SELECT *

    INTO #TEMP

    FROM

    (SELECT

    col1,

    col2,

    col3

    FROM

    OLTP_TABLE

    EXCEPT

    SELECT

    col1,

    col2,

    col3

    FROM

    DW_Dim_TABLE) a

    This will create temp table with results of differences.

    So you have dimension tables with 500M+ records? Or is this a hypothetical question?

  • Would it be possible to do a checksum on the fields you're interested in on both source and DW; if they're different, then you need to do an update or add row for your slowly changing dimension? You could even store the checksum on the DW if that's faster than recalculating. Would that be performant enough for your scenario?

  • Shurkadze (8/23/2011)So you have dimension tables with 500M+ records? Or is this a hypothetical question?

    Nothing hypothetical about it - actually one of the DIM tables holds 1.2 billion rows.

    How do you think your proposed solution will scale in a medium size environment like the one I'm describing?

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • rgtft (8/24/2011)


    Would it be possible to do a checksum on the fields you're interested in on both source and DW; if they're different, then you need to do an update or add row for your slowly changing dimension? You could even store the checksum on the DW if that's faster than recalculating. Would that be performant enough for your scenario?

    Why just don't use PK as everybody does? run incremental ETL processes staging only newly added/updated rows?

    What's the point on trying to complicate a simple process like this one?

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply