Staging Incremental Methodology - Flag Record as an INSERT, UPDATE or DELETE to the Data Warehouse Table

  • When you load the Staging table for an Incremental Load is it not common practice to do a comparison with what is in the corresponding Data Warehouse Table to populate the Staging Table?

    Also is it not common practice to mark the record in Staging with an INSERT, UPDATE or DELETE Flag so that you know what to do with the record?

    Any ideas or suggestions would be greatly appreciated.

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • #1 - Yes it is common practice to compare data in staging with the data warehouse.

    #2 - Because of #1 I don't care if rows in staging are marked as INSERT or UPDATE. Since we are comparing staging to DW, the ETL process determines what type of action will take place in the DW. For example, a row inserted into a source database may or may not be an insert into your DW, depending on your data model. On the other hand, deletes need to be tracked which I would normally deal with separately in a staging table that stores the business key of the data that was deleted. This way the ETL can take the appropriate action in the DW.

    LinkedIn - http://www.linkedin.com/in/carlosbossy
    Blog - http://www.carlosbossy.com
    Follow me - @carlosbossy

  • Carlos Bossy (4/7/2012)


    #2 - Because of #1 I don't care if rows in staging are marked as INSERT or UPDATE. Since we are comparing staging to DW, the ETL process determines what type of action will take place in the DW. For example, a row inserted into a source database may or may not be an insert into your DW, depending on your data model. On the other hand, deletes need to be tracked which I would normally deal with separately in a staging table that stores the business key of the data that was deleted. This way the ETL can take the appropriate action in the DW.

    Thanks for your reply!

    I need to log how many records were Inserted, Updated or Deleted along with $ Amounts for various columns during the incermental load.

    How do I do that using your method?

    Thanks!

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Carlos Bossy (4/7/2012)


    #1 - Yes it is common practice to compare data in staging with the data warehouse.

    Don't you perform the initial load to staging. Load the DW Tables, and Clear Staging before you load the Staging with an Incremental Load?

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Depends on your ETL process to load from staging to target, but you should be able to capture INSERT/UPDATE/DELETE records from the ETL process. I used the MERGE statement and DELETE statements to handle all my ETL which allowed me to use the OUTPUT clause and capture all effected records in a temp table. I used the data in the temp table to log the count of records that were INSERTS/UPDATES/DELETES.

  • richykong (7/20/2012)


    Depends on your ETL process to load from staging to target, but you should be able to capture INSERT/UPDATE/DELETE records from the ETL process. I used the MERGE statement and DELETE statements to handle all my ETL which allowed me to use the OUTPUT clause and capture all effected records in a temp table. I used the data in the temp table to log the count of records that were INSERTS/UPDATES/DELETES.

    Could you please provide of the Sample of the OUTPUT Clause and Inserts into the Temp Table for INSERTS & UPDATES?

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Hope this helps.

    CREATE TABLEdbo.TargetTable

    ([SurrogateKey]INTIDENTITY(1,1)NOT NULL PRIMARY KEY

    ,[JoinKey]INTNOTNULL

    ,[Detail1]VARCHAR(128)NOT NULL

    ,[Detail2]VARCHAR(128)NOT NULL)

    CREATE TABLEdbo.StageTable

    ([JoinKey]INTNOTNULL

    ,[Detail1]VARCHAR(128)NOT NULL

    ,[Detail2]VARCHAR(128)NOT NULL)

    INSERT INTO dbo.TargetTable

    (JoinKey,[Detail1], [Detail2])

    SELECT1,'Test 1', 'blahblahblah'

    UNION ALL SELECT 2,'Test 2', 'blahblahblah'

    UNION ALL SELECT 3,'Test 3', 'blahblahblah'

    UNION ALL SELECT 4,'Test 4', 'blahblahblah'

    INSERT INTO dbo.StageTable

    (JoinKey,[Detail1], [Detail2])

    SELECT1,'Test A', 'blahblahblah'

    UNION ALL SELECT 2,'Test B', 'blahblahblah'

    UNION ALL SELECT 5,'Test C', 'blahblahblah'

    UNION ALL SELECT 6,'Test D', 'blahblahblah'

    --SELECT the table before the MERGE

    SELECT * FROM dbo.TargetTable

    SELECT * FROM dbo.StageTable

    DECLARE@TempASTABLE

    (ActionVARCHAR(20)

    ,JoinKeyINT

    ,Detail1VARCHAR(128)

    ,Detail2VARCHAR(128))

    MERGEdbo.TargetTableASTarget

    USINGdbo.StageTableASSource

    ONTarget.JoinKey=Source.JoinKey

    WHEN MATCHED

    AND(Target.[Detail1]<>Source.[Detail1]

    ORTarget.[Detail2]<>Source.[Detail2])

    THEN

    UPDATE

    SET[Detail1]=Source.[Detail1]

    ,[Detail2]=Source.[Detail2]

    WHEN NOT MATCHED

    THEN

    INSERT

    (JoinKey

    ,Detail1

    ,Detail2)

    VALUES(Source.JoinKey

    ,Source.Detail1

    ,Source.Detail2)

    --- You can reference 'Source', 'Target', 'Inserted', or 'Deleted' to output from.

    OUTPUT$action

    ,Inserted.JoinKey

    ,Inserted.Detail1

    ,Inserted.Detail2

    INTO @Temp;

    --SELECT The results of the merge action.

    SELECT *

    FROM @Temp

    SELECT * FROM dbo.TargetTable

    SELECT * FROM dbo.StageTable

    DROP TABLE dbo.TargetTable

    DROP TABLE dbo.StageTable

  • richykong (7/20/2012)


    ...I used the MERGE statement and DELETE statements to handle all my ETL...

    I do agree on the MERGE statement but I do not on the use of DELETE - by definition we do not want to delete any data from the core Data Warehouse tables.

    The fact that a client is gone as got deleted from the OLTP source system does not means it has to be deleted in the Data Warehouse. Data Warehouse has to provide the means to see what happened on the past then Data Warehouse has to keep a record for the now extinct client (probably including a active-until date column) and also has to keep all the transactions associated with it.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • Thanks for the responses.

    I totally agree and understand that you do not want to delete records but flag them as inactive.

    As far as the OUTPUT Clause could you please go into more detail and provide more specific examples?

    Thanks.

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Welsh Corgi (7/21/2012)


    I totally agree and understand that you do not want to delete records but flag them as inactive.

    In general the inactive flag is not enough because it doesn't tell the whole storey... inactive since when? was it active two years ago? two month ago? two hours ago?

    Stuff that can go inactive should include a starting and an ending timestamp to let the system know when was active.

    This is important because business sooner or later would ask for some business metrics comparing this quarter with the same quarter last year and most probably the quantity of some-stuff (like customers) active during the compared periods would be part of the equation.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • Whan I said flag them as inactive I dod not mean to store an "Inactive Flag", I mean to indicate that they were inactive as a particular Date, etc.

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Welsh Corgi (7/21/2012)


    Whan I said flag them as inactive I dod not mean to store an "Inactive Flag", I mean to indicate that they were inactive as a particular Date, etc.

    Shame on me :crying: I'm sorry I didn't realize it was you asking the question, man - I have to pay attention.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • Carlos Bossy (4/7/2012)


    #1 - Yes it is common practice to compare data in staging with the data warehouse.

    #2 - Because of #1 I don't care if rows in staging are marked as INSERT or UPDATE. Since we are comparing staging to DW, the ETL process determines what type of action will take place in the DW. For example, a row inserted into a source database may or may not be an insert into your DW, depending on your data model. On the other hand, deletes need to be tracked which I would normally deal with separately in a staging table that stores the business key of the data that was deleted. This way the ETL can take the appropriate action in the DW.

    What "ETL process"? I believe the OP is trying to write one.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • PaulB-TheOneAndOnly (7/21/2012)


    Welsh Corgi (7/21/2012)


    I totally agree and understand that you do not want to delete records but flag them as inactive.

    In general the inactive flag is not enough because it doesn't tell the whole storey... inactive since when? was it active two years ago? two month ago? two hours ago?

    Stuff that can go inactive should include a starting and an ending timestamp to let the system know when was active.

    This is important because business sooner or later would ask for some business metrics comparing this quarter with the same quarter last year and most probably the quantity of some-stuff (like customers) active during the compared periods would be part of the equation.

    +1000

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • PaulB-TheOneAndOnly (7/21/2012)


    Welsh Corgi (7/21/2012)


    Whan I said flag them as inactive I dod not mean to store an "Inactive Flag", I mean to indicate that they were inactive as a particular Date, etc.

    Shame on me :crying: I'm sorry I didn't realize it was you asking the question, man - I have to pay attention.

    If it makes you feel any better, I saw "flag them as inactive" and thought your clarification was very worthwhile in case someone else thought of something like an "IsActive" flag.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply