"poor man's replication" ETL process design

  • TheSQLGuru (1/11/2015)


    Most of my healthcare work has been directly for service providers or insurers, although I have worked directly with Centricity and a few others, especially on data movement stuff. LOTS of work on the EDI data sets.

    I prefer to roll my own "change capture stuff" personally, and have a system for that. CDC is a BEAST from a performance and space perspective and is often unnecessary.

    Glad you have things working more quickly.

    I figured CDC must have been tossed aside for some reason. Would you care to elaborate on the system you've created for change capture?

  • I can't speak for Kevin, but I roll my own as well. Two different types, but both use carefully created, hardcoded triggers. Notice that I didn't say "hand coded". Especially if it's a "Changed Column(s) Only" type of audit table, I'll write code to capture the changes for each column. I've also been known to create a staging table to accept the output of the triggers and then blast them into the final table once a minute or so. It helps keep people reading the audit table from slowing other processes down.
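
    A minimal sketch of that pattern follows, assuming a hypothetical dbo.Customer table. None of these object names come from Jeff's actual system, and in practice the trigger body would be generated by code, one INSERT per audited column.

    -- Hypothetical base table.
    CREATE TABLE dbo.Customer
    (
        CustomerID int IDENTITY PRIMARY KEY,
        Phone      varchar(30)  NULL,
        Email      varchar(200) NULL
    );

    -- Final audit table plus the staging table the trigger writes to.
    CREATE TABLE dbo.Audit
    (
        TableName  sysname      NOT NULL,
        KeyValue   int          NOT NULL,
        ColumnName sysname      NOT NULL,
        OldValue   varchar(200) NULL,
        NewValue   varchar(200) NULL,
        ChangedBy  sysname      NOT NULL,
        ChangedAt  datetime     NOT NULL
    );
    CREATE TABLE dbo.Audit_Staging
    (
        TableName  sysname      NOT NULL,
        KeyValue   int          NOT NULL,
        ColumnName sysname      NOT NULL,
        OldValue   varchar(200) NULL,
        NewValue   varchar(200) NULL,
        ChangedBy  sysname      NOT NULL DEFAULT SUSER_SNAME(),
        ChangedAt  datetime     NOT NULL DEFAULT GETDATE()
    );
    GO
    -- "Changed Column(s) Only" audit: one staged row per column that
    -- actually changed in the UPDATE.
    CREATE TRIGGER dbo.tr_Customer_Audit
    ON dbo.Customer
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT dbo.Audit_Staging (TableName, KeyValue, ColumnName, OldValue, NewValue)
        SELECT 'Customer', d.CustomerID, 'Phone', d.Phone, i.Phone
          FROM deleted d
          JOIN inserted i ON i.CustomerID = d.CustomerID
         WHERE EXISTS (SELECT d.Phone EXCEPT SELECT i.Phone); -- NULL-safe "changed?" test

        INSERT dbo.Audit_Staging (TableName, KeyValue, ColumnName, OldValue, NewValue)
        SELECT 'Customer', d.CustomerID, 'Email', d.Email, i.Email
          FROM deleted d
          JOIN inserted i ON i.CustomerID = d.CustomerID
         WHERE EXISTS (SELECT d.Email EXCEPT SELECT i.Email);
    END;
    GO
    -- The once-a-minute sweep (e.g. a SQL Agent job): readers query
    -- dbo.Audit only, so they never block the triggers' staging table.
    DELETE s
    OUTPUT deleted.TableName, deleted.KeyValue, deleted.ColumnName,
           deleted.OldValue, deleted.NewValue, deleted.ChangedBy, deleted.ChangedAt
      INTO dbo.Audit (TableName, KeyValue, ColumnName, OldValue, NewValue, ChangedBy, ChangedAt)
      FROM dbo.Audit_Staging s;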

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • bantrim (1/11/2015)


    Jeff Moden (1/11/2015)


    Ah... understood on the OR.

    I'm thinking that a UNIQUE NIX (nonclustered index) on the ID and Hash (and that combination should be unique because the ID is unique) might be worth trying.

    Good idea; that certainly couldn't hurt, and it will probably give a significant boost for the larger tables. Thanks!

    Never, EVER, think that creating an index can't harm your system!!! There are MANY ways it can do so!
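
    For reference, the kind of index being discussed looks something like this; the table and column names below are hypothetical:

    -- Hypothetical staging table: ID is the source PK, RowHash is the
    -- computed hash used for change detection.
    CREATE TABLE dbo.ETL_Stage
    (
        ID      int        NOT NULL,
        RowHash binary(32) NOT NULL   -- e.g. from HASHBYTES('SHA2_256', ...)
    );
    -- ID is already unique, so (ID, RowHash) is unique as well.
    CREATE UNIQUE NONCLUSTERED INDEX UIX_ETL_Stage_ID_RowHash
        ON dbo.ETL_Stage (ID, RowHash);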

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • bantrim (1/11/2015)


    TheSQLGuru (1/11/2015)


    Most of my healthcare work has been directly for service providers or insurers, although I have worked directly with Centricity and a few others, especially on data movement stuff. LOTS of work on the EDI data sets.

    I prefer to roll my own "change capture stuff" personally, and have a system for that. CDC is a BEAST from a performance and space perspective and is often unnecessary.

    Glad you have things working more quickly.

    I figured CDC must have been tossed aside for some reason. Would you care to elaborate on the system you've created for change capture?

    The VAST majority of ETL processes do not care that "this record changed N times during the day" (or whatever your refresh interval is). They simply want a snapshot of the state of the data at a given point (0223 each morning, for example). So I use exceedingly simple triggers to populate "driver tables" that contain the PK column(s), sometimes a date/time field of some type, and maybe a tinyint for "type of operation", depending on the client's actual needs. Then I use that at population time to get what I need from the base table and do the classic UPSERT to move all necessary DELETE/UPDATE/INSERT (in that order) activity into the table copy. This is ALWAYS done via index seeks on the base table, so locking/blocking is VERY restricted. This is definitely simplified from what I actually build, but those are the fundamental underpinnings.
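
    A stripped-down illustration of that driver-table pattern follows; every object name is hypothetical, and this is far simpler than a production version:

    -- Hypothetical base table and its ETL copy.
    CREATE TABLE dbo.Patient
    (
        PatientID int PRIMARY KEY,
        LastName  varchar(50) NOT NULL,
        Phone     varchar(30) NULL
    );
    CREATE TABLE dbo.Patient_Copy
    (
        PatientID int PRIMARY KEY,
        LastName  varchar(50) NOT NULL,
        Phone     varchar(30) NULL
    );

    -- Driver table: just the PK, a change time, and the operation type.
    CREATE TABLE dbo.Patient_Driver
    (
        PatientID int      NOT NULL,
        ChangedAt datetime NOT NULL DEFAULT GETDATE(),
        OpType    tinyint  NOT NULL   -- 1 = insert, 2 = update, 3 = delete
    );
    GO
    -- Exceedingly simple trigger: records WHICH rows changed, not what changed.
    CREATE TRIGGER dbo.tr_Patient_Driver
    ON dbo.Patient
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT dbo.Patient_Driver (PatientID, OpType)
        SELECT i.PatientID,
               CASE WHEN d.PatientID IS NULL THEN 1 ELSE 2 END
          FROM inserted i
          LEFT JOIN deleted d ON d.PatientID = i.PatientID;

        INSERT dbo.Patient_Driver (PatientID, OpType)
        SELECT d.PatientID, 3
          FROM deleted d
         WHERE NOT EXISTS (SELECT 1 FROM inserted i WHERE i.PatientID = d.PatientID);
    END;
    GO
    -- At refresh time: DELETE, then UPDATE, then INSERT against the copy,
    -- always reaching into dbo.Patient via PK seeks. A PK may appear in
    -- the driver many times; only its presence matters here.
    DELETE c
      FROM dbo.Patient_Copy c
     WHERE EXISTS (SELECT 1 FROM dbo.Patient_Driver dr
                    WHERE dr.PatientID = c.PatientID AND dr.OpType = 3);

    UPDATE c
       SET c.LastName = p.LastName, c.Phone = p.Phone
      FROM dbo.Patient_Copy c
      JOIN dbo.Patient p ON p.PatientID = c.PatientID
     WHERE EXISTS (SELECT 1 FROM dbo.Patient_Driver dr
                    WHERE dr.PatientID = c.PatientID AND dr.OpType = 2);

    INSERT dbo.Patient_Copy (PatientID, LastName, Phone)
    SELECT p.PatientID, p.LastName, p.Phone
      FROM dbo.Patient p
     WHERE EXISTS (SELECT 1 FROM dbo.Patient_Driver dr
                    WHERE dr.PatientID = p.PatientID AND dr.OpType = 1)
       AND NOT EXISTS (SELECT 1 FROM dbo.Patient_Copy c
                        WHERE c.PatientID = p.PatientID);

    TRUNCATE TABLE dbo.Patient_Driver;  -- naive reset; a real version is careful
                                        -- not to lose rows logged mid-refresh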

    What Jeff mentioned seems more like column-level auditing and I do that pretty much just like he does too.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Using hash or checksum techniques to detect modifications across bulk numbers of records is inefficient (every column of every row has to be read and hashed) and not reliable (different data can produce the same hash value, so a change can go undetected). Consider using an indexed TIMESTAMP (rowversion) column, which gets updated automatically on every row insert or update. Then have an ETL run log table that keeps track of the maximum timestamp pulled by the previous run, which can be used as the watermark for pulling rows into the next run.
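
    A sketch of that watermark pattern, with hypothetical object names (TIMESTAMP is the deprecated name for rowversion):

    -- Hypothetical source table; SQL Server bumps RowVer automatically
    -- on every insert and update of a row.
    CREATE TABLE dbo.Claim
    (
        ClaimID int IDENTITY PRIMARY KEY,
        Amount  money NOT NULL,
        RowVer  rowversion NOT NULL
    );
    CREATE NONCLUSTERED INDEX IX_Claim_RowVer ON dbo.Claim (RowVer);

    -- ETL run log: remembers the high-water mark of the previous pull.
    CREATE TABLE dbo.ETL_RunLog
    (
        RunID     int IDENTITY PRIMARY KEY,
        MaxRowVer binary(8) NOT NULL,
        RanAt     datetime  NOT NULL DEFAULT GETDATE()
    );
    GO
    -- Each run pulls only rows changed since the previous watermark.
    DECLARE @LastRowVer binary(8);
    SELECT @LastRowVer = ISNULL(MAX(MaxRowVer), 0x0) FROM dbo.ETL_RunLog;

    SELECT ClaimID, Amount, RowVer
      FROM dbo.Claim
     WHERE RowVer > @LastRowVer;        -- index seek on IX_Claim_RowVer

    -- Record the new watermark. (In-flight transactions can still commit
    -- rows with lower rowversions afterward; MIN_ACTIVE_ROWVERSION() is
    -- the safer upper bound if that matters to you.)
    INSERT dbo.ETL_RunLog (MaxRowVer)
    SELECT MAX(RowVer) FROM dbo.Claim;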

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • But how would any of this be a "poor man's" version of replication? Even SQL Server Standard Edition supports real replication, which may work better than anything you spend time implementing.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (1/12/2015)


    Using hash or checksum techniques to detect modifications across bulk numbers of records is inefficient (every column of every row has to be read and hashed) and not reliable (different data can produce the same hash value, so a change can go undetected). Consider using an indexed TIMESTAMP (rowversion) column, which gets updated automatically on every row insert or update. Then have an ETL run log table that keeps track of the maximum timestamp pulled by the previous run, which can be used as the watermark for pulling rows into the next run.

    If only it were so easy. No fields can be added at the source. Definitely, what you're suggesting would be perfect in an ideal situation, but this is far from ideal.

    Not sure how much of my threads you have seen/followed, but this is a difficult situation, crossing different companies and, of course, HIPAA policies. We are working towards getting a user group formed between the companies to take care of things like this, but at the moment, we need the data before any movement will happen on that front.

  • bantrim (1/12/2015)


    Eric M Russell (1/12/2015)


    Using hash or checksum techniques to detect modifications across bulk numbers of records is inefficient (every column of every row has to be read and hashed) and not reliable (different data can produce the same hash value, so a change can go undetected). Consider using an indexed TIMESTAMP (rowversion) column, which gets updated automatically on every row insert or update. Then have an ETL run log table that keeps track of the maximum timestamp pulled by the previous run, which can be used as the watermark for pulling rows into the next run.

    If only it were so easy. No fields can be added at the source. Definitely, what you're suggesting would be perfect in an ideal situation, but this is far from ideal.

    Not sure how much of my threads you have seen/followed, but this is a difficult situation, crossing different companies and, of course, HIPAA policies. We are working towards getting a user group formed between the companies to take care of things like this, but at the moment, we need the data before any movement will happen on that front.

    Yes, I can vouch from personal experience that healthcare and the federal government, the two industries that could potentially benefit most from information technology and integration, are probably still the most backward in that regard. If I could re-do the last 15 years ...

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (1/12/2015)


    bantrim (1/12/2015)


    Eric M Russell (1/12/2015)


    Using hash or checksum techniques to detect modifications across bulk numbers of records is inefficient (every column of every row has to be read and hashed) and not reliable (different data can produce the same hash value, so a change can go undetected). Consider using an indexed TIMESTAMP (rowversion) column, which gets updated automatically on every row insert or update. Then have an ETL run log table that keeps track of the maximum timestamp pulled by the previous run, which can be used as the watermark for pulling rows into the next run.

    If only it were so easy. No fields can be added at the source. Definitely, what you're suggesting would be perfect in an ideal situation, but this is far from ideal.

    Not sure how much of my threads you have seen/followed, but this is a difficult situation, crossing different companies and, of course, HIPAA policies. We are working towards getting a user group formed between the companies to take care of things like this, but at the moment, we need the data before any movement will happen on that front.

    Yes, I can vouch from personal experience that healthcare and the federal government, the two industries that could potentially benefit most from information technology and integration, are probably still the most backward in that regard. If I could re-do the last 15 years ...

    You and me both... it's amazing, and kind of scary to be honest, once you're on the inside.

    I suppose that's also what makes it exciting -- that there's so much that can be improved.

  • Jeff Moden (1/11/2015)


    I can't speak for Kevin, but I roll my own as well. Two different types, but both use carefully created, hardcoded triggers. Notice that I didn't say "hand coded". Especially if it's a "Changed Column(s) Only" type of audit table, I'll write code to capture the changes for each column. I've also been known to create a staging table to accept the output of the triggers and then blast them into the final table once a minute or so. It helps keep people reading the audit table from slowing other processes down.

    I'm going to have to show my colleague your and Kevin's posts since, to her, triggers are four-letter words. We've gone back and forth over this one since we first met. They DO have their uses, but they have to be used right!

  • bantrim (1/12/2015)


    Jeff Moden (1/11/2015)


    I can't speak for Kevin, but I roll my own as well. Two different types, but both use carefully created, hardcoded triggers. Notice that I didn't say "hand coded". Especially if it's a "Changed Column(s) Only" type of audit table, I'll write code to capture the changes for each column. I've also been known to create a staging table to accept the output of the triggers and then blast them into the final table once a minute or so. It helps keep people reading the audit table from slowing other processes down.

    I'm going to have to show my colleague your and Kevin's posts since, to her, triggers are four-letter words. We've gone back and forth over this one since we first met. They DO have their uses, but they have to be used right!

    NICE, GOOD, COOL are 4-letter words! 🙂

    Seriously though, I actually had a client GO OUT OF BUSINESS because of triggers!! And the stupid thing is that I TOLD THEM IT WOULD HAPPEN 14 months before it did (and before they went live with the product)!!! Sigh. I have also seen poor triggers bring systems to their knees. But I have seen clients and companies go out of business for LOTS of other stuff too. And there is a nearly INFINITE array of ways you can cause poor performance!

    Use the right tool for the job, I always say, and sometimes a trigger is just that. And I have no problem telling a client that SQL Server is NOT the best tool for whatever it is they are trying to do, either. No zealotry here.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • bantrim (1/12/2015)


    I'm going to have to show my colleague your and Kevin's posts since, to her, triggers are four-letter words. We've gone back and forth over this one since we first met. They DO have their uses, but they have to be used right!

    Compared to adding a trigger or CDC on each table, simply adding an indexed timestamp column would be safer (less likely to break something), less intrusive (easier to maintain, with less impact on performance), and ultimately a more natural fit for the job of periodically pulling across records that have been inserted or updated.

    You should really consider talking to the DBA on the other side about adding timestamps. If there is a development environment to test with, then it should be a simple matter to add the columns and perform an integration test to ensure it doesn't break any existing ETL or reports.

    Also, for the issue where it takes 50 minutes to pull across 3,000 rows: if you're using remote queries with four-part naming, then try to find a way to rewrite them as pass-through queries using EXEC () AT <servername>.
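
    For illustration, assuming a linked server named REMOTESRV with RPC Out enabled (all object names hypothetical):

    -- Four-part-name version: the local optimizer may drag far more
    -- rows across the wire than needed and filter them locally.
    SELECT PatientID, LastName
      FROM REMOTESRV.SourceDb.dbo.Patient
     WHERE UpdatedOn >= '20150101';

    -- Pass-through version: the whole statement runs on the remote
    -- server, and only the qualifying rows come back.
    EXEC ('SELECT PatientID, LastName
             FROM SourceDb.dbo.Patient
            WHERE UpdatedOn >= ''20150101'';') AT REMOTESRV;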

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (1/12/2015)


    bantrim (1/12/2015)


    I'm going to have to show my colleague your and Kevin's posts since, to her, triggers are four-letter words. We've gone back and forth over this one since we first met. They DO have their uses, but they have to be used right!

    Compared to adding a trigger or CDC on each table, simply adding an indexed timestamp column would be safer (less likely to break something), less intrusive (easier to maintain, with less impact on performance), and ultimately a more natural fit for the job of periodically pulling across records that have been inserted or updated.

    You should really consider talking to the DBA on the other side about adding timestamps. If there is a development environment to test with, then it should be a simple matter to add the columns and perform an integration test to ensure it doesn't break any existing ETL or reports.

    Also, for the issue where it takes 50 minutes to pull across 3,000 rows: if you're using remote queries with four-part naming, then try to find a way to rewrite them as pass-through queries using EXEC () AT <servername>.

    Sorry, but I STRONGLY disagree that adding an indexed column to a table is any of the things you state, especially less likely to break things.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (1/12/2015)


    Eric M Russell (1/12/2015)


    bantrim (1/12/2015)


    I'm going to have to show my colleague your and Kevin's posts since, to her, triggers are four-letter words. We've gone back and forth over this one since we first met. They DO have their uses, but they have to be used right!

    Compared to adding a trigger or CDC on each table, simply adding an indexed timestamp column would be safer (less likely to break something), less intrusive (easier to maintain, with less impact on performance), and ultimately a more natural fit for the job of periodically pulling across records that have been inserted or updated.

    You should really consider talking to the DBA on the other side about adding timestamps. If there is a development environment to test with, then it should be a simple matter to add the columns and perform an integration test to ensure it doesn't break any existing ETL or reports.

    Also, for the issue where it takes 50 minutes to pull across 3,000 rows: if you're using remote queries with four-part naming, then try to find a way to rewrite them as pass-through queries using EXEC () AT <servername>.

    Sorry, but I STRONGLY disagree that adding an indexed column to a table is any of the things you state, especially less likely to break things.

    So you believe that adding a timestamp and non-clustered index to a table is more likely to break production than adding a trigger that inserts into an audit table? Yes, it will add additional I/O to maintain the non-clustered index, but so would the trigger and audit table, and the audit table itself would need an index. What specifically is your concern?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • So you believe that adding a timestamp and non-clustered index to a table is more likely to break production than adding a trigger that inserts into an audit table? Yes, it will add additional I/O to maintain the non-clustered index, but so would the trigger and audit table, and the audit table itself would need an index. What specifically is your concern?

    Yes, I do believe the trigger is less likely to cause issues (and especially to break an app) than adding a rowversion column. MANY applications out there will work improperly when a column is added to a table. Here is a short list of issues:

    1) SELECT * queries. This one alone is a complete deal-breaker in many apps I see in the wild (see the sketch after this list).

    2) Every additional index on a table increases the potential for a deadlock on said table with every data change that occurs.

    3) Views on the table could be affected.
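
    A quick demonstration of how item 1 bites, using hypothetical tables:

    -- Archival code using SELECT * works today...
    CREATE TABLE dbo.Orders         (OrderID int, Amount money);
    CREATE TABLE dbo.Orders_Archive (OrderID int, Amount money);
    GO
    INSERT dbo.Orders_Archive SELECT * FROM dbo.Orders;   -- succeeds
    GO
    -- ...until someone adds the change-tracking column:
    ALTER TABLE dbo.Orders ADD RowVer rowversion;
    GO
    -- The same statement now fails with Msg 213: "Column name or number
    -- of supplied values does not match table definition."
    INSERT dbo.Orders_Archive SELECT * FROM dbo.Orders;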

    Other issues off the top of my head:

    1) Complete non-supportability by the vendor. Yes, a trigger is likely unsupported as well, but it is a LOT easier to disable/drop a trigger if necessary before applying a patch, calling the support hotline, etc.

    2) Fragmentation caused by the additional column added to the table. This could require a rebuild to address, with a potential outage if the rebuild cannot be done online.

    3) Increased maintenance and statistics time/cost.

    4) Increased backup size and duration. And yes, those 8 bytes (plus the SAME 8 bytes AND the clustering key in the NC index) for every row WILL add up, especially on larger tables and/or ones with unfortunate clustering keys.

    5) Bloated memory usage from the same bytes in 4) above --> less useful data in RAM --> more physical IO --> worse performance.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
