ETL question: which option is better for performance

  • I am writing an ETL that will update records in a table that is fairly sizable. It has up to 10million records (not BIG data but big enough). The number of columns I am updating is relatively small. It occurs to me that rather than just update all of the records in the target table, perhaps it would be better to stage records that I come up with using the EXCEPT operator.

    My choices are:

    1) stage all records and update all target records even if nothing has changed in any of the columns;

    2) stage all records and only update the target records that are returned through EXCEPT comparing the staging columns to the target columns.

    On the face of it, it might be better to use EXCEPT, but I don't know what overhead that operation might add, especially since it is new for me.

    Any suggestions other than to try it both ways?

    Thanks in advance,

    Diana

  • Can't you use a WHERE clause for the UPDATE to only update the rows that need updating, without going the EXCEPT route?



    Alvin Ramard
    Memphis PASS Chapter[/url]

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • Hmm... that would be interesting to test but I'd imagine the overhead of doing the except, staging those results, then running the resulting update would be greater than just running the original updates even if there are unnecessary updates.

    Ideally you'd control it from the source side by not sending you records with no changes to begin with 😀

  • Alvin Ramard (6/3/2015)


    Can't you use a WHERE clause for the UPDATE to only update the rows that need updating, without going the EXCEPT route?

    Generally when you are doing these kinds of updates, you want to update the record when ANY column has changed. To implement that with a WHERE clause requires using ORs which are horribly inefficient. Also, if any of the columns are NULLABLE, that complicates the criteria when using a WHERE clause.

    When you use an EXCEPT clause it automatically handles records where any combination of columns has changed and it also handles NULL values. So, when updating more than one column, I almost always use an EXCEPT clause.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • dbodell (6/3/2015)


    I am writing an ETL that will update records in a table that is fairly sizable. It has up to 10million records (not BIG data but big enough). The number of columns I am updating is relatively small. It occurs to me that rather than just update all of the records in the target table, perhaps it would be better to stage records that I come up with using the EXCEPT operator.

    My choices are:

    1) stage all records and update all target records even if nothing has changed in any of the columns;

    2) stage all records and only update the target records that are returned through EXCEPT comparing the staging columns to the target columns.

    On the face of it, it might be better to use EXCEPT, but I don't know what overhead that operation might add, especially since it is new for me.

    Any suggestions other than to try it both ways?

    Thanks in advance,

    Diana

    What is the Primary Key of the table? Are there AKs (Alternate Keys) as well that you could use to isolate only the rows that need to be updated? Or have you already done that?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The primary and alternate keys really aren't a factor in this, unless I am not understanding the point. I mean, there are three ways to go: 1) update all of the records, even if there hasn't been a change, because the data set is still small enough that this isn't a performance issue; 2) use WHERE clause to filter out records where none of the columns have changed -- a pain in the butt to check all of the columns individually and to handle NULLs ; or 3) use an EXCEPT statement to determine which records have been changed and just update those.

    My real question is whether using EXCEPT adds significant overhead to the process. Everything we do in SQL is a balancing act between performance/efficiency and getting the job done. So sometimes what may seem like it is going to improve things often does not because of the way SQL Optimizer works. Does anyone know whether EXCEPT is a particularly expensive operation?

  • dbodell (6/5/2015)


    My real question is whether using EXCEPT adds significant overhead to the process. Everything we do in SQL is a balancing act between performance/efficiency and getting the job done. So sometimes what may seem like it is going to improve things often does not because of the way SQL Optimizer works. Does anyone know whether EXCEPT is a particularly expensive operation?

    As with most performance questions, there are just too many variables to give a definitive answer. The primary factor is the number of columns you are updating, and therefore the number of OR statements you would need if you don't use an EXCEPT clause.

    Another factor that can affect performance is whether there is a covering index for the fields being updated along with the fields used for the join. Also, the size of the update in comparison to the entire table can affect performance if the optimizer decides that the size is large enough to justify a table or index scan rather than an index seek.

    EXCEPT clauses can perform well, but that doesn't mean they will always perform well or that they will perform well in your particular case.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • dbodell (6/5/2015)


    The primary and alternate keys really aren't a factor in this, unless I am not understanding the point. I mean, there are three ways to go: 1) update all of the records, even if there hasn't been a change, because the data set is still small enough that this isn't a performance issue; 2) use WHERE clause to filter out records where none of the columns have changed -- a pain in the butt to check all of the columns individually and to handle NULLs ; or 3) use an EXCEPT statement to determine which records have been changed and just update those.

    My real question is whether using EXCEPT adds significant overhead to the process. Everything we do in SQL is a balancing act between performance/efficiency and getting the job done. So sometimes what may seem like it is going to improve things often does not because of the way SQL Optimizer works. Does anyone know whether EXCEPT is a particularly expensive operation?

    Correct. You might be missing the point.

    As you say, for a small set, it's probably cheaper just to do wholesale replacement. For large sets, I load the changes into a staging table and mark each row with what I want to do with that row. Of course, to do that, you either need to have PK or AK data in the staging table and in the target table.

    The advantage is that you're not tying up the table in any way until you're ready to do the final "upsert" and, since it's all based on a well indexed PK or AK, is usually very fast.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I see what you're saying. I actually do stage the data first and the staging table does have the same PK as the target table. Performance isn't terrible at all for a 90 million record table. I drop most of the indexes before the update and insert, leaving only the index needed to join the staging and target tables efficiently.

    I start out using a last update timestamp on the source data to pull out recently changed accounts. I don't know what has changed about the account, though, without doing a comparison of some sort with the target table. Right now the only comparison I am doing is determining if the account record exists in the target (in which case update all columns) or not (insert a new record). Since the source data being pulled can be as many as 25million records for an incremental load, I was just wondering if it makes sense to whittle down the records I stage to those where there actually was a change to a column I care about. You see, the account could be included in the pull because something changed in the source tables that is completely unimportant to my ETL. So it is very likely that a much smaller subset of the accounts that I end up pulling actually have changes in the columns that matter to me. That's why I was considering the EXCEPT to load the staging table.

    Thanks for your input!

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply