June 3, 2015 at 1:53 pm
I am writing an ETL process that will update records in a fairly sizable table. It has up to 10 million records (not BIG data, but big enough). The number of columns I am updating is relatively small. It occurs to me that rather than just updating all of the records in the target table, it might be better to stage only the records I identify using the EXCEPT operator.
My choices are:
1) stage all records and update all target records even if nothing has changed in any of the columns;
2) stage all records and only update the target records that are returned through EXCEPT comparing the staging columns to the target columns.
On the face of it, it might be better to use EXCEPT, but I don't know what overhead that operation might add, especially since it is new for me.
Any suggestions other than to try it both ways?
Thanks in advance,
Diana
June 3, 2015 at 2:03 pm
Can't you use a WHERE clause for the UPDATE to only update the rows that need updating, without going the EXCEPT route?
For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help
June 3, 2015 at 2:19 pm
Hmm... that would be interesting to test, but I'd imagine the overhead of doing the EXCEPT, staging those results, and then running the resulting update would be greater than just running the original updates, even with the unnecessary ones included.
Ideally you'd control it from the source side by not sending records with no changes to begin with 😀
June 3, 2015 at 2:41 pm
Alvin Ramard (6/3/2015)
Can't you use a WHERE clause for the UPDATE to only update the rows that need updating, without going the EXCEPT route?
Generally when you are doing these kinds of updates, you want to update the record when ANY column has changed. Implementing that with a WHERE clause requires a chain of ORs, which is horribly inefficient. Also, if any of the columns are nullable, that further complicates the criteria in a WHERE clause.
When you use an EXCEPT clause it automatically handles records where any combination of columns has changed and it also handles NULL values. So, when updating more than one column, I almost always use an EXCEPT clause.
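To make the pattern concrete, the usual idiom is to put the EXCEPT inside a correlated EXISTS so the update touches only rows where at least one column differs (table and column names below are made up for illustration):

```sql
-- EXISTS returns true only when the staged row differs from the
-- target row in at least one column; EXCEPT treats NULLs as equal,
-- so NULL-to-value and value-to-NULL changes are caught too.
UPDATE t
SET    t.col1 = s.col1,
       t.col2 = s.col2
FROM   dbo.TargetTable  t
JOIN   dbo.StagingTable s
  ON   s.AccountID = t.AccountID
WHERE  EXISTS ( SELECT s.col1, s.col2
                EXCEPT
                SELECT t.col1, t.col2 );
```

One nice property of this form: adding a column to the comparison means adding it to two SELECT lists, rather than three more OR predicates.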
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
June 3, 2015 at 10:32 pm
dbodell (6/3/2015)
I am writing an ETL that will update records in a table that is fairly sizable. It has up to 10 million records (not BIG data but big enough). The number of columns I am updating is relatively small. It occurs to me that rather than just update all of the records in the target table, perhaps it would be better to stage records that I come up with using the EXCEPT operator. My choices are:
1) stage all records and update all target records even if nothing has changed in any of the columns;
2) stage all records and only update the target records that are returned through EXCEPT comparing the staging columns to the target columns.
On the face of it, it might be better to use EXCEPT, but I don't know what overhead that operation might add, especially since it is new for me.
Any suggestions other than to try it both ways?
Thanks in advance,
Diana
What is the Primary Key of the table? Are there AKs (Alternate Keys) as well that you could use to isolate only the rows that need to be updated? Or have you already done that?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 5, 2015 at 10:21 am
The primary and alternate keys really aren't a factor in this, unless I am not understanding the point. I mean, there are three ways to go: 1) update all of the records, even if there hasn't been a change, because the data set is still small enough that this isn't a performance issue; 2) use a WHERE clause to filter out records where none of the columns have changed -- a pain in the butt to check all of the columns individually and to handle NULLs; or 3) use an EXCEPT statement to determine which records have changed and update just those.
My real question is whether using EXCEPT adds significant overhead to the process. Everything we do in SQL is a balancing act between performance/efficiency and getting the job done, so what seems like it should improve things often does not because of how the SQL optimizer works. Does anyone know whether EXCEPT is a particularly expensive operation?
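Just to show what I mean about option 2 being a pain: for every column compared, a NULL-safe WHERE clause needs three predicates (names here are made up, and this only covers two columns):

```sql
-- Option 2: NULL-safe change detection with ORs.
-- Plain t.col1 <> s.col1 misses NULL-to-value transitions,
-- so each column needs two extra IS NULL checks.
UPDATE t
SET    t.col1 = s.col1,
       t.col2 = s.col2
FROM   dbo.TargetTable  t
JOIN   dbo.StagingTable s
  ON   s.AccountID = t.AccountID
WHERE  t.col1 <> s.col1
   OR (t.col1 IS NULL     AND s.col1 IS NOT NULL)
   OR (t.col1 IS NOT NULL AND s.col1 IS NULL)
   OR  t.col2 <> s.col2
   OR (t.col2 IS NULL     AND s.col2 IS NOT NULL)
   OR (t.col2 IS NOT NULL AND s.col2 IS NULL);
```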
June 5, 2015 at 12:24 pm
dbodell (6/5/2015)
My real question is whether using EXCEPT adds significant overhead to the process. Everything we do in SQL is a balancing act between performance/efficiency and getting the job done, so what seems like it should improve things often does not because of how the SQL optimizer works. Does anyone know whether EXCEPT is a particularly expensive operation?
As with most performance questions, there are just too many variables to give a definitive answer. The primary factor is the number of columns you are comparing, and therefore the number of OR predicates you would need if you don't use an EXCEPT clause.
Another factor that can affect performance is whether there is a covering index for the fields being updated along with the fields used for the join. Also, the size of the update in comparison to the entire table can affect performance if the optimizer decides that the size is large enough to justify a table or index scan rather than an index seek.
EXCEPT clauses can perform well, but that doesn't mean they will always perform well or that they will perform well in your particular case.
Drew
J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA
June 5, 2015 at 2:45 pm
dbodell (6/5/2015)
The primary and alternate keys really aren't a factor in this, unless I am not understanding the point. I mean, there are three ways to go: 1) update all of the records, even if there hasn't been a change, because the data set is still small enough that this isn't a performance issue; 2) use a WHERE clause to filter out records where none of the columns have changed -- a pain in the butt to check all of the columns individually and to handle NULLs; or 3) use an EXCEPT statement to determine which records have changed and update just those. My real question is whether using EXCEPT adds significant overhead to the process. Everything we do in SQL is a balancing act between performance/efficiency and getting the job done, so what seems like it should improve things often does not because of how the SQL optimizer works. Does anyone know whether EXCEPT is a particularly expensive operation?
Correct. You might be missing the point.
As you say, for a small set, it's probably cheaper just to do wholesale replacement. For large sets, I load the changes into a staging table and mark each row with what I want to do with it. Of course, to do that, you need to have PK or AK data in both the staging table and the target table.
The advantage is that you're not tying up the table in any way until you're ready to do the final "upsert" and, since it's all based on a well-indexed PK or AK, it's usually very fast.
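In case it helps, the approach I'm describing looks roughly like this (all table, column, and flag names are just for illustration):

```sql
-- 1) Mark each staged row with the action to take,
--    keyed on the PK shared by both tables.
UPDATE s
SET    s.Action = CASE WHEN t.AccountID IS NULL THEN 'I' ELSE 'U' END
FROM   dbo.StagingTable s
LEFT JOIN dbo.TargetTable t
  ON   t.AccountID = s.AccountID;

-- 2) Apply the updates; the join is a PK-to-PK seek.
UPDATE t
SET    t.col1 = s.col1,
       t.col2 = s.col2
FROM   dbo.TargetTable  t
JOIN   dbo.StagingTable s
  ON   s.AccountID = t.AccountID
WHERE  s.Action = 'U';

-- 3) Apply the inserts.
INSERT dbo.TargetTable (AccountID, col1, col2)
SELECT AccountID, col1, col2
FROM   dbo.StagingTable
WHERE  Action = 'I';
```

The target table stays untouched until steps 2 and 3, and both of those run against indexed keys.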
--Jeff Moden
Change is inevitable... Change for the better is not.
June 5, 2015 at 3:12 pm
I see what you're saying. I actually do stage the data first and the staging table does have the same PK as the target table. Performance isn't terrible at all for a 90 million record table. I drop most of the indexes before the update and insert, leaving only the index needed to join the staging and target tables efficiently.
I start out using a last-update timestamp on the source data to pull out recently changed accounts. I don't know what has changed about an account, though, without doing a comparison of some sort against the target table. Right now the only comparison I am doing is determining whether the account record exists in the target (in which case update all columns) or not (insert a new record).
Since the source data being pulled can be as many as 25 million records for an incremental load, I was wondering whether it makes sense to whittle down the records I stage to those where there actually was a change to a column I care about. You see, an account could be included in the pull because something changed in the source tables that is completely unimportant to my ETL. So it is very likely that a much smaller subset of the accounts I pull actually have changes in the columns that matter to me. That's why I was considering the EXCEPT to load the staging table.
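What I had in mind for the staging load is something along these lines (names invented; @CutoffTime stands in for my last-update watermark):

```sql
-- Load only source rows that differ from the target in the
-- columns this ETL cares about. If no target row exists yet,
-- EXCEPT returns the source row, so new accounts come through too.
INSERT dbo.StagingTable (AccountID, col1, col2)
SELECT src.AccountID, src.col1, src.col2
FROM   dbo.SourceView src
WHERE  src.LastUpdate >= @CutoffTime
  AND  EXISTS ( SELECT src.col1, src.col2
                EXCEPT
                SELECT t.col1, t.col2
                FROM   dbo.TargetTable t
                WHERE  t.AccountID = src.AccountID );
```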
Thanks for your input!