September 25, 2008 at 5:14 am
Hi All,
Not sure if this should have gone under the discussion section, but here goes.
I have a boss who, when doing huge data fixes, tends to do them with set-based code,
e.g. running the whole change as a single UPDATE/INSERT statement.
My question is: would it not be better to do the INSERTs/UPDATEs row by row, or in batches?
My reasoning is that if the UPDATE/INSERT covers a huge amount of data, then instead of taking short, single-row locks on the table, large numbers of rows, whole pages, or perhaps even the entire table could end up locked!
What are your thoughts?
Thanks
Chris
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices
Numbers / Tally Tables
September 25, 2008 at 5:43 am
My opinion is that size matters :-). I would recommend doing large-scale data modifications in batches, but not row by row! That is the opposite extreme; what you should try to find is a balance.
If you need to delete several million rows, it is better to do it in "chunks" (their best size will vary depending on the table and hardware) to avoid long-lasting locks. For example, when deleting 40M rows from a simple logging table (10 relatively narrow columns) with 100M rows on our production server, I use a SQL Server Agent job that runs every 5 minutes and deletes 100,000 rows. This way there are no locking or performance issues, even if it runs during office hours.
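The per-run statement is basically just a TOP-limited delete, something along these lines (the table, column, and retention rule below are placeholders, not the real ones):

-- Delete at most 100,000 expired rows per run; scheduled via a SQL Server Agent
-- job every 5 minutes. dbo.AppLog and LogDate are made-up names for illustration.
DELETE TOP (100000)
FROM dbo.AppLog
WHERE LogDate < DATEADD(DAY, -90, GETDATE());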
If we are talking about a smaller number of rows (say, around 100,000), I don't see any reason to split the work into even smaller portions.
Just a question: do you experience any problems during the workload? If not, then I guess your boss is doing it right :-). If yes, try to pinpoint what precisely is causing the problem.
September 25, 2008 at 7:36 am
Size and system activity really do matter in this case. Updating/deleting millions of rows while the database is in use is probably not the best idea, while batching it into smaller chunks will likely work better. Batching also allows you to better control log file size, as you can do a delete followed by a log backup. Doing a huge update/delete/insert will likely cause log growth events, which will slow down the server.
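As a rough sketch of that pattern (all object names, the WHERE condition, and the backup path are placeholders), the loop could look something like this:

DECLARE @rows INT;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    -- Delete one manageable chunk at a time instead of everything at once.
    DELETE TOP (50000)
    FROM dbo.BigTable
    WHERE ProcessedFlag = 1;

    SET @rows = @@ROWCOUNT;

    -- Backing up the log between chunks keeps it from growing
    -- (assumes FULL or BULK_LOGGED recovery model).
    BACKUP LOG MyDatabase TO DISK = 'D:\Backups\MyDatabase_log.trn';
END;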
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
September 25, 2008 at 7:39 am
It's usually best - in MS SQL Server - to default to trying to do everything in set-based operations. MS SQL is not good at row-by-row processing.
However, there are always exceptions to every rule. If you have operations that take hours to complete and lock up your entire database, it may be best to break these into manageable chunks or even into single row operations. They will probably take much longer overall, but that may be ok.
September 25, 2008 at 7:44 am
I'm surprised that Jeff Moden hasn't jumped all over this. 😉
Anyway, I think what needs to be considered is that while doing it row by row may only lock one record at a time, the overhead of looping or using a cursor, plus the overhead of managing each individual row as a transaction, will cause the whole process to take much longer, whereas doing it as a single set-based UPDATE statement allows the database engine to take advantage of a number of optimization techniques that let it finish the whole process a lot faster. How much faster will depend, of course, on the WHERE clause of your UPDATE statement and which column your clustered index is on, but it will be faster.
As for INSERT statements, I think those would only cause a problem if your clustered index was on something that would cause a lot of pages to contain a mixture of old and new records, like a uniqueidentifier column, and many of the row locks were escalated to page-level locks.
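To make the comparison concrete (table and column names here are hypothetical), the two approaches look roughly like this:

-- Row-by-row: a cursor plus one UPDATE per row, so the engine pays the
-- looping and per-statement overhead for every single record.
DECLARE @OrderID INT;
DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT OrderID FROM dbo.Orders WHERE OrderStatus = 'Pending';
OPEN cur;
FETCH NEXT FROM cur INTO @OrderID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Orders SET OrderStatus = 'Archived' WHERE OrderID = @OrderID;
    FETCH NEXT FROM cur INTO @OrderID;
END;
CLOSE cur;
DEALLOCATE cur;

-- Set-based: one statement does the same work in a single pass and lets the
-- optimizer choose the access path.
UPDATE dbo.Orders
SET OrderStatus = 'Archived'
WHERE OrderStatus = 'Pending';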
September 25, 2008 at 7:51 am
Thanks for all the feedback so far, guys. I'm also surprised Jeff isn't all over this, he he he.
----------------------------------------------
Try to learn something about everything and everything about something. - Thomas Henry Huxley
:w00t:
Posting Best Practices
Numbers / Tally Tables
September 25, 2008 at 7:58 am
Chris Harshman (9/25/2008)
I'm surprised that Jeff Moden hasn't jumped all over this. 😉
Based on some of Jeff's other posts, he is pretty busy with work right now, so he mainly posts in the evenings in the US Central time zone, so I bet he'll catch this one tonight. ;)
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
September 27, 2008 at 7:42 pm
Jack Corbett (9/25/2008)
Chris Harshman (9/25/2008)
I'm surprised that Jeff Moden hasn't jumped all over this. 😉
Based on some of Jeff's other posts, he is pretty busy with work right now, so he mainly posts in the evenings in the US Central time zone, so I bet he'll catch this one tonight. ;)
Spot on, Jack... and it's actually Eastern US time, although I am on the western edge of it. A 9-hour day with a 1.25-hour commute each way is taking its toll on free time for the forum. The commute isn't bad, though. I'm going in the opposite direction of traffic, so I turn off the cell phone, hit cruise control, dial in my favorite oldies station, and think... good quality time both ways.
Anyway, I agree with what most have said... RBAR should not be used for these large updates. I also agree with "segmenting" the updates. If you want to keep from locking the whole table, then you need to segment the updates according to whatever the clustered index is. For example, if you have a million rows to update in a 100 million row table (or a smaller example, same thing), then it would be prudent to drop things into some form of a staging table (temporary or otherwise) and mark each row with the segment of the update it will be performed in. It's a bit like a partitioned table without actually being a partitioned table.
Anyway, you would mark the rows with segment numbers that group the rows into segments based on the order of the clustered index. That way, all the updates are "close together" so far as the clustered index goes, and you can usually avoid locking up the whole table.
Of course, if the target table is really that big, it might be a good idea to partition the target table as well.
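A minimal sketch of that staging/segment idea (all object names are invented for the example, and the clustered index is assumed to be on TargetID):

-- 1) Capture the keys of the rows to be fixed and assign segment numbers
--    in clustered index order (20 segments here, purely as an example).
SELECT  t.TargetID,
        NTILE(20) OVER (ORDER BY t.TargetID) AS SegmentNo
INTO    dbo.UpdateStaging
FROM    dbo.TargetTable t
WHERE   t.NeedsFix = 1;

-- 2) Update one segment at a time so each pass touches a narrow,
--    contiguous slice of the clustered index instead of the whole table.
DECLARE @Segment INT;
SET @Segment = 1;
WHILE @Segment <= 20
BEGIN
    UPDATE  t
    SET     t.SomeColumn = 'FixedValue'
    FROM    dbo.TargetTable t
            INNER JOIN dbo.UpdateStaging s ON s.TargetID = t.TargetID
    WHERE   s.SegmentNo = @Segment;

    SET @Segment = @Segment + 1;
END;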
Sorry I missed such an obviously named post, but I don't go near personal or forum stuff during work hours, and I certainly can't do it while I'm driving. As a result, I can't cover as many posts as I used to... my monthly post count has dropped from an average of over 800 per month to only a bit over 400. Since I'm working on some huge projects, overtime may soon come into play, and my posts, very sadly, may drop to 0 for a while.
Heh... thanks for thinking of me, though. 🙂
--Jeff Moden
Change is inevitable... Change for the better is not.