November 3, 2009 at 5:19 pm
We are creating some new update scripts that will be run on a live database. The scripts will be running on the main table for the database, so concurrency is very important.
My manager has decided to run a script that uses a cursor to loop through the results of a select statement and then update row by row. I asked him about it, and he said the team had used it before because it was the best way to prevent table locks.
We aren't too concerned with speed on these queries, but is there a better way?
November 3, 2009 at 10:18 pm
Two thoughts:
- if your update is fast enough, does it really matter that it might lock the table? How many times do you plan on doing this?
- you can limit your updates to smaller sets, and use something to "walk your clustered index" in batches of 500 at a time (or 5,000 or 50,000 depending on what you happen to be doing). Again - if you do it in small enough batches, the update will happen fast enough that any lock (page or table) won't be noticeable. The trick is finding the sweet spot (the right size to chunk into).
Something like the following (assuming for example that your clustered index is on a numeric ID column):
declare @i int,
        @batchsize int,
        @max_I int;

select  @i = 0,
        @batchsize = 10000;

select  @max_I = max(ID)
from    MyTable;

WHILE @i < @max_I
BEGIN
        -- update one chunk of the ID range at a time
        UPDATE MyTable
        set    whatever = 'blahblahblah'
        where  ID between @i and @i + @batchsize;

        -- move on to the next chunk
        set @i = @i + @batchsize + 1;
END
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 4, 2009 at 8:03 am
Basically we have clients that use the database...and they are very touchy. If any little thing happens, our client support team starts getting calls...which is not acceptable to my managers.
So, in a word, yes: bulk locks just aren't acceptable for these processes.
November 4, 2009 at 4:54 pm
And how many rows do you have in the table? Maybe snapshot transaction isolation level would be something to consider?
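Just to show what that would involve (the database, table, and query here are placeholders, not your actual names):

-- One-time, database-level switch (placeholder database name)
ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

-- A session can then opt in and read the last committed version of each row
-- instead of blocking behind an in-flight update
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
    SELECT COUNT(*) FROM dbo.MyTable;   -- placeholder query
COMMIT;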
Regards
Piotr
...and your only reply is slàinte mhath
November 4, 2009 at 7:21 pm
huston.dunlap (11/3/2009)
We are creating some new update scripts that will be run on a live database. The scripts will be running on the main table for the database, so concurrency is very important.
My manager has decided to run a script that uses a cursor to loop through the results of a select statement and then update row by row. I asked him about it, and he said the team had used it before because it was the best way to prevent table locks.
We aren't too concerned with speed on these queries, but is there a better way?
"It Depends". 😉 There several ways to do the same thing but you need to identify a couple stats before you can even begin to say which method is the "best" for any given situation. Without a doubt, a nice forward-only read-only cursor driving a well written RBAR UPDATE will always give you the best concurrency possible (providing that the cursor is NOT in a transaction) but, depending on the number of rows being updated, it might not ever be able to keep up so speed may very well be a factor (and it usually is a factor even if the cursor can keep up). It all depends...
With that in mind, there are some "stat" questions we need to ask...
1. How many rows are in the table?
2. How many total rows will be updated in any given "batch"?
3. What order can the rows be updated in?
4. Does the order of update produce "groups" of rows according to the order of the Clustered Index?
5. Does the update contain any columns that any indexes contain? If so, how many columns and on which indexes and what type of indexes will be affected? Clustered, non-clustered, unique, unique with ignore-duplicates? Any Foreign keys affected?
6. How busy is the CPU and, especially, the IO system?
7. What is the structure of the table being updated?
--Jeff Moden
Change is inevitable... Change for the better is not.
November 5, 2009 at 9:40 am
One more question you might need to figure out (in addition to Jeff's questions): if you stretch out the time it takes to do these updates - do you possibly introduce a problem WHILE the update is in flight? Concurrency is definitely a factor - but querying half updated data can be much worse....
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
November 5, 2009 at 1:28 pm
Jeff Moden (11/4/2009)
With that in mind, there are some "stat" questions we need to ask...
1. How many rows are in the table?
2. How many total rows will be updated in any given "batch"?
3. What order can the rows be updated in?
4. Does the order of update produce "groups" of rows according to the order of the Clustered Index?
5. Does the update contain any columns that any indexes contain? If so, how many columns and on which indexes and what type of indexes will be affected? Clustered, non-clustered, unique, unique with ignore-duplicates? Any Foreign keys affected?
6. How busy is the CPU and, especially, the IO system?
7. What is the structure of the table being updated?
1) 2.5 million
2) 300 - 50,000...depends on the job.
3) Order isn't important for the update...any order is fine.
4) I don't understand the question.
5) The update doesn't update any indexed columns.
6) This really varies depending on how many clients are on the system. Sometimes it runs steady at 20%...other times we are slamming the system. Unfortunately, we can't always schedule this update around slow times; it depends on client demands.
7) The table is the main hub of the program. It has about 150 columns with a variety of varchar, nvarchar, and nvarchar(max) types other than its main keys. The table is constantly being utilized and updated by users and administrators.
One more question you might need to figure out (in addition to Jeff's questions): if you stretch out the time it takes to do these updates - do you possibly introduce a problem WHILE the update is in flight? Concurrency is definitely a factor - but querying half updated data can be much worse....
Fortunately the specific fields I am updating are not being used by the client, so I don't have to worry about dirty reads.
November 5, 2009 at 7:20 pm
In that case, I think a Cursor might be a wee bit tough on resources... doing 50,000 separate updates is going to require some CPU time and some IO time... probably what you can least afford on a busy server.
What I meant on question #4 is a bit hard to explain but can probably be best explained by one word... distribution. Let's take a 50,000 row update... where are those rows going to show up in the clustered index? Will they be fairly evenly distributed throughout the table or will they show up in "clumps" of several hundred or several thousand rows, say, by date, account/customer ID or...????
The reason why knowing the distribution of the update is important is to understand the type of locking that may occur. For example, if all 50,000 rows were in a very narrow clump, you could probably get away with doing a single update for all 50,000 rows without it locking the whole table. On the other hand, if the distribution is table-wide, there's a good chance the update would escalate from simple row locks to page locks to extent locks and, eventually (and painfully) to a whole table lock. Obviously, in a high concurrency situation, you'll want to avoid that.
One of the things that can help is the fact that "simple" UPDATE's usually follow the path of the clustered index (I know... not a documented fact... prove me wrong, though ;-)). With that in mind, I'd populate a Temp table with the data that I want to update the final table with. That Temp table should have an IDENTITY column for control purposes and the data should be in the same order as the clustered index. Of course, whatever the PK column(s) is/are, the PK data should also be included in the Temp table to provide the join for the UPDATE. Additionally, the Temp Table should also have a clustered index in the same order as the final table just to make the UPDATE join a little faster.
Then, using either a cursor or just a While loop, I'd build an UPDATE "Crawler" which is a bit of an oxymoron because it's not slow in what it does... just "delayed". Have it update 1,000 rows at a time (single UPDATE statement) using the IDENTITY column to keep track of which rows to update. In the loop that does that, add a WAITFOR DELAY of somewhere between 2 and 10 seconds. It'll blast in 1,000 row "clumps" in the same order as the clustered index which will do 3 things... First, if the clustered index is pretty well maintained, it'll reduce the movement of the read/write heads of the disk drive(s) which, of course, will make the update faster. Second, it should greatly reduce the number of locks placed on the table, leaving most of the table "open" for others to use. Even with that, you could probably get away with locking the whole table (saves a little time and resources because the locks don't have to escalate) because it's going to be fast (you said you're not updating any indexed columns). Third, since we're reducing the number of UPDATE statements from 50,000 to only 50, your CPU and I/O system probably won't even see the very short "spikes" of activity and neither will your users.
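In rough strokes, the "Crawler" might look something like this (every table, column, and value name here is just a placeholder for illustration, not your actual schema):

-- Stage the new values in a Temp table, in the same order as the
-- clustered index of the target table (assumed here to be on ID)
CREATE TABLE #Staging
        (
        RowNum   INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,  -- control column
        ID       INT NOT NULL,            -- PK of the target table (join key)
        NewValue VARCHAR(100) NOT NULL    -- value(s) to apply
        );

INSERT INTO #Staging (ID, NewValue)
SELECT ID, 'blahblahblah'
  FROM dbo.MyTable
 WHERE SomeCondition = 1
 ORDER BY ID;                             -- same order as the clustered index

DECLARE @Rows      INT,
        @BatchSize INT,
        @i         INT;

SELECT @Rows      = MAX(RowNum),
       @BatchSize = 1000,
       @i         = 1
  FROM #Staging;

WHILE @i <= @Rows
BEGIN
        -- update one 1,000-row "clump" per pass
        UPDATE t
           SET t.SomeColumn = s.NewValue
          FROM dbo.MyTable t
          JOIN #Staging s ON s.ID = t.ID
         WHERE s.RowNum BETWEEN @i AND @i + @BatchSize - 1;

        SET @i = @i + @BatchSize;

        -- give everyone else a chance between "clumps"
        WAITFOR DELAY '00:00:05';
END;

DROP TABLE #Staging;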
You might also want to look into the WITH (UPDLOCK) table hint in Books Online to further increase concurrency.
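For example, something like this inside a short transaction (again, purely illustrative names):

-- Take update locks (instead of shared locks) while reading the candidate rows,
-- so plain readers aren't blocked but two writers can't grab the same rows
-- between the read and the update
BEGIN TRAN;

SELECT ID
  FROM dbo.MyTable WITH (UPDLOCK, ROWLOCK)
 WHERE SomeCondition = 1;

-- ...apply the UPDATE to those same rows here...

COMMIT;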
Overall, that solution will use fewer resources than a cursor and for a much shorter cumulative time. Instead of the cursor pegging 1 CPU for somewhere between 8 and 60 seconds, the UPDATE "Crawler" will make 50 barely visible (more likely invisible) blips of just a couple of milliseconds each.
--Jeff Moden
Change is inevitable... Change for the better is not.