February 25, 2011 at 11:57 am
mazzz (2/25/2011)
I have already identified the offending session. It's one stored procedure that does quite a lot of things, and runs for a good 2-3 hours to produce some sort of report. (I've only had a quick look; it's a 3rd party application.)
My question at this point is, do I ask the software vendor to take a good look at their procedure? Or could something be wrong at system level that could be causing this excessive locking? We've established that lock escalation isn't disabled - anything else I can check?
Unfortunately I can't spend much time myself looking into the query itself, my boss would not be pleased with not having anyone to charge my time to, and the users of the app would not want to take on the cost either!
I think you are pretty safe in that the issue is with the database design/code and not a server issue, so I would push back on the 3rd party. (I don't know that I would expect much from them, I'm guessing they don't have many/any other clients with databases as big as yours so they haven't run into this before.)
February 25, 2011 at 12:01 pm
UMG Developer (2/25/2011)
I think you are pretty safe in that the issue is with the database design/code and not a server issue, so I would push back on the 3rd party. (I don't know that I would expect much from them, I'm guessing they don't have many/any other clients with databases as big as yours so they haven't run into this before.)
Story of our life! We manage about 100 SQL Servers, mostly containing 3rd party apps. I work for local government for a large city. Most of these apps work very well for organisations smaller than ours, but we're the 2nd biggest such organisation in the country, and a lot of these apps don't scale well...
Thank you very much for your time and input!
February 27, 2011 at 2:00 am
mazzz (2/25/2011)
I ran sp_lock and it returned 1,500,000 rows (all key locks)
There was a nasty bug with this in certain builds.
What version and build of SQL Server (e.g. 10.0.4272) are you running there?
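If it's easier than digging through the installer logs, a quick sketch of how to pull the build, service pack level and edition (nothing here is specific to your setup):

SELECT SERVERPROPERTY('ProductVersion') AS build,
       SERVERPROPERTY('ProductLevel')   AS service_pack_level,
       SERVERPROPERTY('Edition')        AS edition;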
Paul
February 27, 2011 at 3:49 am
Thanks Paul. I'm pretty sure both Test and Live servers are at SP2 as they're both fairly recent installs, but will double check tomorrow when I'm back in the office.
February 27, 2011 at 3:55 am
mazzz (2/27/2011)
Thanks Paul. I'm pretty sure both Test and Live servers are at SP2 as they're both fairly recent installs, but will double check tomorrow when I'm back in the office.
I'm going to assume that means 2008 SP2.
February 27, 2011 at 3:58 am
SQLkiwi (2/27/2011)
mazzz (2/27/2011)
Thanks Paul. I'm pretty sure both Test and Live servers are at SP2 as they're both fairly recent installs, but will double check tomorrow when I'm back in the office.
I'm going to assume that means 2008 SP2.
Sorry, yes (am still on 1st coffee of the day...) - 2008 SP2 x64 Standard
February 27, 2011 at 4:59 am
mazzz (2/27/2011)
Sorry, yes (am still on 1st coffee of the day...) - 2008 SP2 x64 Standard
No worries. If it's not that bug, another likely candidate is poor cardinality estimation, leading SQL Server to start off taking row locks but then being unable to escalate to a table lock due to other concurrent incompatible locks. The DMV sys.dm_db_index_operational_stats shows the number of lock escalation attempts and successes for an index.
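A minimal sketch of the sort of query that pulls those counters - the table name here is just a placeholder for whatever object the locks were being taken on:

SELECT OBJECT_NAME(ios.object_id) AS table_name,
       ios.index_id,
       ios.row_lock_count,
       ios.page_lock_count,
       ios.index_lock_promotion_attempt_count,
       ios.index_lock_promotion_count
FROM sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID('dbo.OffendingTable'), NULL, NULL) AS ios;  -- placeholder table name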
February 28, 2011 at 4:55 am
Thanks Paul
The version is indeed 10.00.4000, i.e. SP2.
sys.dm_db_index_operational_stats shows the following for the offending table/index:
leaf insert count: 8,283,880
nonleaf insert count: 26,707
row lock count: 12,352,306
page lock count: 56,674
index lock promotion attempt count: 2
index lock promotion count: 2
So, 12 million row locks and only 2 lock promotions, only 2 attempts?
This bad cardinality estimation - is this likely a query design issue or can I check for anything on the server?
Thanks again
February 28, 2011 at 5:29 am
mazzz (2/28/2011)
So, 12 million row locks and only 2 lock promotions, only 2 attempts?
Are you certain those operational statistics are for the index being used by the query in question?
Do any of the other indexes on the table show significant escalation attempts?
The index does not have table locks disabled, does it? (Check the lock_escalation value in sys.tables - there's a query sketch further down this post.)
Is it definitely a single query that takes so many row locks?
How did you count the locks?
Does the query run under the default read committed isolation level or at, say, repeatable read?
Do any loop joins in the procedure's query plan have the With(Un)OrderedPrefetch attribute? (check the properties window for that)
Is the table partitioned?
This bad cardinality estimation - is this likely a query design issue or can I check for anything on the server?
You could check that statistics are up-to-date on the table (especially if you update them manually).
An estimated query plan for SELECT COUNT(*) FROM Table should estimate roughly the correct cardinality of the table.
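For reference, rough sketches of the lock_escalation check and the statistics check - the table name is a placeholder:

-- lock_escalation setting for the table (0 = TABLE, i.e. escalation allowed)
SELECT name, lock_escalation, lock_escalation_desc
FROM sys.tables
WHERE name = 'OffendingTable';  -- placeholder

-- when each statistics object on the table was last updated
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.OffendingTable');  -- placeholder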
edit: read back a bit and saw you had already answered some of these questions.
February 28, 2011 at 6:04 am
Answers inline below, after the dashes. Thanks so much for your time with this
SQLkiwi (2/28/2011)
mazzz (2/28/2011)
So, 12 million row locks and only 2 lock promotions, only 2 attempts?
Are you certain those operational statistics are for the index being used by the query in question? - I picked the objectid, indexid from the results of sp_lock
Do any of the other indexes on the table show significant escalation attempts? - There is one other index and yes, it does
The index does not have table locks disabled does it (check lock_escalation value in sys.tables)? - the value is 0
Is it definitely a single query that takes so many row locks?
How did you count the locks?
Does the query run under the default read committed isolation level or at, say, repeatable read? - it runs under default, which is read committed
Do any loop joins in the procedure's query plan have the With(Un)OrderedPrefetch attribute? (check the properties window for that) - I could find no "With(un)OrderedPrefetch" text in the XML plan, if that is sufficient?
Is the table partitioned? - no
This bad cardinality estimation - is this likely a query design issue or can I check for anything on the server?
You could check that statistics are up-to-date on the table (especially if you update them manually). - hmm, looks like the update stats job was never set up for the test/dev server. They run nightly however on the Live server. Last updated on Test on 8th Feb
An estimated query plan for SELECT COUNT(*) FROM Table should estimate roughly the correct cardinality of the table. - estimated number of rows matches actual number
edit: read back a bit and saw you had already answered some of these questions.
February 28, 2011 at 7:01 am
Excellent quality answers - thank you! On the XML plan thing, I should have made it clearer that you are looking for WithOrderedPrefetch or WithUnOrderedPrefetch, not With(un)OrderedPrefetch.
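If it's easier than eyeballing the XML, a rough sketch like the following would flag cached plans carrying either attribute - it just casts the plan to text and does a crude pattern match, so treat it as a quick filter rather than anything precise, and confirm the details in the properties window as above:

SELECT cp.plan_handle, qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE CAST(qp.query_plan AS nvarchar(max)) LIKE N'%Prefetch="true"%';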
So, the other index shows a lot of escalation attempts. How many, and how many were successful? I should have asked before if the statistics you provided were for the clustered index on the table.
My current guess is that the index with the escalation attempts is non-clustered and the plan includes a Key Lookup from that index, with an ordered or unordered prefetch on the lookup. If the XML plan is available and does not contain sensitive information you could email it to me at my public email address SQLkiwi@gmail.com for a deeper analysis if required.
So, I am thinking that the index locks tried to escalate but couldn't, due to concurrent activity holding an incompatible lock each time escalation was attempted. The prefetch resulted in the clustered index row locks being held longer than is ordinarily the case under read committed (exactly how much longer may depend on the exact plan shape and operators).
This is my crystal-ball guess; I am quite prepared to be wrong.
Anyway, I get the impression from reading the comments earlier in this thread (not the wayward remarks about parallelism!) that this was a one-off occurrence - would that be correct, or does this excessive-locking problem happen regularly?
Paul
February 28, 2011 at 8:03 am
SQLkiwi (2/28/2011)
Excellent quality answers - thank you! On the XML plan thing, I should have made it clearer that you are looking for WithOrderedPrefetch or WithUnOrderedPrefetch, not With(un)OrderedPrefetch.
Ah, gotcha - I found one of each for Nested Loops on this table - I think. I'll put my XML glasses on and take a better look!
So, the other index shows a lot of escalation attempts. How many, and how many were successful? I should have asked before if the statistics you provided were for the clustered index on the table.
Hmm, I might have misread your question in the previous post. There weren't many escalation attempts from the clustered index on Test. On Live, there were:
(index id=1 is clustered PK)
Test:
index_id    row_lock_count    page_lock_count    promotion_attempt_count    promotion_count
1           9,090             6,591              2                          2
5           12,352,306        757,858            2                          2
Live:
index_id    row_lock_count    page_lock_count    promotion_attempt_count    promotion_count
1           17,816,708        25,703,567         23,725                     10
5           16,748,744        1,135,453          1,082                      11
My current guess is that the index with the escalation attempts is non-clustered and the plan includes a Key Lookup from that index, with an ordered or unordered prefetch on the lookup. If the XML plan is available and does not contain sensitive information you could email it to me at my public email address SQLkiwi@gmail.com for a deeper analysis if required.
I have the plan, but it is an estimated one. More than happy to email it, if you're sure you don't mind looking at it - it's huge!
So, I am thinking that the index locks tried to escalate but couldn't, due to concurrent activity holding an incompatible lock each time escalation was attempted. The prefetch resulted in the clustered index row locks being held longer than is ordinarily the case under read committed (exactly how much longer may depend on the exact plan shape and operators).
This is my crystal-ball guess; I am quite prepared to be wrong.
Anyway, I get the impression from reading the comments earlier in this thread (not the wayward remarks about parallelism!) that this was a one-off occurrence - would that be correct, or does this excessive-locking problem happen regularly?
Paul
I believe the excessive locking happened once before on Live, because the 1204 error was present both on Live when it happened the first time and, of course, on Test now.
After the Live system failed, a CPU was removed (the 3rd party thinking parallelism was the issue), but 2GB of RAM was also added at the same time, doubling the server's RAM. I believe if the RAM hadn't been added it likely wouldn't have failed.
February 28, 2011 at 9:36 am
OK, that makes sense. The 23,725 attempts and 10 successes on the clustered index on Live show that escalation from row (or page) locks to a table lock is being triggered quite frequently, but rarely succeeding - as I mentioned before, escalation can only succeed if no other process on the system holds a lock that would be incompatible with the escalated shared table lock.
The lock escalation rules are reasonably simple (see http://blogs.msdn.com/b/sqlserverstorageengine/archive/2006/05/17/lock-escalation.aspx) although they might get tweaked slightly from version to version. Escalation is first triggered when a process holds 5,000 locks at once, and is retried every time an additional 1,250 locks are added. 23,725 * 1,250 is about 30 million locks - pretty close to the 18M/26M row/page locks recorded.
Read committed normally takes a lock, reads the data, and then releases the lock - so a scan or range scanning seek does not normally hold many locks at once. Pre-fetch (read-ahead) with a lookup changes this because SQL Server has to hold the pre-fetching shared lock until the lookup is performed, and perhaps for the length of the query.
Estimated plan will be fine (not as good as actual, but it'll do). If it is huge, please zip it and rename the file something like plan.zipped because gmail doesn't like zips. There's a limit of 10 or 30MB, not sure which. If I can add anything useful from looking at it, I'll let you know (may be tomorrow now).
As far as avoiding excessive locking is concerned, you have a number of options. Partitioning the table would allow you to configure partition-level escalation. A row-versioning isolation level (most commonly read-committed snapshot isolation - RCSI) would avoid taking shared locks in almost all circumstances. If you don't need consistent or reliable results, read uncommitted is another option. I think that's about it without being able to change the query itself - though each of those has consequences and would need proper evaluation and testing.
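For reference, the statements involved would look roughly like this - the table and database names are placeholders, and as above each option needs proper evaluation and testing first:

-- partition-level escalation (only meaningful once the table is partitioned)
ALTER TABLE dbo.OffendingTable SET (LOCK_ESCALATION = AUTO);

-- read committed snapshot isolation for the whole database
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;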
If you could change the code, the simplest change would be to identify the processes that are preventing lock escalation and tune them to hold incompatible (data modification) locks for as short a time as possible. If you can, forget everything you read in this thread about parallelism.
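Coming back to identifying the processes that block escalation: a rough starting point might be something along these lines while the report is running - just a sketch that lists sessions holding modification-type locks in the current database:

SELECT tl.request_session_id,
       tl.resource_type,
       tl.request_mode,
       tl.request_status,
       OBJECT_NAME(p.object_id) AS table_name  -- resolved for row/page/HoBt resources; NULL for object-level locks
FROM sys.dm_tran_locks AS tl
LEFT JOIN sys.partitions AS p
       ON p.hobt_id = tl.resource_associated_entity_id
WHERE tl.resource_database_id = DB_ID()
  AND tl.request_mode IN ('IX', 'SIX', 'U', 'X');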
Paul
February 28, 2011 at 10:27 am
SQLkiwi (2/28/2011)
OK, that makes sense. The 23,725 attempts and 10 successes on the clustered index on Live show that escalation from row (or page) locks to a table lock is being triggered quite frequently, but rarely succeeding - as I mentioned before, escalation can only succeed if no other process on the system holds a lock that would be incompatible with the escalated shared table lock.
Thanks, that makes perfect sense
The lock escalation rules are reasonably simple (see http://blogs.msdn.com/b/sqlserverstorageengine/archive/2006/05/17/lock-escalation.aspx) although they might get tweaked slightly from version to version. Escalation is first triggered when a process holds 5,000 locks at once, and is retried every time an additional 1,250 locks are added. 23,725 * 1,250 is about 30 million locks - pretty close to the 18M/26M row/page locks recorded.
Ah, I do like it when things add up!
Read committed normally takes a lock, reads the data, and then releases the lock - so a scan or range scanning seek does not normally hold many locks at once. Pre-fetch (read-ahead) with a lookup changes this because SQL Server has to hold the pre-fetching shared lock until the lookup is performed, and perhaps for the length of the query.
Thanks for the explanation
Estimated plan will be fine (not as good as actual, but it'll do). If it is huge, please zip it and rename the file something like plan.zipped because gmail doesn't like zips. There's a limit of 10 or 30MB, not sure which. If I can add anything useful from looking at it, I'll let you know (may be tomorrow now).
Done. Please don't spend much time on it - the query is part of a 3rd party application, unless something very obvious jumps out at you I'm going to have to take it back to them to sort out anyway. I spoke to one of their guys last week regarding a different issue, and they seemed to think that SELECTs won't block anything, ever, so... we shall see.
As far as avoiding excessive locking is concerned, you have a number of options. Partitioning the table would allow you to configure partition-level escalation. A row-versioning isolation level (most commonly read-committed snapshot isolation - RCSI) would avoid taking shared locks in almost all circumstances. If you don't need consistent or reliable results, read uncommitted is another option. I think that's about it without being able to change the query itself - though each of those has consequences and would need proper evaluation and testing.
If you could change the code, the simplest change would be to identify the processes that are preventing lock escalation and tune them to hold incompatible (data modification) locks for as short a time as possible.
As mentioned previously, that's something I'll have to take up with the software vendor. Thanks so much for giving me all this info
If you can, forget everything you read in this thread about parallelism.
In what sense? Forget it because it most likely won't be related to parallelism, given the info above?
The vendor told us that another of their big clients had to set MAXDOP to 1 before it could work, which was why they suggested the same to us.
Thanks again for all your help, it is very much appreciated!
February 28, 2011 at 10:36 am
mazzz (2/28/2011)
If you can, forget everything you read in this thread about parallelism.
In what sense? Forget it because it most likely won't be related to parallelism, given the info above? The vendor told us that another of their big clients had to set MAXDOP to 1 before it could work, which was why they suggested the same to us.
I said 'forget about it' because 95% of the information was just plain wrong.
Vendors are often big on MAXDOP 1. Then again, not many vendors are experts in SQL Server.
Thanks for the interesting thread, and useful huge query plan!
Paul