Un-detected deadlocks?

Question

Post reply

Un-detected deadlocks?

Neil Miller

SSC Enthusiast

Points: 120
More actions
April 25, 2006 at 2:15 pm

#152935

Hi,
I seem to have a problem that might be caused by undetected deadlocks on my system which is causing regular minute-long freezes. The way to resolve the problem is to reboot the SQL Server box, but this is inconvenient as it is a production system, and I was wondering whether any forum users might have any insight into this problem? Details below:
We have a main and a standby SQL Server 2000 enterprise edition installation (service pack 3) running on Windows 2000 service pack 4.
It has a service application (the "Eeams" server) with 21 threads connecting to it, one of these being called the "chain" thread (just a name for it) and the other 20 being called "subject" threads (again just a name).
When a job comes in to the Eeams service it is processed by one of the subject threads, calling a stored procedure called "process_subject". It updates the "subject" table (and some others) and adds an entry to the "chain" table.
Then asynchronously the "Eeams" service repeatedly calls a single "process_chain" stored procedure which processes the "chain" table entry, reads data from the "subject" table, adds other table entries, and deletes the "chain" table entry.
These stored procedures normally take under one second to run. However the Eeams server has a 60 second timeout after which it will cancel them.
For months this arrangement works OK. Occasionally there are deadlocks between the two processes, but these are handled correctly.
However every few months the service gets into a mode where every 5 minutes or so, intermittently, the system will sieze up with both of the stored procedures running, and after 60 seconds the Eeams service cancels both procedures, and the system returns to normal, processes some more items, then siezes up again.
Running a SQL Profiler trace shows that the queries do indeed take 60 seconds before being cancelled. It also shows other shorter queries from other threads running OK and taking under a second to complete.
Its as if there is a deadlock between the two processes that is not being detected. Is this a known problem with SQL Server?
Although the system is busy, with typically 2 to 6 "subject" threads running the
SQL Enterprise Manager "Current Activity" shows the "chain" thread is blocking the five "subject" threads.
The %CPU and memory are all within acceptable limits, and there is no excessive pageing.
Many Thanks

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Tom Goltl SSCertifiable Points: 5815 More actions · Answer 1

Are things indexed correctly? I'm don't know your application but you could also use the nolock hint for dirty reads.

Tom

Neil Miller SSC Enthusiast Points: 120 More actions · Answer 2

I didn't write the SQL code or indexes, but I think they are correctly configured and optimised otherwise I would have problems all the time not just after several months of running.

Tom Goltl SSCertifiable Points: 5815 More actions · Answer 3

I would not make that assumption. You're assuming that they are good programmers which isn't always the case. Some other things I'd look at: Are statistics being updated? Are indexes being rebuilt? Are table scans happening? How big are the tables?

Tom

Rudyx - the Doctor SSC-Forever Points: 43695 More actions · Answer 4

Assume nothing !!!

Here's my 'short list' for tuning:

Round I

DBCC UPDATEUSAGE

UPDATE STATISTICS (with FULL scan) for all tables

exec sp_recompile for all tables

exec sp_refreshview for all views

Round II

DBCC DBREINDEX

UPDATE STATISTICS (with FULL scan) for all tables

exec sp_recompile for all tables

exec sp_refreshview for all views

Round III

Profiler

Query Hints

RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

Neil Miller SSC Enthusiast Points: 120 More actions · Answer 5

Thanks - will try running these.

What seems wierd to me is that when the system is in this failing mode the "chain" and "subject" sp calls will run fine, each taking under a second to run, and then suddenly they will both just lock up, and will be timed out by the "Eeams" server after 60 seconds. A SQL profiler trace shows that the stored procedures took 60000 milliseconds. If it was an indexing problem then surely all calls of the stored procs would be slow, not just one or two every 5 minutes (FYI these are normally called several times a second).

Thinking aloud, maybe the problem is that the ones that complete quickly are using an index, and the slow ones are doing a table scan, for some bizarre reason.

What would be useful is to try to get SQL Profiler to produce an execution plan for each sp call it monitors. Then when the problem happens I would be able to see whether it is doing anything silly like doing a table scan. However it does not appear to be possible to do this in SQL Profiler.

I can't run the sp from Query Analyzer because the results are sent to other services and would cause loss of synchronisation with downstream systems.

I can't replicate the problem on a test system.

My money is on SQL being unable to detect that there has been a deadlock and not killing, but I may well be wrong.

Cheers again.

Rudyx - the Doctor SSC-Forever Points: 43695 More actions · Answer 6

Before all is said and done I'd like to suggest that the issue may be 'blocking' not 'deadlocking'. I say this because of the lack of a mention of messages posted from the SQL errorlog - when SQL finds a deadlock it selects a 'victim' and 'kills' them - this process results in a brief message in the SQL Server error log. If you have some of these messages you mught look into trace flags 1204/1205. They can provide you additional information in the SQL Server errorlog. But again, the absence of therse indicative signs in the post led me to provide the 'quick list'.

RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

Neil Miller SSC Enthusiast Points: 120 More actions · Answer 7

Thanks Rudy,

The SQL Error log and in fact the event log of the "Eeams" server do occasionally (once a day) post a deadlock message, and state that one of the threads has been chosen as a deadlock victim. However would it not be the case that if a deadlock occurred that SQL Server failed to detect due to a bug or the fact that it is a 3-way deadlock, or some other reason, that there would be no SQL error log message because SQL is not aware of it ocurring?

The problem might be blocking, but what is causing the procedure at the head of the blocking chain to take over 60 seconds to complete every 10 minutes or so, when between these 10 minute intervals it completes in under one second.

I'll look into the suggestions you've made.

Cheers

Rudyx - the Doctor SSC-Forever Points: 43695 More actions · Answer 8

A deadlock can only involve 2 parties to the best of my knowledge. If there is a third (or more), I believe that they would be blocked. So it sounds as if you have deadlocks and blocks. I'd suggest implementing the trace flags quickly as startup flags. Since you say it's only happening once a day I would not worry about errorlog size issues. As for the mention of a potential 'bug', well thats always possible, but I've found SP3 to be pretty reliable. Usually when there are 'deadlock' issues I've found the root cause to be a mixture of non-optimal application and database design. Good luck !

RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

tebaida SSC Journeyman Points: 87 More actions · Answer 9

since you update in one of the threads the subject table and you read from it i would offer that you try to do the updates with the (rowlock) hint and/or the reads to be done with (nolock) hint or just to set for this call the transaction isolation level to readuncommited.

Neil Miller SSC Enthusiast Points: 120 More actions · Answer 10

Well we found the cause of the problem - it was our Compaq Storageworks SAN that had three failing drives, and was giving errors. Quite why it failed in this manner is not clear - I thought that once a drive failed in a RAID configuration that was it - it was taken off line.

So now I'll have to try to find out how we can monitor for this and see how to prevent it in the future.