November 26, 2007 at 7:18 pm
We have a managed SQL Server 2000 cluster running against a SAN disk set. At occasional periods we seem to suffer a very high number of SQL deadlocks and performance issues. I decided to monitor the SQL Server, and during this period I was monitoring Avg. Disk Queue Length, Current Disk Queue Length, Disk Bytes/sec, SQL transactions/sec, and a few others. Normally the disk queue sits around 0.05 to 0.5, but at certain times I noticed the current queue length would climb to around 850+ over 25 seconds with no significant increase in bytes/sec or transactions/sec. Could this be a SAN config issue? What areas can I investigate, considering this is a managed server and I only have restricted remote access to it? I can provide a screenshot of the spike.
The support company say the stats are irrelevant, that the issue is with the application, and that they can't see anything wrong with the system. They are also saying that the table locks on the database are causing the disk locks, which I understand to be impossible. They seem resistant to logic.
November 28, 2007 at 9:01 pm
Are the I/Os caused by read or write activity? Is there any pattern to the occurrences? Is it definitely SQL Server and not some other rogue process (e.g. a virus scanner, though it's not recommended to have one on clustered machines)?
Run Profiler when it happens (preferably start it before it happens) and see what queries are running, especially long-running ones. Or use sp_who2 and DBCC INPUTBUFFER to see what's currently running.
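For example, a minimal sketch of that second approach (the SPID value here is hypothetical; substitute whatever sp_who2 reports on your server):

```sql
-- List current sessions; look for a non-blank BlkBy column
-- (the blocking SPID) and unusually high DiskIO values.
EXEC sp_who2

-- Then, for a suspicious SPID (e.g. 57 -- an assumed example),
-- see the last batch that session submitted:
DBCC INPUTBUFFER (57)
```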
That's a start.
MARCUS. Why dost thou laugh? It fits not with this hour.
TITUS. Why, I have not another tear to shed;
--Titus Andronicus, William Shakespeare
November 28, 2007 at 9:13 pm
At occasional periods we seem to suffer a very high number of SQL deadlocks and performance issues.
It's not the disk... it's a chunk or two of code somewhere. Heh... you don't happen to have a "sequence table" anywhere, do you? The server log will tell you a fair amount of info if you have trace flag 1204 turned on... see Books Online under "deadlocks, troubleshooting, deadlocks involving locks". Well worth the trip...
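A quick sketch of turning that on from a query window (3605 is commonly paired with 1204 on SQL 2000 to route the trace output into the errorlog; to make the flag survive a restart you'd instead add -T1204 as a startup parameter):

```sql
-- Turn on deadlock reporting for all connections (-1 = global).
-- 1204 writes the deadlock participants and resources to the
-- errorlog; 3605 sends DBCC trace output to the errorlog too.
DBCC TRACEON (1204, 3605, -1)

-- Verify which flags are active:
DBCC TRACESTATUS (-1)
```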
--Jeff Moden
Change is inevitable... Change for the better is not.
November 28, 2007 at 9:28 pm
paulr (11/26/2007)
We have a managed SQL Server 2000 cluster running against a SAN disk set. [...] Normally the disk queue sits around 0.05 to 0.5, but at certain times I noticed the current queue length would climb to around 850+ over 25 seconds with no significant increase in bytes/sec or transactions/sec. Could this be a SAN config issue? [...] They seem resistant to logic.
If the other advice doesn't help already -
Well - it technically IS possible that the SAN is either saturated or misconfigured in a way that would produce this type of behavior. We used to have horrendous backlogs because some EMC buckaroo told our folks that two busy Exchange and SQL Server instances would live very well on the SAME arbitrated-loop SAN... Let's just say that WRONG doesn't begin to describe the wrongness of that info...
Of course - the SAN monitoring software, or its utilities, should be able to tell you that something is up (like in our case - the cache on the controller was entirely full AND there were collisions on the loops because the activity was too high).
On the other hand - the phrase "managed server" tends to scare me a little, since it somehow evokes "virtual" server to me. Virtual servers can be a B**CH to monitor, since there could be any number of other loads from other instances you don't see that could be causing your issue.
Finally (and still most likely by far) - there's some kind of activity causing this to start. What do the SQL logs say? We had a LOT of errors/warnings popping up in the logs that pointed to some cryptic message meaning, in English, "your disk subsystem sucks and isn't keeping up" (like once every few seconds). Considering how often we saw it (while we had the disks set up badly) - it's amazing there weren't timeouts everywhere.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user... All right - what was my emergency again?
November 29, 2007 at 10:43 am
1) Use Task Manager to see what process is actually doing the I/Os.
2) If you have win2k3, there are a number of bugs that will flush RAM (large file copy, HP iLO driver, others). This will lead to tremendous page file hits for a short period.
3) Monitor waitstats and other SQL Server internal information to see if SQL Server is being affected by and/or causing this issue.
Key is to know the root cause of the massive I/O spikes. Until you know that, you can't do anything useful.
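On SQL Server 2000 the waitstats in point 3 come from DBCC SQLPERF rather than the DMVs that arrived in 2005. A minimal sketch:

```sql
-- SQL Server 2000: snapshot the cumulative wait statistics.
DBCC SQLPERF (waitstats)

-- Optionally clear the counters, wait through one of the spikes,
-- then snapshot again and compare. High PAGEIOLATCH_* waits point
-- at the disk subsystem; high LCK_* waits point at blocking.
DBCC SQLPERF (waitstats, clear)
```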
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
November 30, 2007 at 4:18 am
The first question you should ask of everyone who has access to the server is "What has changed?". This might help you pinpoint the problem faster. I'm assuming these spikes only started happening recently and have not been going on since the cluster/SAN setup was created, right? If they have been going on since creation, it probably is the SAN.
Next step: definitely use Profiler to see if SQL Server is causing the problem. You might want to run System Monitor side by side for the I/O, just to see if the spikes occur at the same time as certain SQL queries.
If you were using SQL 2005, you could sync the two outputs to verify and clarify. But since you're using 2000, you'll just have to watch them side by side.
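Since this is a managed box with restricted access, a server-side trace may be easier than interactive Profiler. A rough sketch for SQL 2000, assuming a writable C:\Traces folder on the server (the path and file name are placeholders; the event and column IDs are the standard trace ones):

```sql
-- Minimal server-side trace: capture completed batches with their
-- duration, reads, and writes so they can be lined up by StartTime
-- against the System Monitor I/O spikes.
DECLARE @TraceID int, @maxfilesize bigint, @on bit
SET @maxfilesize = 50   -- MB per rollover file
SET @on = 1

-- .trc is appended to the file name automatically
EXEC sp_trace_create @TraceID OUTPUT, 0, N'C:\Traces\iospike', @maxfilesize

-- Event 12 = SQL:BatchCompleted; columns: 1 = TextData, 12 = SPID,
-- 13 = Duration, 14 = StartTime, 16 = Reads, 17 = Writes
EXEC sp_trace_setevent @TraceID, 12, 1,  @on
EXEC sp_trace_setevent @TraceID, 12, 12, @on
EXEC sp_trace_setevent @TraceID, 12, 13, @on
EXEC sp_trace_setevent @TraceID, 12, 14, @on
EXEC sp_trace_setevent @TraceID, 12, 16, @on
EXEC sp_trace_setevent @TraceID, 12, 17, @on

-- Start the trace (status: 1 = start, 0 = stop, 2 = close/delete)
EXEC sp_trace_setstatus @TraceID, 1
SELECT @TraceID AS TraceID  -- note this so you can stop it later
```

The resulting .trc file opens in Profiler afterward, where the Reads/Writes columns should make the disk-hungry batches obvious.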
If SQL Server doesn't appear to be doing it, then you need to track down all the processes in the OS that could be. Look at your anti-virus software very carefully: if set to scan on-access all the time, it can definitely cause the I/O issue every time SQL Server tries to write to the disk. AV needs to be configured to run at intervals so it doesn't interfere with SQL Server writing to its own database files.
November 30, 2007 at 10:04 am
Also ask the admins who monitor the SAN to look at the disk allocation and any hot spots.
At a previous company, we had an issue on the SAN that was actually down to another set of disks being used by a reporting server, causing an I/O bottleneck. It turned out those physical disks were in the same group as the data disks for my SQL box...