Avg Disk Queue Length Issues

Question

Avg Disk Queue Length Issues

Orcward

SSC Veteran

Points: 270
More actions
February 22, 2022 at 10:19 am

#3986785
Hi All,
We are running a production system with MS SQL2016 Std and for the past year or so things have been going well... Then all of as sudden about 1 month ago I started recieving complaints about Server Timeouts. I used Perfmon to analyze the Server and found that Avg Disk read and write spiked but came down just as fast as they went up and for SQL this should be expected..... however as soon as I added the Queue Length to the Graph it would spike to 100 and just stay there indefinitely. so the general trend is, this is the SAN.... but then I go to the O/S and I took 4 of my largest ISO's, compressed them and copied them to the SAN volume and this +5GB file was copied in just over 3 seconds. On closer inspection I found that the O/S is responding well. It's able to operate as expect and it's responding quickly. So the question is where do i look next.. I have some sneaky suspicion that SQL may be causing this issue and not the SAN but I have no way of determining exactly how to pin point my problem so we can get back to normal.
Just to clarify : the Above server is a VMWare instance running 10 vCPU, 65GB Ram, and all disks including C are on the SAN. So if the SAN was misbehaving we would see it on every volume..... not selected volumes. We have multiple data files spread amongst multiple SAN Drives and we are using partitioned tables. Before you jump on it, the above issue is affecting all of our tables. Some of them are partitioned and some are not. We have checked all the indexes, execution plans etc and cannot see anything wrong.
My Suspicion : I think SQL is batching all disk read/writes and we are seeing timeouts because SQL is not acting on the read/write efficiently.
Kind Regards,
Lloyd
- This topic was modified 4 years ago by Orcward.

Viewing 9 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Grant Fritchey SSC Guru Points: 399379 More actions · Answer 1

When I see a sudden change in behavior like this, first is checking all the servers & os's, etc., for changes. A new patch on the SAN or on the network or anything else along those lines? Second, code. New deployments, changes to the code or structures? Third, data. As data changes you may see queries that previously ran fine, suddenly running badly, as the rows returned results in changes to execution plans & subsequent poor performance. Do you have execution plans from when it was running well that you can compare to execution plans now?

That's about what I've got based on the info we have.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Orcward SSC Veteran Points: 270 More actions · Answer 2

Thanks for your feedback. Sadly I have checked all 3 items you mentioned and i have received confirmation that the SAN has not changes, the Server hasnt taken any updates and we havent done a software drop/update for about 5 months. So although we are running a Transactional Database, we havent added or changed anything so the above issue is very perplexing.

Grant Fritchey SSC Guru Points: 399379 More actions · Answer 3

Data.

As data changes, so does behavior. A query that was fine for 10k rows isn't for 10million. Check the execution plans. Look at row estimates there. Compare them to reality. Sometimes it could just be statistics. However, probably, it's the data has changed and you need to look at the query & structures for new tuning opportunities.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

Jeff Moden SSC Guru Points: 1004703 More actions · Answer 4

Am I understanding correctly that you say there have been no data changes in the server instance for 5 months and no one has changed any code in the server nor from anywhere that talks to the server and that no one has made any additional index nor removed any? And, to be certain, no one has made the mistake of using SHRINKDB or SHRINKFILE and there's been no need to do any index maintenance or statistics maintenance because no data has changed? To with, you're saying that the server has been basically read-only for 5 months (even though you've not set the databases to read only)?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Sergiy SSC Guru Points: 110209 More actions · Answer 5

Orcward wrote:

Thanks for your feedback. Sadly I have checked all 3 items you mentioned and i have received confirmation that the SAN has not changes, the Server hasnt taken any updates and we havent done a software drop/update for about 5 months. So although we are running a Transactional Database, we havent added or changed anything so the above issue is very perplexing.

But you've added some new data records for certain tables over that time, right?

_____________
Code for TallyGenerator

Orcward SSC Veteran Points: 270 More actions · Answer 6

Hi Jeff,

To clarify, no the Database is 100% active as it is a transactional Database. what I was attempting to add above was that there have been no code changes to the Database (indexes or execution plans) or the software that uses these tables. I have also made no changes to the Server or the Operating System (No updates have been installed) and SAN which has been there from the very word go is still operating as expected. What I am finding really odd, is we are seeing these extremely high Disk Queues which are the source of our Timeouts on the Database. If I do a simple test by copying an extremely large file to the SAN I see no problems with the disk. I don't believe that any shrinks have been done and we have tried doing UPDATE Stats but it not given us any joy.

Regards, Lloyd

Orcward SSC Veteran Points: 270 More actions · Answer 7

Hi Sergiy,

Yes, we are adding records all the time but these tables also get cleaned up regularly so the data inside them cannot be older than a given rolling date. (nothing older then 90 or 180 days). However what I don't understand is how does this possible growth suggestion result in SQL excessive Disk Queue.... this usually points to a problematic SAN but i have done multiple write tests directly to the SAN and it's been fine so all i can think is that SQL is somehow using some caching feature and delaying it's write to disk which is a disaster for any transactional Database. I don't want SQL to queue anything, I want it to just write it to disk immediately - no need to wait. We have looked at Execution plans, we have also run Update Stats and checked the indexes and come back empty handed. we have looked at tempdb and SQL's cache which yielded no fruit.