May 5, 2010 at 9:55 am
Guys,
I was asked whether we had any timeouts this morning. When I checked, I found that for about 4 minutes the production server logged this message:
SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file (logged once for each database). What does this mean? It happened 4 hours ago and has not recurred since. Can someone tell me what I should check and do? Thank you
May 5, 2010 at 9:58 am
When I have seen this, it usually indicates a possible issue with your I/O subsystem. How is your server configured? Is it using a SAN, NAS, or DASD?
How are the disk systems configured?
May 5, 2010 at 10:13 am
I can't get to the server. Is there a SQL statement I can run to check? Thank you
May 5, 2010 at 10:21 am
Try asking the sys admins responsible for the servers.
May 5, 2010 at 10:57 am
Krasavita (5/5/2010)
I can't get to the server. Is there a SQL statement I can run to check? Thank you
You won't be able to query that information. You will need to get that info from a Server Admin or somebody who configured the disk subsystem.
Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw
Learn Extended Events
May 5, 2010 at 11:00 am
Thank you
May 5, 2010 at 11:03 am
You're welcome.
Jason...AKA CirqueDeSQLeil
May 6, 2010 at 3:18 pm
I will bet $$ that this server is running Diskeeper.
May 6, 2010 at 5:09 pm
NJ-DBA (5/6/2010)
I will bet $$ that this server is running Diskeeper.
It's also possible that there is some other form of disk defragmenter on the system that runs a scheduled defrag.
Jason...AKA CirqueDeSQLeil
May 7, 2010 at 6:03 am
I have had this issue myself. Let me tell you, it is one of the most difficult issues to track down. I even had a ticket open with Microsoft, and they could never nail down the cause either.
May 7, 2010 at 6:11 am
Did you ever figure it out? I have it going on right now also and am working to get hosting to remove the defrag utility (which I know has caused this exact problem in similar environments).
May 7, 2010 at 6:16 am
No, we never did. Our production cluster was moved from a very old SAN to an old one that was 'faster', and that is when the problem showed up. We looked at everything and made some tweaks, but nothing helped. We had a cast of characters involved. We ended up just moving it to our latest and greatest SAN, and thankfully the problem went away. I have two test SQL Servers that are experiencing this same issue, but it is not causing any heartache.
May 7, 2010 at 8:32 am
I get this error periodically (2 clusters + EMC CX3 SAN over 4Gb FC), and it's absolutely IO with the SAN.
I can trace it down to when I have a read + bulk write on cluster1 (importing from staging tables) from the same disk set, plus a user runs a query on cluster1 over a big piece of the target table (same disk set), plus I have a backup writing to the SAN from cluster2 (to a different disk set). These 3 things are too much at once.
This is tipping multiple pieces over the edge:
1. The disks are not at, but close to, max IOPS.
2. The cache on the storage processors: very heavy reads plus very heavy writes can be nasty to cache.
3. The controller cards in the server, plus their cache, can only work so hard.
4. The SAN FC switches.
All I can offer here is this: first check your hardware setup, then make sure you know your processes and run windows well enough to know what's happening and when. Start with the scheduled processes (SQL Agent, Windows scheduler, and any other software with a scheduler). Batch processes are typically the heaviest, and this is where you'll find them.
You can get your SAN vendor involved but they usually charge for the diagnostic work (and usually tell you you need to buy more hardware to fix it).
If you have a storage team, ask them if they can monitor the cache on the SAN & see if this is taking a hit. In the EMC world, this can either be the cache switching from read to write to read (etc), or the SAN stopping the cache totally as a data-defense mechanism which does terrible things to your IO throughput since everything has to fully commit before being considered done. Again, this is EMC, but I'd think that other vendors would have similar defensive methods around the SAN's cache.
NOTE: If you are running over ethernet (vs FC):
* Make sure your SAN traffic is dedicated to a physically separate NIC with TOE capability.
* Have your network team monitor the switches connecting the server to the SAN -- these may simply be overwhelmed by the SAN data volume. Also check whether these switches are shared with anything else -- a completely unrelated activity (think file copy, tape backup, etc.) may be adding to the switches' load so that you're hitting their capacity.
If you think you have interactive processes contributing to the problem, like my user query, then you can either 1) catch it in the act, 2) trace it, or 3) record logins to see who was using the system at that time & ask them what they were doing at the time.
The good news is that you have a timestamp from the error so you know where & when to start looking for the offenders.
The bad news is it's usually a combination of things that tips it over the edge, so it's hard to predict and prevent. You can throw more SAN at it, or look at your processes and balance the run windows to at least minimize your chance of the error.
Also, Diskeeper was mentioned in a previous post. Check whether any defrag or other disk cleanup activities are happening. I've seen multiple discussions here questioning whether defragging a SAN does any noticeable good, and I'll let you do your own reading and reach your own conclusions on that one. Beyond the SAN, even defragging the local disks will have an impact on the overall system (especially when defragging the disks that hold the Windows pagefile), so it's worth looking into.
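To illustrate the "start from the timestamp" approach above, here is a minimal Python sketch that lists which scheduled jobs were running around the time of the warning. The job names, run windows, and error time are all made up for illustration; in practice you would collect them yourself from SQL Agent job history, the Windows scheduler, backup logs, and so on.

```python
from datetime import datetime, timedelta

# Hypothetical run windows: (job name, start, end). Not real data.
JOBS = [
    ("staging import (cluster1)", datetime(2010, 5, 5, 5, 30), datetime(2010, 5, 5, 6, 15)),
    ("backup to SAN (cluster2)", datetime(2010, 5, 5, 5, 45), datetime(2010, 5, 5, 6, 30)),
    ("scheduled defrag", datetime(2010, 5, 5, 2, 0), datetime(2010, 5, 5, 3, 0)),
]


def jobs_active_at(jobs, when, slack=timedelta(minutes=5)):
    """Return the names of jobs whose run window, widened by `slack`, contains `when`."""
    return [name for name, start, end in jobs
            if start - slack <= when <= end + slack]


# Hypothetical time at which the long-I/O warning was logged.
error_time = datetime(2010, 5, 5, 5, 55)

suspects = jobs_active_at(JOBS, error_time)
print(suspects)
```

With these made-up windows, the import and the backup both overlap the error time, which is the kind of pile-up described above. The `slack` padding is an assumption to allow for clock drift and jobs that start or finish slightly off schedule.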
May 7, 2010 at 8:38 am
One way I could 'force' these errors was to run an integrity check. We did everything you said, and there was no smoking gun.
May 8, 2010 at 7:04 pm
For us, it was a virus checker: someone forgot to exclude MDF, LDF, and NDF files. It brought us to a crawl. Diskeeper, on the other hand, has never been a problem for us. We have it set to operate when it's "quiet", it's a pretty tight set of databases, and we do regular maintenance on all the machines, so there's usually not a whole lot of physical fragmentation.
--Jeff Moden
Change is inevitable... Change for the better is not.