"cpu drift" possibly causing 15 second IO messages in log?

Question

"cpu drift" possibly causing 15 second IO messages in log?

Indianrock

SSC-Insane

Points: 20333
More actions
August 22, 2013 at 9:35 am

#298067

We've seen messages like the one below for years, on and off. Usually during index maintenance.
Looking for feedback from others who may have found a solution.
SQL Server has encountered 878 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [K:\tempdb_data3\tempdev3.mdf
We know our average read/write wait times are generally below 50 milliseconds (tempdb does show 1.2 seconds average write wait, no doubt skewed by updating stats on large tables with 80% sample rate ).
Our production sql cluster is currently running on 40 out of 64 cores until we upgrade the licensing model, but we've seen these messages in the log long before we upgraded to sql 2012 from 2005.
We also just switched the servers power config to run in performance mode, not energy savings ( no sql restart or reboot ). The 15-second messages still occur once in a while.
http://blogs.msdn.com/b/psssql/archive/2006/11/27/sql-server-2005-sp2-will-introduce-new-messages-to-the-error-log-related-to-timing-activities.aspx

Viewing 9 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 1

If you have CPU affinity happening either explicitly or implicitly due to the 40-core limitation that could have some affect on IO throughput.

However, the MOST common cause of IO slowness is simply too-few spindles for the IO requests being made (often in combination with too little RAM as well). The next most common cause is piling all kinds of batch-type activities on top of each other. Can't tell you the number of times I do a performance review at a client and see 4, 7, dozen+ jobs or batch activities all set to start at 00:00, or 02:00, etc. SQLJOBVIS is a free tool that can show you a gantt chart of your sql agent jobs to help detect overlaps. Do be sure to check external processes that could be happening too.

There are MANY other things that could be in play here but I think throwing them out without further information is a waste of time.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Indianrock SSC-Insane Points: 20333 More actions · Answer 2

The servers have 512GB of memory with sql allowed to use 348GB. The numerous data files are all on a Netapp SAN device. The main user database is 3.3TB with an aspstate database also receiving a lot of traffic. We have used a large number of LUNs to spread out the data files, with the transaction log for the main database on its own lun and the four tempdb data files and log file all on their own Luns. Going to add four more same-sized tempdb data files this sunday.

Most of the problem queries come from the application ( ORM generated ) and having reporting and OLTP traffic all on one database is a problem --- to be addressed along with a plan to archive/partition millions of older records. Sql agent jobs and batch jobs run by the application scheduler are spread out as much as possible.

Since we have NEVER seen database corruption from these "15-second" messages, it is mainly something upper management gets in a tizzy about. 15 seconds? Nothing is taking 15 seconds.

Sql server 2012 Enterprise 64-bit on Servers 2008R2 64-bit ( prod databases still in 2005 compatibility mode )

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 3

Indianrock (8/22/2013)
The servers have 512GB of memory with sql allowed to use 348GB. The numerous data files are all on a Netapp SAN device. The main user database is 3.3TB with an aspstate database also receiving a lot of traffic. We have used a large number of LUNs to spread out the data files, with the transaction log for the main database on its own lun and the four tempdb data files and log file all on their own Luns. Going to add four more same-sized tempdb data files this sunday.
Most of the problem queries come from the application ( ORM generated ) and having reporting and OLTP traffic all on one database is a problem --- to be addressed along with a plan to archive/partition millions of older records. Sql agent jobs and batch jobs run by the application scheduler are spread out as much as possible.
Since we have NEVER seen database corruption from these "15-second" messages, it is mainly something upper management gets in a tizzy about. 15 seconds? Nothing is taking 15 seconds.
Sql server 2012 Enterprise 64-bit on Servers 2008R2 64-bit ( prod databases still in 2005 compatibility mode )

More than a few times I have had to sit down with CIOs to explain why their SAN was the CAUSE of poor SQL Server performance, not helping to prevent poor performance. "own LUN", "large number of LUNs", etc often don't mean what DBAs think they do (not saying that is the case here - just what I see over and over again). SOOOO many ways a SAN can be sub-optimally configured for SQL Server IO performance!! And even if you do have a metric butt ton of 15K drives in big RAID 10 sets with NO overlap/sharing at all there is still the front side adapter, switch fabric and HBA(s) that can be bottlenecks. If you are getting the 15 second messages then at least during those times you have severe IO throughput/response time issues. Time-interval file IO stall analysis can help pinpoint which file(s) are getting hammered.

I feel your pain on the ORM front. One magic bullet I have found on those sometimes is a setting that caused ALL values to be passed in as Unicode character strings (i.e. N'12345' instead of 12345 for example. Total killer due to CONVERT_IMPLICITs.

Note that putting too many files (or LUNs) on too few spindles can easily DECREASE IO performance due to increased head thrashing and seek times.

Good luck getting things improved!

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Anonymous SSC-Dedicated Points: 32891 More actions · Answer 4

The I/O issue is very difficult to pin down. I have experienced these several times here and it always comes back to the back end storage one way or another. The SAN team, of course, doesn't want to admit it but in the end after poking and prodding it ends up being the storage or the configruation of the storage.

Erland Sommarskog SSC-Insane Points: 24038 More actions · Answer 5

Indianrock (8/22/2013)
Since we have NEVER seen database corruption from these "15-second" messages, it is mainly something upper management gets in a tizzy about. 15 seconds? Nothing is taking 15 seconds.

No, these 15-second delays does not cause corruption on their own. SQL Server is too smart for that.

However, those error message indicates that you disk subsystem is not to par. 15 seconds is an extremely long time for a disk access. In you original post you said We know our average read/write wait times are generally below 50 milliseconds Well, 50 ms still above what is acceptable for a spinning disk. And 1.2 seconds you mention for tempdb is outrageous.

As Kevin hinted at, getting a SAN to work properly with SQL Server can be a challenges, not the least when it comes to talking to the SAN guys. And if there are other systems using the SAN as well, predictable performance can be really difficult to achieve.

[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

Indianrock SSC-Insane Points: 20333 More actions · Answer 6

I suspect the daytime average write wait on tempdb is very low, but the average is driven very high by doing update statistics on tables with over 150 million records at an 80% sample rate.

The point of this article was to explore the possibility that these 15-second messages are bogus and caused by something odd going on with CPUs.

Erland Sommarskog SSC-Insane Points: 24038 More actions · Answer 7

Indianrock (8/23/2013)

I suspect the daytime average write wait on tempdb is very low, but the average is driven very high by doing update statistics on tables with over 150 million records at an 80% sample rate.

1) You should never see wait times like 1.2 seconds on a reasonably tuned IO system. There is no reason to accept it.

2) As I recall FULLSCAN gives better performance than high sample rates. I don't remember exactly where the tipping point is, but I think it may be around 50%. I think Linchi Shea has a blog post on this.

The point of this article was to explore the possibility that these 15-second messages are bogus and caused by something odd going on with CPUs.

I hold that as unlikely. Particularly when it is clear that you I/O subsystem needs an overhaul. You SAN guys need to review their configuration.

[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

Indianrock SSC-Insane Points: 20333 More actions · Answer 8

Thanks or the feedback. I did see this article, which references Linchi Shea, seems to show that fullscan can be faster than sampling greater than 50% or so due to multiple-threads and read ahead.

http://www.sqlperformance.com/2013/04/t-sql-queries/update-statistics-duration