December 3, 2015 at 11:56 am
Hi All,
Seeking for your guidance on Tempdb
We have tried to optimize Tempdb is out env. Below are the ones we have taken care.
•enabled instant file initialization
•pre-sized the data files. All of them has same size each 90GB
•created multiple data files of same size and same auto growth settings
•auto growth specified in fixed sizes i.e. 500MB
•enabled trace flag -T1118
•Still we see I/O taking more 15 ms in SQL server Errorlog
•No of physical cpus= 2
•no of logical cpu count = 24
Not sure what is wrong. Can anybody share your thoughts on this?
From SQL Errorlog:
spid7s SQL Server has encountered 3146 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [T:\SQL\TEMP\tempdb_1.mdf] in database [tempdb] (2). The OS file handle is 0x0000000000001350. The offset of the latest long I/O is: 0x0000033
spid7s SQL Server has encountered 3147 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [T:\SQL\TEMP\tempdb_2.mdf] in database [tempdb] (2). The OS file handle is 0x00000000000014A4. The offset of the latest long I/O is: 0x0000033
Thank you.
December 3, 2015 at 3:00 pm
If it takes 15 seconds to grow tempdb by merely 500 MB, your disks suck :hehe:.
Disk performance needs to be measured by using no greater than a 5 second polling interval, because the stalled IO message concerns a specific 15 second period of time.
If you want to be sure where the stalled IO problem exists (in WIndows or the HW), follow http://blogs.msdn.com/b/ntdebugging/archive/2010/04/22/etw-storport.aspx
You should also consider lowering autogrowth to 250 MB, ensure tempdb's autogrowth is only used in emergencies (by equally sizing tempdb files to the maximum space needed by SQL Server's clients), and ensure anti-virus SW is not attempting to check tempdb's files (more generally, follow https://support.microsoft.com/en-us/kb/309422).
But the bottom line is that 15 seconds is an aeon.
December 4, 2015 at 6:30 am
Thank you.
December 4, 2015 at 6:43 am
To start, there's no indication the slow IOs are caused by autogrow operations, so changing your autogrow settings is probably not indicated at this point.
For your disks to be taking over 15 seconds, on over 3000 operations in a 2 minute period, pretty much indicates that either your IO subsystem is under massive strain or you're running TempDB off drives slower than the old floppy drives. Go sit with your SAN admin/sysadmin and investigate why you're getting such slow response off the IO subsystem.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 4, 2015 at 11:25 pm
Thanks Gail.
December 4, 2015 at 11:32 pm
Hi Gail,
We do see high latency values over a sustained period of times. But SAN guys are stating they don't find any latency at LUN layer from their end. they are claiming that max they can see 4ms latency.
How can I prove it is a SAN issue? is there any specific perfmon counters which can prove there is problem in I/O path? I mean what all things can be checked?
For instance, at one instance
Min AvgMax
Greater than 25 ms physical disk READ response times PhysicalDisk(14 H:)\Avg. Disk sec/Read .006 .506 6.269
Greater than 600 ms - Slower than a 3.5 inch floppy drive LogicalDisk(H:)\Avg. Disk sec/Read .006 .465 5.914
December 5, 2015 at 1:41 pm
Don't 'prove it's a san issue'. Go sit with the SAN admins and debug this together. Pull the IO latency stats from the server and go and sit with them and work with them. If the server is seeing x ms latency and the disks are showing y ms latency and x >>> y, then between you all, you need to work out why.
Trying to prove it's some one else's problem is not going to get it fixed, it's going to lead to a finger pointing problem
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 5, 2015 at 8:36 pm
vsamantha35 (12/3/2015)
•No of physical cpus= 2•no of logical cpu count = 24
I'm thinking that the logical to physical CPU ratio on that box is the real problem.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 6, 2015 at 8:31 am
We do see high latency values over a sustained period of times. But SAN guys are stating they don't find any latency at LUN layer from their end. they are claiming that max they can see 4ms latency.
How can I prove it is a SAN issue? is there any specific perfmon counters which can prove there is problem in I/O path? I mean what all things can be checked?
Follow http://blogs.msdn.com/b/ntdebugging/archive/2010/04/22/etw-storport.aspx. You have the number by which you will need to filter (4 ms), but you might want to be a little more generous (say 5 ms or 10 ms). If the SAN folks are grossly overlooking the latency, the trace will generate GB of data, quickly. The hardware manufacturer's driver exists between the LUNs and storport.sys (who manages it?).
SQL Server is complaining about a specific 15 second period of time. Consider the polling interval used by the SAN folks, versus the polling interval used by perfmon, versus the 15 second interval of time SQL Server is concerned about: If the SAN folks LUN polling interval is (let's say) 15 minutes, there is a snowball's chance in hell of their polling being able to catch a 15 second delay. The "Avg. Disk sec/Read .006 .506 6.269" numbers suggest their median is closer to that .006 minimum than that .506 average. With an (imagined) 15 minute polling interval, the median is far more likely to be measured (repeatedly).... Because the Avg. Disk sec/whatever perfmon counter is itself an Avg., what perfmon reports is also impacted by perfmon's polling interval.... Ensure every process uses the same polling interval. To be certain about where the problem exists, the polling interval used by every process (including the SAN's measurement of latency at the LUN level) should be less than 15 seconds, preferably ~5 seconds.
If the SAN folks are correct, the storport trace will catch nothing. If the storport trace catches nothing while (at the same time) the perfmon counters say "slow" or SQL Server says "waited 15 seconds", you are looking at an OS-caused delay.
You should also consider perfmon's Disk Bytes/sec, Avg. Disk Bytes/whatever, and the various disk queue length counters, for all disks that use a common IO path (driver).
December 7, 2015 at 11:44 am
Why could CPU ratio can be a problem? could you please elaborate ?
December 7, 2015 at 11:49 am
Hi SoHelpMeCodd,
Thanks for your inputs. We did used the polling interval of perfmon as 5 secs. I need to try the storport tracing and see.
Thanks again.
December 8, 2015 at 2:03 am
Jeff Moden (12/5/2015)
vsamantha35 (12/3/2015)
•No of physical cpus= 2•no of logical cpu count = 24
I'm thinking that the logical to physical CPU ratio on that box is the real problem.
Could be by physical CPUs he means processors, not physical cores. That would mean there's two 6-core processors with hyperthreading, for 24 logical cores
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply