July 16, 2012 at 3:32 pm
SQL 2008 R2 STANDARD RTM
4 Socket (6 Core Westmere) - Hyperthreading Disabled (24 Cores Total)
128 GB RAM
Max Dop 1
SELECT state, COUNT(*) AS 'Count of State'
FROM sys.dm_os_workers
GROUP BY state
state Count of State
RUNNING 16
SUSPENDED 126
Why do most workers in sys.dm_os_workers show SUSPENDED and none say RUNNABLE?
-- Check SQL Server Schedulers to see if they are waiting on CPU
SELECT scheduler_id, current_tasks_count, runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255 AND runnable_tasks_count <> 0
When you refresh this query, it shows scheduler_ids that have a runnable_tasks_count of 1. Is this a sign of blocking?
Performance seems like it could be much better on this server. There are 1000+ sleeping connections, with only about 5 sessions active at any point in time.
Can anyone assist with troubleshooting the SOS_SCHEDULER_YIELD wait type? Please let me know if I can provide additional information.
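For reference, the sleeping vs. active numbers can be checked with something like this (a minimal sketch using sys.dm_exec_sessions, not part of the vendor's scripts):
-- Count user sessions by status (sleeping = connected but idle)
SELECT s.[status], COUNT(*) AS session_count
FROM sys.dm_exec_sessions AS s
WHERE s.is_user_process = 1
GROUP BY s.[status]
ORDER BY session_count DESC;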
____________________________________________
b.hedge
@hedgeb
July 16, 2012 at 3:39 pm
b.hedge (7/16/2012)
Why do most workers in sys.dm_os_workers show SUSPENDED and none say RUNNABLE?
They're waiting for some resource. Can't tell what without seeing the wait types.
When you refresh this query, it shows scheduler_ids that have a runnable_tasks_count of 1. Is this a sign of blocking?
No.
Performance seems like it could be much better on this server. There are 1000+ sleeping connections, with only about 5 sessions active at any point in time.
Sleeping means doing nothing. They're not running queries, the connections are open but nothing's happening.
Can anyone assist with troubleshooting the SOS_SCHEDULER_YIELD wait type? Please let me know if I can provide additional information.
Do you have a lot of sessions with that as their last wait type?
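For example, something like this (a minimal sketch against the standard request and session DMVs) shows the current and last wait type for each active request:
-- Current and last wait type for each active user request
SELECT r.session_id,
       r.[status],
       r.wait_type,            -- NULL if not currently waiting
       r.wait_time,            -- ms spent in the current wait
       r.last_wait_type,
       r.blocking_session_id
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s ON s.session_id = r.session_id
WHERE s.is_user_process = 1
ORDER BY r.wait_time DESC;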
p.s. Why do you have maxdop set to 1?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 16, 2012 at 4:08 pm
MAXDOP is 1 and hyperthreading is off per the vendor's recommended settings. The workload on this server is 100% ad hoc SQL, so we have the instance set with 'optimize for ad hoc workloads' enabled, and the vendor has forced parameterization configured on the database.
Almost all of the running SPIDs have a last wait type of SOS_SCHEDULER_YIELD when they begin running.
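Both settings can be confirmed with something like this (a minimal sketch against the catalog views, not the vendor's own script):
-- Instance-level setting
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'optimize for ad hoc workloads';

-- Per-database parameterization (1 = forced)
SELECT name, is_parameterization_forced
FROM sys.databases;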
CPU utilization of the server is around 6 - 10% even during peak times.
When I run the query to check waits:
-- Script 21
-- Isolate top waits for server instance since last restart
-- or statistics clear
WITH Waits
AS ( SELECT wait_type ,
wait_time_ms / 1000. AS wait_time_s ,
100. * wait_time_ms / SUM(wait_time_ms) OVER ( ) AS pct ,
ROW_NUMBER() OVER ( ORDER BY wait_time_ms DESC ) AS rn
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ( 'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP',
'RESOURCE_QUEUE', 'SLEEP_TASK',
'SLEEP_SYSTEMTASK',
'SQLTRACE_BUFFER_FLUSH', 'WAITFOR',
'LOGMGR_QUEUE', 'CHECKPOINT_QUEUE',
'REQUEST_FOR_DEADLOCK_SEARCH',
'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
'BROKER_TASK_STOP',
'CLR_MANUAL_EVENT',
'CLR_AUTO_EVENT',
'DISPATCHER_QUEUE_SEMAPHORE',
'FT_IFTS_SCHEDULER_IDLE_WAIT',
'XE_DISPATCHER_WAIT',
'XE_DISPATCHER_JOIN', 'TRACEWRITE', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP' )
)
SELECT W1.wait_type ,
CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s ,
CAST(W1.pct AS DECIMAL(12, 2)) AS pct ,
CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2 ON W2.rn <= W1.rn
GROUP BY W1.rn ,
W1.wait_type ,
W1.wait_time_s ,
W1.pct
HAVING SUM(W2.pct) - W1.pct < 95 ; -- percentage threshold
-- Script 20
-- Total waits are wait_time_ms (high signal waits indicate CPU pressure)
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
AS NUMERIC(20,2)) AS [%signal (cpu) waits] ,
CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms)
/ SUM(wait_time_ms) AS NUMERIC(20, 2)) AS [%resource waits]
FROM sys.dm_os_wait_stats ;
I get:
wait_type   wait_time_s   pct     running_pct
OLEDB       279950.78     84.54   84.54
WRITELOG    34669.70      10.47   95.01
There is no linked server, but this server does have log shipping configured. Not sure whether that would produce an OLEDB wait type.
If I exclude OLEDB from the wait types, I get:
wait_type             wait_time_s   pct     running_pct
WRITELOG              34703.99      67.74   67.74
PAGEIOLATCH_SH        6198.52       12.10   79.83
PAGEIOLATCH_EX        1534.97       3.00    82.83
LCK_M_S               1435.73       2.80    85.63
BACKUPIO              1155.86       2.26    87.89
LCK_M_IX              950.55        1.86    89.74
ASYNC_IO_COMPLETION   813.37        1.59    91.33
BACKUPBUFFER          693.27        1.35    92.69
SOS_SCHEDULER_YIELD   663.80        1.30    93.98
IO_COMPLETION         571.06        1.11    95.10
I guess SOS_SCHEDULER_YIELD is not as big an issue as I thought, but it still seems suspicious.
Just found this from Kevin Kline:
http://sqlblog.com/blogs/kevin_kline/archive/2008/05/14/sos-scheduler-yield.aspx
Now looking into this:
-- Total waits are wait_time_ms
SELECT signal_wait_time_ms = SUM(signal_wait_time_ms) ,
       [%signal (cpu) waits] = CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20, 2)) ,
       resource_wait_time_ms = SUM(wait_time_ms - signal_wait_time_ms) ,
       [%resource waits] = CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20, 2))
FROM sys.dm_os_wait_stats ;
signal_wait_time_ms   %signal (cpu) waits   resource_wait_time_ms   %resource waits
572456296             14.19                 3462572833              85.81
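Since WRITELOG dominates once OLEDB is excluded, the log flush counters may also be worth a look; a minimal sketch (the LIKE is just to cover named instances, and the /sec counters are cumulative raw values that need two samples to turn into a rate):
-- Log flush activity per database
SELECT instance_name AS database_name,
       counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%:Databases%'
  AND counter_name IN ( 'Log Flushes/sec',
                        'Log Bytes Flushed/sec',
                        'Log Flush Wait Time',
                        'Log Flush Waits/sec' )
ORDER BY instance_name, counter_name;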
July 16, 2012 at 4:27 pm
b.hedge (7/16/2012)
MAXDOP is 1 and hyperthreading is off per the vendor's recommended settings.
Not sure I like that recommendation....
CPU utilization of the server is around 6 - 10% even during peak times.
Sounds like a server with a lot of spare CPU on it. From the waits, I'd be looking more at the IO subsystem than the CPUs.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 18, 2012 at 2:00 pm
Thanks for the reply. When we increase MAXDOP above 1, the application experiences an increase in deadlocks, and I think that is the reason for the vendor's setting.
I have also been looking into storage. I had been talking with the storage team and was assured that everything was fine on their side; the SAN guy said all of the reports on I/O and service times looked great. However, I was able to capture the following from sp_whoisactive (graciously provided by Adam Machanic) as the issue started to occur:
dd hh:mm:ss.mss   session_id   login_name   wait_info                 status      blocking_session_id
00 00:56:17.600   1641         cma          (493ms)ASYNC_NETWORK_IO   suspended   NULL
00 00:40:02.903   981          ebo          (905ms)ASYNC_NETWORK_IO   suspended   NULL
00 00:00:00.586   147          sqlUser      (544ms)WRITELOG           suspended   NULL
00 00:00:00.530   262          sqlUser      (488ms)WRITELOG           suspended   NULL
00 00:00:00.476   507          sqlUser      (437ms)WRITELOG           suspended   NULL
00 00:00:00.473   126          sqlUser      (430ms)WRITELOG           suspended   NULL
00 00:00:00.453   564          sqlUser      (416ms)WRITELOG           suspended   NULL
00 00:00:00.323   192          sqlUser      (279ms)WRITELOG           suspended   NULL
00 00:00:00.243   462          sqlUser      (205ms)WRITELOG           suspended   NULL
00 00:00:00.183   141          sqlUser      (140ms)WRITELOG           suspended   NULL
00 00:00:00.180   647          sqlUser      (145ms)WRITELOG           suspended   NULL
00 00:00:00.153   562          sqlUser      (115ms)WRITELOG           suspended   NULL
00 00:00:00.103   665          sqlUser      (69ms)WRITELOG            suspended   NULL
00 00:00:00.063   432          sqlUser      (22ms)WRITELOG            suspended   NULL
Looks like the SQL Server is waiting on writes to the log... I questioned the SAN guy further about the "average" latency of 5 ms. Their single sample interval is 60 seconds, and that figure is averaged over all samples in the time window. Needless to say, they were missing the issue, and I pointed out that for log writes I don't care what the average is; the storage must also handle bursts of traffic gracefully and flush writes to disk rapidly.
Found out that the log file LUNs for multiple servers were stored in the same disk group, which was ~75% subscribed. They also bypass the write cache and update the SAN disks directly. I have requested dedicated spindles for the SQL log file on this cluster. This should cut down the random log jams happening on this server.
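From the SQL Server side, the per-file write stalls tell a fuller story than a 60-second SAN-side average; a minimal sketch (cumulative since the last restart, not the vendor's tooling):
-- Write latency per log file since the last restart
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE 1.0 * vfs.io_stall_write_ms / vfs.num_of_writes
       END AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.[file_id]
WHERE mf.type_desc = 'LOG'
ORDER BY avg_write_stall_ms DESC;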
I am also attaching a cool picture of the number of connections on this server. Enjoy.
July 18, 2012 at 2:07 pm
b.hedge (7/18/2012)
Found out that the log file LUNs for multiple servers were stored in the same disk group, which was ~75% subscribed. They also bypass the write cache and update the SAN disks directly.
Ow and ow. Why bypass cache? Write cache is there to absorb the bursts and write steadily to the disks.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 19, 2012 at 11:20 am
GilaMonster (7/18/2012)
b.hedge (7/18/2012)
Found out that the log file LUNs for multiple servers were stored in the same disk group, which was ~75% subscribed. They also bypass the write cache and update the SAN disks directly.
Ow and ow. Why bypass cache? Write cache is there to absorb the bursts and write steadily to the disks.
Lots of unfortunate stuff coming out in this thread. b.hedge, if possible I recommend you get a performance tuning professional on board to do a quick review of your system - I expect there will be even more issues found than you have put out to us already!!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
July 20, 2012 at 8:54 am
Thanks Gail and Kevin... I worked with the storage team and got the log file moved to dedicated spindles. Once we did this, we were able to see an application process that began at noon and ran for 2.5 hours, writing inserts to the log and sustaining 500+ IOPS the whole time. This would have KILLED this system before, on the shared disk group on the SAN.
Let this be a lesson... when a SAN storage guy looks at I/O and says "you are only using 30 IOPS on this drive," that does not mean SQL Server does not need over 500 IOPS for 2.5 hours (it may simply be I/O bound and unable to push more through). Also, have him look at I/O contention on the disk group if the storage is shared.
I have moved on to additional issues in the environment. I originally attacked the large number of sessions coming from the app server, because when the WRITELOG waits occurred, more and more sessions would remain active as the app's connection pool was consumed. Now, with dedicated I/O for the SQL log files, we can move up the chain and resolve additional issues with this server/app.
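One rough way to see that kind of sustained write rate from inside SQL Server is to diff two samples of the file stats taken a known interval apart; a minimal sketch (the 60-second interval is arbitrary):
-- Approximate write IOPS per database file over a 60-second sample
DECLARE @before TABLE (database_id INT, [file_id] INT, num_of_writes BIGINT);

INSERT INTO @before
SELECT database_id, [file_id], num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:01:00';

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       (vfs.num_of_writes - b.num_of_writes) / 60.0 AS write_iops
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN @before AS b
  ON b.database_id = vfs.database_id AND b.[file_id] = vfs.[file_id]
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.[file_id]
ORDER BY write_iops DESC;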
Thanks again Gail and Kevin.
July 20, 2012 at 8:59 am
If you haven't already, read at the very least chapter 1 of http://www.simple-talk.com/books/sql-books/troubleshooting-sql-server-a-guide-for-the-accidental-dba/
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability