July 16, 2012 at 3:32 pm
SQL 2008 R2 STANDARD RTM
4 Socket (6 Core Westmere) - Hyperthreading Disabled (24 Cores Total)
128 GB RAM
Max Dop 1
SELECT state, COUNT(*) AS 'Count of State'
FROM sys.dm_os_workers
GROUP BY state
state Count of State
RUNNING 16
SUSPENDED 126
Why do most workers in sys.dm_os_workers show SUSPENDED and none say RUNNABLE?
-- Check SQL Server Schedulers to see if they are waiting on CPU
SELECT scheduler_id, current_tasks_count, runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255 AND runnable_tasks_count <> 0
When you refresh this query, it shows scheduler_ids that have a runnable_tasks_count of 1. Is this a sign of blocking?
Performance seems like it could be much better on this server. There are 1000+ sleeping connections, with only about 5 sessions active at any point in time.
Can anyone assist with troubleshooting the SOS_SCHEDULER_YIELD wait type? Please let me know if I can provide additional information.
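For reference, the sleeping vs. active numbers can be checked with something like this (a minimal sketch using sys.dm_exec_sessions, not part of the vendor's scripts):
-- Count user sessions by status (sleeping = connected but idle)
SELECT s.[status], COUNT(*) AS session_count
FROM sys.dm_exec_sessions AS s
WHERE s.is_user_process = 1
GROUP BY s.[status]
ORDER BY session_count DESC;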
____________________________________________
b.hedge
@hedgeb
July 16, 2012 at 3:39 pm
b.hedge (7/16/2012)
Why do most workers in sys.dm_os_workers show SUSPENDED and none say RUNNABLE?
They're waiting for some resource. Can't tell what without seeing the wait types.
When you refresh this query, it shows scheduler_ids that have a runnable_tasks_count of 1. Is this a sign of blocking?
No.
Performance seems like it could be much better on this server. There are 1000+ sleeping connections, with only about 5 sessions active at any point in time.
Sleeping means doing nothing. They're not running queries, the connections are open but nothing's happening.
Can anyone assist with troubleshooting the SOS_SCHEDULER_YIELD wait type? Please let me know if I can provide additional information.
Do you have a lot of sessions with that as their last wait type?
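For example, something like this (a minimal sketch against the standard request and session DMVs) shows the current and last wait type for each active request:
-- Current and last wait type for each active user request
SELECT r.session_id,
       r.[status],
       r.wait_type,            -- NULL if not currently waiting
       r.wait_time,            -- ms spent in the current wait
       r.last_wait_type,
       r.blocking_session_id
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s ON s.session_id = r.session_id
WHERE s.is_user_process = 1
ORDER BY r.wait_time DESC;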
p.s. Why do you have maxdop set to 1?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 16, 2012 at 4:08 pm
MAXDOP is 1 and hyperthreading is off per the vendor's recommended settings. The workload on this server is 100% ad hoc SQL, so we have the instance set with 'optimize for ad hoc workloads' enabled, and the vendor has forced parameterization configured on the database.
Almost all of the running SPIDs have a last wait type of SOS_SCHEDULER_YIELD when they begin running.
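Both settings can be confirmed with something like this (a minimal sketch against the catalog views, not the vendor's own script):
-- Instance-level setting
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'optimize for ad hoc workloads';

-- Per-database parameterization (1 = forced)
SELECT name, is_parameterization_forced
FROM sys.databases;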
CPU utilization of the server is around 6 - 10% even during peak times.
When I run the query to check waits:
-- Script 21
-- Isolate top waits for server instance since last restart
-- or statistics clear
WITH Waits
AS ( SELECT wait_type ,
wait_time_ms / 1000. AS wait_time_s ,
100. * wait_time_ms / SUM(wait_time_ms) OVER ( ) AS pct ,
ROW_NUMBER() OVER ( ORDER BY wait_time_ms DESC ) AS rn
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ( 'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP',
'RESOURCE_QUEUE', 'SLEEP_TASK',
'SLEEP_SYSTEMTASK',
'SQLTRACE_BUFFER_FLUSH', 'WAITFOR',
'LOGMGR_QUEUE', 'CHECKPOINT_QUEUE',
'REQUEST_FOR_DEADLOCK_SEARCH',
'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
'BROKER_TASK_STOP',
'CLR_MANUAL_EVENT',
'CLR_AUTO_EVENT',
'DISPATCHER_QUEUE_SEMAPHORE',
'FT_IFTS_SCHEDULER_IDLE_WAIT',
'XE_DISPATCHER_WAIT',
'XE_DISPATCHER_JOIN', 'TRACEWRITE', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP' )
)
SELECT W1.wait_type ,
CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s ,
CAST(W1.pct AS DECIMAL(12, 2)) AS pct ,
CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2 ON W2.rn <= W1.rn
GROUP BY W1.rn ,
W1.wait_type ,
W1.wait_time_s ,
W1.pct
HAVING SUM(W2.pct) - W1.pct < 95 ; -- percentage threshold
-- Script 20
-- Total waits are wait_time_ms (high signal waits indicate CPU pressure)
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
AS NUMERIC(20,2)) AS [%signal (cpu) waits] ,
CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms)
/ SUM(wait_time_ms) AS NUMERIC(20, 2)) AS [%resource waits]
FROM sys.dm_os_wait_stats ;
I get:
wait_type   wait_time_s   pct     running_pct
OLEDB       279950.78     84.54   84.54
WRITELOG    34669.70      10.47   95.01
There is no linked server, but this server does have log shipping configured. Not sure whether that would produce an OLEDB wait type.
If I exclude OLEDB from the wait types, I get:
wait_type             wait_time_s   pct     running_pct
WRITELOG              34703.99      67.74   67.74
PAGEIOLATCH_SH        6198.52       12.10   79.83
PAGEIOLATCH_EX        1534.97       3.00    82.83
LCK_M_S               1435.73       2.80    85.63
BACKUPIO              1155.86       2.26    87.89
LCK_M_IX              950.55        1.86    89.74
ASYNC_IO_COMPLETION   813.37        1.59    91.33
BACKUPBUFFER          693.27        1.35    92.69
SOS_SCHEDULER_YIELD   663.80        1.30    93.98
IO_COMPLETION         571.06        1.11    95.10
I guess SOS_SCHEDULER_YIELD is not as big an issue as I thought, but it still seems suspicious.
Just found this from Kevin Kline:
http://sqlblog.com/blogs/kevin_kline/archive/2008/05/14/sos-scheduler-yield.aspx
Now looking into this:
-- Total waits are wait_time_ms
SELECT signal_wait_time_ms = SUM(signal_wait_time_ms) ,
       [%signal (cpu) waits] = CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20, 2)) ,
       resource_wait_time_ms = SUM(wait_time_ms - signal_wait_time_ms) ,
       [%resource waits] = CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20, 2))
FROM sys.dm_os_wait_stats ;
signal_wait_time_ms   %signal (cpu) waits   resource_wait_time_ms   %resource waits
572456296             14.19                 3462572833              85.81
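Since WRITELOG dominates once OLEDB is excluded, the log flush counters may also be worth a look; a minimal sketch (the LIKE is just to cover named instances, and the /sec counters are cumulative raw values that need two samples to turn into a rate):
-- Log flush activity per database
SELECT instance_name AS database_name,
       counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%:Databases%'
  AND counter_name IN ( 'Log Flushes/sec',
                        'Log Bytes Flushed/sec',
                        'Log Flush Wait Time',
                        'Log Flush Waits/sec' )
ORDER BY instance_name, counter_name;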
July 16, 2012 at 4:27 pm
b.hedge (7/16/2012)
MAXDOP is 1 and hyperthreading is off per the vendor's recommended settings.
Not sure I like that recommendation....
CPU utilization of the server is around 6 - 10% even during peak times.
Sounds like a server with a lot of spare CPU on it. From the waits, I'd be looking more at the IO subsystem than the CPUs.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 18, 2012 at 2:00 pm
Thanks for the reply. When we increase MAXDOP above 1, the application experiences an increase in deadlocks, and I think that is the reason for the vendor's setting.
I have also been looking into storage. I had been talking with the storage team and was assured that everything was fine on their side; the SAN guy said all of the reports on I/O and service times looked great. However, I was able to capture the following from sp_whoisactive (graciously provided by Adam Machanic) as the issue started to occur:
dd hh:mm:ss.mss   session_id   login_name   wait_info                 status      blocking_session_id
00 00:56:17.600   1641         cma          (493ms)ASYNC_NETWORK_IO   suspended   NULL
00 00:40:02.903   981          ebo          (905ms)ASYNC_NETWORK_IO   suspended   NULL
00 00:00:00.586   147          sqlUser      (544ms)WRITELOG           suspended   NULL
00 00:00:00.530   262          sqlUser      (488ms)WRITELOG           suspended   NULL
00 00:00:00.476   507          sqlUser      (437ms)WRITELOG           suspended   NULL
00 00:00:00.473   126          sqlUser      (430ms)WRITELOG           suspended   NULL
00 00:00:00.453   564          sqlUser      (416ms)WRITELOG           suspended   NULL
00 00:00:00.323   192          sqlUser      (279ms)WRITELOG           suspended   NULL
00 00:00:00.243   462          sqlUser      (205ms)WRITELOG           suspended   NULL
00 00:00:00.183   141          sqlUser      (140ms)WRITELOG           suspended   NULL
00 00:00:00.180   647          sqlUser      (145ms)WRITELOG           suspended   NULL
00 00:00:00.153   562          sqlUser      (115ms)WRITELOG           suspended   NULL
00 00:00:00.103   665          sqlUser      (69ms)WRITELOG            suspended   NULL
00 00:00:00.063   432          sqlUser      (22ms)WRITELOG            suspended   NULL
Looks like the SQL Server is waiting on writes to the log... I questioned the SAN guy further about the "average" latency of 5 ms. Their single sample interval is 60 seconds, and that figure is averaged over all samples in the time window. Needless to say, they were missing the issue, and I pointed out that for log writes I don't care what the average is; the storage must also handle bursts of traffic gracefully and flush writes to disk rapidly.
Found out that the log file LUNs for multiple servers were stored in the same disk group, which was ~75% subscribed. They also bypass the write cache and update the SAN disks directly. I have requested dedicated spindles for the SQL log file on this cluster. This should cut down the random log jams happening on this server.
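From the SQL Server side, the per-file write stalls tell a fuller story than a 60-second SAN-side average; a minimal sketch (cumulative since the last restart, not the vendor's tooling):
-- Write latency per log file since the last restart
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE 1.0 * vfs.io_stall_write_ms / vfs.num_of_writes
       END AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.[file_id]
WHERE mf.type_desc = 'LOG'
ORDER BY avg_write_stall_ms DESC;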
I am also attaching a cool picture of the number of connections on this server. Enjoy.
July 18, 2012 at 2:07 pm
b.hedge (7/18/2012)
Found out that the log file LUNs for multiple servers were stored in the same disk group, which was ~75% subscribed. They also bypass the write cache and update the SAN disks directly.
Ow and ow. Why bypass cache? Write cache is there to absorb the bursts and write steadily to the disks.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
July 19, 2012 at 11:20 am
GilaMonster (7/18/2012)
b.hedge (7/18/2012)
Found out that the log file LUNs for multiple servers were stored in the same disk group, which was ~75% subscribed. They also bypass the write cache and update the SAN disks directly.
Ow and ow. Why bypass cache? Write cache is there to absorb the bursts and write steadily to the disks.
Lots of unfortunate stuff coming out in this thread. b.hedge, if possible I recommend you get a performance tuning professional on board to do a quick review of your system - I expect there will be even more issues found than you have put out to us already!!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
July 20, 2012 at 8:54 am
Thanks Gail and Kevin... I worked with the storage team and got the log file moved to dedicated spindles. Once we did this, we were able to see an application process that began at noon and ran for 2.5 hours, writing inserts to the log and sustaining 500+ IOPS the whole time. This would have KILLED this system before, on the shared disk group on the SAN.
Let this be a lesson... when a SAN storage guy looks at I/O and says "you are only using 30 IOPS on this drive," that does not mean SQL Server does not need over 500 IOPS for 2.5 hours (it may simply be I/O bound and unable to push more through). Also, have him look at I/O contention on the disk group if the storage is shared.
I have moved on to additional issues in the environment. I originally attacked the large number of sessions coming from the app server, because when the WRITELOG waits occurred, more and more sessions would remain active as the app's connection pool was consumed. Now, with dedicated I/O for the SQL log files, we can move up the chain and resolve additional issues with this server/app.
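One rough way to see that kind of sustained write rate from inside SQL Server is to diff two samples of the file stats taken a known interval apart; a minimal sketch (the 60-second interval is arbitrary):
-- Approximate write IOPS per database file over a 60-second sample
DECLARE @before TABLE (database_id INT, [file_id] INT, num_of_writes BIGINT);

INSERT INTO @before
SELECT database_id, [file_id], num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:01:00';

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       (vfs.num_of_writes - b.num_of_writes) / 60.0 AS write_iops
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN @before AS b
  ON b.database_id = vfs.database_id AND b.[file_id] = vfs.[file_id]
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.[file_id]
ORDER BY write_iops DESC;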
Thanks again Gail and Kevin.
July 20, 2012 at 8:59 am
If you haven't already, read at the very least chapter 1 of http://www.simple-talk.com/books/sql-books/troubleshooting-sql-server-a-guide-for-the-accidental-dba/
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability