WAIT TYPE SOS_SCHEDULER_YIELD - Scheduler / Worker Issue SQLOS?

  • SQL 2008 R2 STANDARD RTM

    4 Socket (6 Core Westmere) - Hyperthreading Disabled (24 Cores Total)

    128 GB RAM

    Max Dop 1

    SELECT state, COUNT(*) AS 'Count of State'

    FROM sys.dm_os_workers

    GROUP BY state

    state Count of State

    RUNNING 16

    SUSPENDED 126

    Why do most workers in sys.dm_os_workers show SUSPENDED and none say RUNNABLE?

    -- Check SQL Server Schedulers to see if they are waiting on CPU

    SELECT scheduler_id, current_tasks_count, runnable_tasks_count

    FROM sys.dm_os_schedulers

    WHERE scheduler_id < 255 AND runnable_tasks_count <> 0

    When you refresh this query it shows scheduler_id's that have a runnable_tasks_count of 1. Is this a sign of blocking?

    Performance seems like it could be much better on this server. There are over 1000+ sleeping connections with ~5 sessions that become active at a point in time.

    Can anyone assist with troubleshooting the WAIT TYPE SOS_SCHEDULER_YIELD? Please let me know if I can provide additional information.

    ____________________________________________

    b.hedge

    @hedgeb

  • b.hedge (7/16/2012)


    Why do most workers in sys.dm_os_workers show SUSPENDED and none say RUNNABLE?

    They're waiting for some resource. Can't tell what without seeing wait types

    When you refresh this query it shows scheduler_id's that have a runnable_tasks_count of 1. Is this a sign of blocking?

    No.

    Performance seems like it could be much better on this server. There are over 1000+ sleeping connections with ~5 sessions that become active at a point in time.

    Sleeping means doing nothing. They're not running queries, the connections are open but nothing's happening.

    Can anyone assist with troubleshooting the WAIT TYPE SOS_SCHEDULER_YIELD? Please let me know if I can provide additional information.

    Do you have a lot of things with a last wait of that?

    p.s. Why do you have maxdop set to 1?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • MAX DOP is 1 and Hyperthreading is off per the vendor recommended settings. This server is 100% Ad Hoc SQL for its workload. We have SQL tuned for Optimize AdHoc Workloads = True. The Vendor has forced paramaterization configured on the database.

    Most all of the running SPIDs have a Last Wait Type of SOS_SCHEDULER_YIELD when they begin running.

    CPU utilization of the server is around 6 - 10% even during peak times.

    When I run the query to check waits:

    -- Script 21

    -- Isolate top waits for server instance since last restart

    -- or statistics clear

    WITH Waits

    AS ( SELECT wait_type ,

    wait_time_ms / 1000. AS wait_time_s ,

    100. * wait_time_ms / SUM(wait_time_ms) OVER ( ) AS pct ,

    ROW_NUMBER() OVER ( ORDER BY wait_time_ms DESC ) AS rn

    FROM sys.dm_os_wait_stats

    WHERE wait_type NOT IN ( 'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP',

    'RESOURCE_QUEUE', 'SLEEP_TASK',

    'SLEEP_SYSTEMTASK',

    'SQLTRACE_BUFFER_FLUSH', 'WAITFOR',

    'LOGMGR_QUEUE', 'CHECKPOINT_QUEUE',

    'REQUEST_FOR_DEADLOCK_SEARCH',

    'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',

    'BROKER_TASK_STOP',

    'CLR_MANUAL_EVENT',

    'CLR_AUTO_EVENT',

    'DISPATCHER_QUEUE_SEMAPHORE',

    'FT_IFTS_SCHEDULER_IDLE_WAIT',

    'XE_DISPATCHER_WAIT',

    'XE_DISPATCHER_JOIN', 'TRACEWRITE', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP' )

    )

    SELECT W1.wait_type ,

    CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s ,

    CAST(W1.pct AS DECIMAL(12, 2)) AS pct ,

    CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct

    FROM Waits AS W1

    INNER JOIN Waits AS W2 ON W2.rn <= W1.rn

    GROUP BY W1.rn ,

    W1.wait_type ,

    W1.wait_time_s ,

    W1.pct

    HAVING SUM(W2.pct) - W1.pct < 95 ; -- percentage threshold

    -- Script 20

    -- Total waits are wait_time_ms (high signal waits indicates CPU pressure)

    SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)

    AS NUMERIC(20,2)) AS [%signal (cpu) waits] ,

    CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms)

    / SUM(wait_time_ms) AS NUMERIC(20, 2)) AS [%resource waits]

    FROM sys.dm_os_wait_stats ;

    I get:

    wait_typewait_time_spctrunning_pct

    OLEDB 279950.7884.5484.54

    WRITELOG34669.70 10.4795.01

    There is not a Linked server but this server does have Log shipping. Not sure if that would present an OLEDB wait_type.

    If I exclude the OLEDB from the list, I get:

    wait_typewait_time_spctrunning_pct

    WRITELOG34703.9967.7467.74

    PAGEIOLATCH_SH6198.5212.1079.83

    PAGEIOLATCH_EX1534.973.0082.83

    LCK_M_S1435.732.8085.63

    BACKUPIO1155.862.2687.89

    LCK_M_IX950.551.8689.74

    ASYNC_IO_COMPLETION813.371.5991.33

    BACKUPBUFFER693.271.3592.69

    SOS_SCHEDULER_YIELD663.801.3093.98

    IO_COMPLETION571.061.1195.10

    I guess the SOS_SCHEDULER_YIELD is not as big of an issue as I thought but it still seems suspicious.

    Just found this from Kevin Kline:

    http://sqlblog.com/blogs/kevin_kline/archive/2008/05/14/sos-scheduler-yield.aspx

    Now looking into this:

    ---- Total waits are wait_time_ms

    Select signal_wait_time_ms=sum(signal_wait_time_ms)

    ,'%signal (cpu) waits' = cast(100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2))

    ,resource_wait_time_ms=sum(wait_time_ms - signal_wait_time_ms)

    ,'%resource waits'= cast(100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2))

    From sys.dm_os_wait_stats

    signal_wait_time_ms%signal (cpu) waitsresource_wait_time_ms%resource waits

    572456296 14.19 3462572833 85.81

  • b.hedge (7/16/2012)


    MAX DOP is 1 and Hyperthreading is off per the vendor recommended settings.

    Not sure I like that recommendation....

    CPU utilization of the server is around 6 - 10% even during peak times.

    Sounds like a server with a lot of spare CPU on it. From the waits, I'd be looking more at the IO subsystem than the CPUs

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks for the reply. When we increase MAX DOP above 1, the application experiences an increase in deadlocks and I think that is the reason for their setting.

    Also, I have been looking into storage too. I had been talking with the storage team and was assured that everything was great with storage. SAN guy said all of the reports on I/O and service times looked great. However, I was able to capture the following from the sp_whoisactive (provided graciously by Adam Machanic) while the issue started to occur:

    dd hh:mm:ss.msssession_idlogin_namewait_infostatusCPUreadswritestempdb_allocationstempdb_currentblocking_session_idphysical_readsused_memoryopen_tran_count

    00 00:56:17.6001641cma(493ms)ASYNC_NETWORK_IOsuspended46557000NULL06880

    00 00:40:02.903981ebo(905ms)ASYNC_NETWORK_IOsuspended31557000NULL07340

    00 00:00:00.586147sqlUser(544ms)WRITELOGsuspended67061,6464700NULL121

    00 00:00:00.530262sqlUser(488ms)WRITELOGsuspended013000NULL021

    00 00:00:00.476507sqlUser(437ms)WRITELOGsuspended03000NULL021

    00 00:00:00.473126sqlUser(430ms)WRITELOGsuspended030100NULL021

    00 00:00:00.453564sqlUser(416ms)WRITELOGsuspended014000NULL021

    00 00:00:00.323192sqlUser(279ms)WRITELOGsuspended026600NULL021

    00 00:00:00.243462sqlUser(205ms)WRITELOGsuspended013000NULL021

    00 00:00:00.183141sqlUser(140ms)WRITELOGsuspended064000NULL021

    00 00:00:00.180647sqlUser(145ms)WRITELOGsuspended013000NULL021

    00 00:00:00.153562sqlUser(115ms)WRITELOGsuspended014200NULL021

    00 00:00:00.103665sqlUser(69ms)WRITELOGsuspended027700NULL021

    00 00:00:00.063432sqlUser(22ms)WRITELOGsuspended014200NULL021

    Looks like the SQL server is waiting on writing to the log... I questioned the SAN guy further on the "Average" latency of 5 ms. The single sample interval is 60 seconds and is averaged over all the samples in the time window. Needless to say, they were missing the issue and I stated that for LOG writes, I do not care what the average is... it must handle the burst of traffic gracefully too and flush the writes to disk rapidly.

    Found out that all Log file LUNs for multiple server were stored in the same disk group and was ~75 % subscribed. They also bypass cache and update the SAN disks directly. I have made a request for dedicated spindles for the SQL log file on this cluster. This should cut down the random log jam that is happening on this server.

    I am also attaching a cool picture of the number of connections on this server. Enjoy.

  • b.hedge (7/18/2012)


    Found out that all Log file LUNs for multiple server were stored in the same disk group and was ~75 % subscribed. They also bypass cache and update the SAN disks directly.

    Ow and ow. Why bypass cache? Write cache is there to absorb the bursts and write steadily to the disks.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (7/18/2012)


    b.hedge (7/18/2012)


    Found out that all Log file LUNs for multiple server were stored in the same disk group and was ~75 % subscribed. They also bypass cache and update the SAN disks directly.

    Ow and ow. Why bypass cache? Write cache is there to absorb the bursts and write steadily to the disks.

    Lots of unfortunate stuff coming out in this thread. b.hedge, if possible I recommend you get a performance tuning professional on board to do a quick review of your system - I expect there will be even more issues found than you have put out to us already!!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks Gail and Kevin... I worked withe the storage team and got the log file moved to dedicated spindles. Once we did this, we were able to see a process that was being executed in the application and began at noon and ran for 2.5 hours. This task was writing inserts to the log and using 500+ IOPS sustained for 2.5 hours. This would have KILLED this system before on the shared disk group on the SAN.

    Let this be a lesson... when a SAN storage guy looks at I/O and says "You are only using 30 IOPS on this drive"... that does not mean that SQL server does not require over 500 IOPS for 2.5 hours (because SQL is I/O bound). Also, have him look at I/O contention on the disk group if the I/O is shared.

    I have moved on to additional issues in the environment. I originally attacked the large number of sessions coming from the App server, because when the WRITE LOG waits occurred, more and more sessions would remain active as the App's Pool of connections was used. Now with the dedicated I/O on the SQL Log files, we can move up the chain to resolve additional issues with this server / app.

    Thanks again Gail and Kevin.

  • If you haven't already, chapter 1 at very least of http://www.simple-talk.com/books/sql-books/troubleshooting-sql-server-a-guide-for-the-accidental-dba/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply