
Troubleshooting THREADPOOL Waits


Today I want to talk about a specific wait type that can sometimes be very hard to troubleshoot: the THREADPOOL wait type. This wait type is specific to the internal thread scheduling mechanism inside SQL Server.

As you might know, SQL Server internally uses its own "operating system" to implement thread scheduling and memory management – the SQLOS. The SQLOS provides a set of worker threads that are used to execute the queries submitted to SQL Server. The problem is that those worker threads can sometimes be exhausted – maybe because of a locking/blocking scenario. In that case SQL Server isn't able to execute any more requests inside the engine, because no free worker threads are available anymore.

Through the max worker threads option (via sp_configure) you can configure how many worker threads are available to the SQLOS. By default this option is 0, which means SQL Server itself decides how many worker threads to use. The number of available worker threads depends on the processor architecture (x32, x64) and the number of CPUs you have available. Books Online (see http://msdn.microsoft.com/en-us/library/ms187024.aspx) has the following table that describes the various possible combinations:

Number of CPUs     x32    x64
<= 4 processors    256    512
8 processors       288    576
16 processors      352    704
32 processors      480    960
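The numbers in this table follow a simple pattern, described in the Books Online article linked above: up to 4 CPUs you get the base value (256 on x32, 512 on x64), and every additional CPU adds 8 (x32) respectively 16 (x64) worker threads. You can verify the 8-CPU x64 row with a quick calculation:

-- Default max worker threads on x64 with more than 4 CPUs:
-- 512 + (number of CPUs - 4) * 16 (x32 uses 256 and 8 instead)
SELECT 512 + (8 - 4) * 16 AS DefaultMaxWorkerThreads -- 576, matching the table
GO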

You can also check the column max_workers_count in sys.dm_os_sys_info to see how many worker threads your SQL Server instance is using. With the following example I now want to demonstrate how you can run into thread starvation in SQL Server and how you can resolve it.
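The check itself is a simple one-liner against the DMV:

-- Number of worker threads available to this instance
SELECT max_workers_count FROM sys.dm_os_sys_info
GO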

CAUTION: DON'T DO THE FOLLOWING STEPS ON A PRODUCTION SYSTEM!!!

In the first step we create a new database and a simple table for our sample scenario. I want to be as unique as possible; therefore I use unique table and column names πŸ˜‰

USE master

GO

CREATE DATABASE ThreadPoolWaits

GO

USE ThreadPoolWaits

GO

-- Create a new test table (this one will be unique on earth - hopefully...)

CREATE TABLE [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49]

(

[MyUniqueColumnName1_F67DAC4A-C202-49BB-829A-071130BF1160]

INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,

[MyUniqueColumnName2_438B7184-B476-48A4-B5FA-DC34B99FA0A4]

INT

)

GO

-- Insert a record

INSERT INTO [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49]

VALUES (1)

GO

As you can see from the previous listing, our table definition is very simple. In

the next step I'm creating a new stored procedure that encapsulates some read workload

inside that database.

-- Create a stored procedure that encapsulates a read workload

CREATE PROCEDURE MyCustomUniqueStoredProcedureName_ReadWorkload

AS

BEGIN

SELECT * FROM [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49]

END

GO

Finally we begin a new transaction, make an update to the previously created table, and never commit that transaction:

-- Begin a transaction that never commits...

BEGIN TRANSACTION

UPDATE [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49]

WITH (TABLOCKX)

SET [MyUniqueColumnName2_438B7184-B476-48A4-B5FA-DC34B99FA0A4] = 2

GO

By now you have a pending transaction inside your SQL Server instance that holds an exclusive table lock because of the TABLOCKX query hint. When you now execute the previously created stored procedure from a different session, the stored procedure waits, because it needs to acquire a shared lock to read the record:

EXEC MyCustomUniqueStoredProcedureName_ReadWorkload

GO

You can also check this locking/blocking scenario through the DMV sys.dm_tran_locks, which will show you a waiting request:

SELECT

resource_associated_entity_id,

request_mode,

request_status,

request_session_id

FROM sys.dm_tran_locks

WHERE resource_database_id = DB_ID('ThreadPoolWaits')

AND resource_type = 'OBJECT'

GO

In this simple scenario with just one waiting query inside SQL Server, nothing special occurs. But how will SQL Server react when you submit more concurrent queries than there are worker threads? Let's try it. For this task I'm using the ostress.exe utility, which is part of the RML Utilities that Microsoft provides for free (see here).

In my configuration (x64, 8 CPUs) SQL Server internally uses 576 worker threads. So I'm simulating 600 concurrent connections to SQL Server with ostress.exe through the following command line:

ostress.exe

-Q"EXEC ThreadPoolWaits.dbo.MyCustomUniqueStoredProcedureName_ReadWorkload"

-n600

When you execute that command, it takes a few seconds until ostress.exe has created the 600 worker threads, and nothing special happens. So far, so good. Let's now analyze the situation and create a new connection through SQL Server Management Studio to your SQL Server instance. Oops, the connection can't be made:

SQL Server isn't responding anymore! This makes sense, because we have now exhausted the maximum available worker threads. Almost all requests submitted to SQL Server are currently waiting for a shared lock (the LCK_M_S wait type), and all the other ones remain queued inside SQL Server, because no worker threads are available anymore (the THREADPOOL wait type). But how can we troubleshoot that scenario now? Restarting SQL Server isn't always really an option…

Fortunately, since SQL Server 2005 Microsoft provides the so-called Dedicated Admin Connection (DAC). Through this connection you are able to log into SQL Server even when you have worker thread starvation or high memory pressure, because the DAC has its own

  • Scheduler
  • Memory Node
  • TCP Port

inside the SQLOS. Therefore SQL Server is able to accept and serve the DAC connection – even in highly sophisticated troubleshooting scenarios like this one. But there is only one DAC available for the whole SQL Server instance, which must also be taken into account! When you want to connect through the DAC, you have to use the syntax admin:<servername>, where <servername> is the name of your SQL Server instance. So let's start up a new instance of SQL Server Management Studio and log into SQL Server through the DAC. Please be aware that you can't connect Object Explorer through the DAC, because the DAC isn't supported for Object Explorer. You can only use a simple query window that connects through the DAC:

When you have successfully connected through the DAC, you are able to run your diagnostic queries. You must also be aware that the DAC doesn't support Auto Completion, because Auto Completion uses its own connection in the background – so you have to know the DMVs you want to use for troubleshooting πŸ˜‰

In the first step we can check in sys.dm_exec_requests which requests are currently waiting inside SQL Server:

SELECT

r.command,

r.sql_handle,

r.plan_handle,

r.wait_type,

r.wait_resource,

r.wait_time,

r.session_id,

r.blocking_session_id

FROM sys.dm_exec_requests r

INNER JOIN sys.dm_exec_sessions s ON r.session_id = s.session_id

WHERE s.is_user_process = 1

GO

In my configuration this query returns 547 rows, which means 547 user requests (WHERE s.is_user_process = 1) are currently waiting inside SQL Server. But we started our stored procedure with 600 concurrent users… sys.dm_exec_requests shows you only those requests that have an underlying worker thread inside SQL Server, because those requests are currently executing inside SQL Server. But where are the others? Those other pending requests are only visible through sys.dm_os_waiting_tasks – they have a wait type of THREADPOOL:

SELECT * FROM sys.dm_os_waiting_tasks

WHERE wait_type = 'THREADPOOL'

GO
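If you want a quick overview of how bad the starvation is, you can also count the waiting tasks per wait type – a small sketch against the same DMV:

-- How many tasks wait for a worker thread vs. for the shared lock?
SELECT wait_type, COUNT(*) AS waiting_tasks
FROM sys.dm_os_waiting_tasks
WHERE wait_type IN ('THREADPOOL', 'LCK_M_S')
GROUP BY wait_type
GO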

They are just waiting until a thread from the worker pool becomes free. But in our scenario every thread is currently suspended and bound to a user request, therefore those requests will wait forever! You will also see the THREADPOOL wait type only in sys.dm_os_waiting_tasks and never in sys.dm_exec_requests, because a request in sys.dm_exec_requests is already bound to a worker thread inside SQL Server. When you look back at the output of sys.dm_exec_requests, you can also see the columns session_id and blocking_session_id at the end of the result set. These two columns show you the blocking chain inside SQL Server:

As you can see, almost every session has a blocking_session_id of 56, and session_id 56 has a blocking_session_id of 52. Session_id 52 is our head blocker! Let's further analyze the session of the head blocker:

-- Analyze the head blocker session

SELECT

login_time,

[host_name],

[program_name],

login_name

FROM sys.dm_exec_sessions

WHERE session_id = 52

GO

-- Analyze the head blocker connection

SELECT

connect_time,

client_tcp_port,

most_recent_sql_handle

FROM sys.dm_exec_connections

WHERE session_id = 52

GO
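By the way, instead of eyeballing the blocking chain, you can also let a query surface the head blocker directly – a sketch that looks for sessions that block others but aren't blocked themselves:

-- Sessions that block others but are not blocked themselves = head blockers
SELECT DISTINCT r.blocking_session_id AS head_blocker_session_id
FROM sys.dm_exec_requests r
WHERE r.blocking_session_id <> 0
AND r.blocking_session_id NOT IN
(
    SELECT session_id FROM sys.dm_exec_requests
    WHERE blocking_session_id <> 0
)
GO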

The most interesting column here is most_recent_sql_handle from sys.dm_exec_connections, which we can use to retrieve the executed SQL statement. When you use the DMF sys.dm_exec_sql_text and pass in the value of the most_recent_sql_handle column, you get back the executed SQL statement:

SELECT [text] FROM sys.dm_exec_sql_text(0x01001A0015BE5D3170CC4483000000000000000000000000)

GO

This SELECT statement will return the following string:

-- Begin a transaction that never commits... BEGIN TRANSACTION UPDATE [SomeCrazyUniqueTableName_6EFF088F-443B-4EBC-A4C7-9FC146D2EE49] WITH (TABLOCKX) SET [MyUniqueColumnName2_438B7184-B476-48A4-B5FA-DC34B99FA0A4] = 2
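The handle in the query above is of course specific to my instance. As a small sketch, you can also resolve the statement directly from the connection without copying the handle by hand:

-- Resolve the last statement of the head blocker via CROSS APPLY
SELECT st.[text]
FROM sys.dm_exec_connections c
CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) st
WHERE c.session_id = 52
GO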

This is our initial query where we left the transaction open. By now we have tracked down the problematic query that led to the THREADPOOL and LCK_M_S waits, and finally we can kill that session through the DAC:

KILL 52

GO

It takes several seconds until the session is killed, and finally our blocking scenario is gone. By now your SQL Server is again able to accept new connections and works in the usual way. When you are done with troubleshooting through the DAC, don't forget to close that special connection, because there is only one DAC available for the whole SQL Server instance! When you afterwards look into the SQL Server Error Log, you will also see a message like the following one:

New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 1680 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%. System Idle: 96%.

This is also an indication that you had worker thread starvation inside your SQL Server instance. As a side effect, this scenario has also led to so-called Deadlocked Schedulers, which Amit Banerjee describes in more detail here. When SQL Server encounters Deadlocked Schedulers, it writes a stack dump to your SQL Server LOG directory. You can also see a Deadlocked Scheduler message inside the SQL Server Error Log:

Sometimes I see customers who just blindly reconfigure the max worker threads setting inside SQL Server, because they think they need more worker threads for their workload. But as with almost every problem in SQL Server, there is some root cause behind the problem you are currently seeing. In our scenario the root cause was an uncommitted transaction, which led to a blocking scenario, which led to thread starvation, which finally led to an unresponsive SQL Server. As you can see from this explanation, there can be a very long chain until you find your root cause – so keep that in mind for your next troubleshooting scenarios.

To make it easy for you to reproduce that special scenario, you can download the needed

scripts from here.

Thanks for reading!

-Klaus
