In today’s blog posting I want to talk more about the Wait Statistics in SQL Server and how they can help you understand why your SQL Server is currently slow. For me the Wait Statistics are the most important concept in SQL Server when it comes to performance troubleshooting.
Why Queries are Waiting
Every time you execute a query in SQL Server, the query always has to wait. What? A query has always to wait? Yes, you read correctly: when you execute a query, the query always has to wait. And the reasons why a query has to wait are tracked by SQL Server through the so-called Wait Statistics. Before I get down into the details of the Wait Statistics itself, I want to talk about the reasons that queries always have to wait in SQL Server. Query waits occur in SQL Server for 2 reasons:
- Resource Waits
- Cooperative Scheduling Waits
Let’s have a more detailed look at both types of waits. A Resource Wait occurs when you wait on an external resource. I want to give you here some examples. Every time a query requests a page from the Buffer Pool, and the page is currently not cached, the Buffer Manager has to perform an asynchronous I/O operation to your storage to read the page from your physical storage into the Buffer Pool. And access to your physical storage is very slow. For this reason SQL Server takes your query off the CPU, and the query just waits until the asynchronous I/O operation has completed (in the meantime other queries can use the CPU resources more effectively). And finally your query continues with its execution.
The same thing happens when you have to acquire a lock – when you want to read or change data. When someone else has already acquired an incompatible lock, your query has to wait until the lock can be granted. And in the meantime SQL Server takes you off the CPU again, and the query has to wait until the incompatible lock of the other query is released, so that the query itself can acquire the requested lock.
In addition to the Resource Waits, SQL Server has query waits because of the Cooperative Scheduling that the SQLOS (SQL Server Operating System) implements internally. SQL Server bypasses the preemptive scheduling of the Windows OS, and schedules its threads itself. Because of this design SQL Server can scale more easily and provide you with better throughput. Because SQL Server itself decides when a query is actively running on the CPU, SQL Server also has to decide when a query has to be taken off the CPU, so that another query can run actively on that CPU and continue with its query execution. And for this reason SQL Server takes your query off the CPU as soon as the query has exceeded its so-called Quantum.
The quantum defines the time slice that the query can spend actively on the CPU. This time slice is 4 milliseconds long in SQL Server. This means that SQL Server takes your query off the CPU as soon as the query has done work that exceeded the time slice of 4 milliseconds. And therefore a query always just has to wait in SQL Server. If there are no Resource Waits, the exceeded quantum will kick in, and the query will be going off the CPU. Your queries are just always waiting!
The Query Life Cycle
By now we know that a query always has to wait in SQL Server. Let’s have a more detailed look at that. When you execute a query, then the query always goes through 3 different states, as you can see in the following picture:
Let’s talk about these 3 states in more detail. As long as your query is running actively on your CPU, the query is in the RUNNING state. The RUNNING state means that your query is currently performing some work. Being in the running state is always your primary goal. When SQL Server takes your query off the CPU, then the query is moved into the SUSPENDED state. The query waits as long as needed in the SUSPENDED state, until the requested resource is available (think back to pages that were read from your physical storage, or to an incompatible lock that can’t be granted immediately).
When the requested resource is available, then SQL Server moves your query into the RUNNABLE state. The RUNNABLE state means that your query is ready for continuing its execution, but it needs one additional essential thing: a CPU to run on. When there is currently no CPU available (because other queries are currently in the RUNNING state), then the query has to spend some time in the RUNNABLE state. Finally when a CPU becomes available, the query moves into the RUNNING state, and everything happens over and over again. A simple query can go hundreds, or even thousands of times through that query life cycle during its execution. I have also demonstrated this behaviour in one of my previous SQL Server Quickies.
Analyzing Wait Statistics
All these state transitions are tracked by SQL Server and are reported back to us through the Wait Statistics. SQL Server exposes the Wait Statistics through the Dynamic Management View sys.dm_os_wait_stats. Every record that is returned from this DMV is one wait reason in SQL Server. In SQL Server 2014 you have a total of 771 different reasons why a query can wait. What? 771 different reasons? Are you kidding me? That’s a lot! That’s right! But normally you just deal with just some specific wait reasons, because every one of us is dealing with almost the same performance problems in SQL Server:
- Slow Storage Subsystems
- Bad Indexing Strategies
- Locking/Blocking Problems
- Inefficient Parallel Execution Plans
- CPU Pressure
- …
When I have a slow SQL Server in front of me, I always have a more detailed look at the Wait Statistics as the first step, because they tell me why queries in SQL Server were waiting. But the Wait Statistics in SQL Server are just symptoms, not the root cause itself! Maybe SQL Server tells you through the Wait Statistics that you had some waits because of blocking situations in the past. But maybe you just have a bad indexing strategy and you are missing a very important Non-Clustered Index on a larger table that led to this blocking situation? With an additional Non-Clustered Index you provide SQL Server an additional data access path, and maybe your blocking situation can be simply solved? That was just one example (out of several) where the symptom was not the root cause.
Summary
In this blog posting I have given you an overview of why queries wait in SQL Server, and how these waits are tracked through the Wait Statistics. The most important thing about the Wait Statistics is the fact that SQL Server tells you only symptoms and not the underlying root causes. Your job as a troubleshooting person is to read and understand the Wait Statistics and finally dig more into the details of your SQL Server installation to find out the underlying root cause.
If you want to know more about Wait Statistics, you can also watch my SQL Server Quickie, which was released about this topic some weeks ago.
Thanks for your time,
-Klaus