"Never wait for trouble" – Chuck Yeager
Chuck Yeager is an Air Force veteran and test pilot who was the first person to break the sound barrier, in 1947. The above quote from Chuck sums up nicely a point that this series has been trying to drive home since the start: be proactive. Chuck achieved what he did through a mix of excellent technical abilities, thorough preparation, and calm demeanor in the face of potential disaster. Those are all qualities exhibited by a good Database Administrator, who will use baseline data to seek out potential causes of trouble on SQL Server, and can then act quickly to avert real trouble. Part 1 of this series covered five reasons to start capturing baseline data; Part 2 explained how to get started, Part 3 showed how to track disk space usage and in this final article of the series, we consider wait statistics.
Every time a SQL Server execution thread has to wait, before starting or proceeding with its work, SQL Server tracks the wait in the sys.dm_os_wait_Stats
DMV. It tracks both the source and length of the wait, and the waits are accumulated over time. A thread might need to wait, for example, for a resource, such as CPU, or for data to be read from disk into memory, or to acquire a lock. By establishing a baseline for such wait statistics under normal workload, we can understand our normal wait patterns. By tracking the data over time, we can look for sudden changes in wait behavior, and therefore determine the root cause of any identified issue.
In this article, I'll discuss strategies and techniques for capturing, reviewing and managing wait statistics data, so that you can identify worrying trends, such as a drift up in the prominence of a certain type of wait, and investigate the cause quickly and efficiently.
In order to work through this article, you'll need SQL Server 2005 or higher and a BaselineData
database in which to store baseline information (see the Part 2 link above, where you'll find the scripts to create this database).
Brief Background on Wait Statistics
I'm not going to attempt anything more than a brief introduction to wait statistics here, as I want to stay focused on the topic of collecting baselines. However, see the Further Reading list at the end of the article for a collection of articles and posts that provide deeper background.
As mentioned in the introduction, SQL Server knows when execution threads have had to wait for a resource, and for how long they have had to wait. This data is exposed in the sys.dm_os_wait_stats
DMV.
Waits and Queues
The monitoring of wait statistics forms part of a popular tuning methodology known as "waits and queues", first described by Tom Davidson in this whitepaper: http://sqlcat.com/whitepapers/archive/2007/11/19/sql-server-2005-waits-and-queues.aspx.
The values stored in sys.dm_os_wait_stats
are running totals, accumulated across all sessions since the last server restart, or a manual reset of the statistics using the DBCC SQLPERF
command (more on this shortly).
It is extremely important to understand that all SQL Server instances will have waits regardless of how well you tune and optimize them. Therefore, if this data accumulates over a long period and then we interrogate the DMV for, say, the "top 5 waits on a SQL Server instance", it will still be very difficult to know if any of these waits represent a potential problem, or are just "normal" for that instance. Our goal is to understand the usual waits for an instance. Once we know what's normal, then we can focus our tuning efforts, and we have a reference in the event that performance suddenly degrades.
Clearing Wait Stats
By default, SQL Server clears the cumulative wait statistics for an instance from the sys.dm_os_wait_stats
DMV upon instance restart. In addition, a DBA can clear the statistics manually using DBCC SQLPERF
(http://msdn.microsoft.com/en-us/library/ms189768.aspx).
It is not required to clear out wait statistics on a regular basis. However, in order to analyze this data in a meaningful way, both to understand what "normal" behavior is and also quickly spot abnormalities when comparing data from a previous period to current data, it is important that DBAs adopt a clear policy on the timing and frequency of clearing wait statistics.
If one DBA is clearing wait statistics at 6 AM daily, and another DBA is capturing the information at 7 AM, the data only represents the waits accumulated during an hour's workload, which may not represent the normal workload.
Ideally, we need to collect wait statistics for a period that represents normal activity, without clearing them. At the same time, however, the DBAs will want to understand how significant changes, such as adding a new index, or altering a configuration setting, affect the pattern of waits for the instance. Clearing the wait statistics immediately after making the change can help us understand the impact of the modification.
There are other reasons that may compel a DBA to clear waits statistics. For example, some companies use a single third-party utility for all backups, regardless of the application, resulting in an extended time to complete SQL Server database backups. While there are alternate options, which can perform very fast SQL Server backups (e.g. native SQL backups, dedicated third-party SQL backup applications), the DBA is unable to use them. In such cases, the DBA knows backup performance to be poor, but is unable to make improvements and instead may opt to clear out wait statistics after each backup job completes, to prevent any waits from the actual backup from influencing the interpretation of the wait statistics as a whole. Alternatively, the DBA can filter BACKUP*
waits from the output.
The queries in Listing 1 will reveal when wait statistics were last cleared by an instance restart, as well as if, and when, someone last cleared them manually. Simply compare the two values to see if wait statistics have been manually cleared since the last restart.
SELECT [wait_type] , [wait_time_ms] , DATEADD(SS, -[wait_time_ms] / 1000, GETDATE()) AS "Date/TimeCleared" , CASE WHEN [wait_time_ms] < 1000 THEN CAST([wait_time_ms] AS VARCHAR(15)) + ' ms' WHEN [wait_time_ms] BETWEEN 1000 AND 60000 THEN CAST(( [wait_time_ms] / 1000 ) AS VARCHAR(15)) + ' seconds' WHEN [wait_time_ms] BETWEEN 60001 AND 3600000 THEN CAST(( [wait_time_ms] / 60000 ) AS VARCHAR(15)) + ' minutes' WHEN [wait_time_ms] BETWEEN 3600001 AND 86400000 THEN CAST(( [wait_time_ms] / 3600000 ) AS VARCHAR(15)) + ' hours' WHEN [wait_time_ms] > 86400000 THEN CAST(( [wait_time_ms] / 86400000 ) AS VARCHAR(15)) + ' days' END AS "TimeSinceCleared" FROM [sys].[dm_os_wait_stats] WHERE [wait_type] = 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP'; /* check SQL Server start time - 2008 and higher */SELECT [sqlserver_start_time] FROM [sys].[dm_os_sys_info]; /* check SQL Server start time - 2005 and higher */SELECT [create_date] FROM [sys].[databases] WHERE [database_id] = 2
Ultimately, it is at the discretion of the DBAs to decide when to clear out wait statistics, if at all. If in doubt, collect your wait statistics (see the next section) for a period that will capture a representative workload (for example one month). The next month, collect the stats on am more regular schedule, such as every Sunday, after scheduled code changes, and then immediately clear out the sys.dm_os_wait_stats
DMV. Compare each of the four 1-week data sets to the 1-month set: do different wait patterns exist (for example, perhaps the last week of the month, when various business reports run, has different waits), or are they consistent across all five sets? If you see differences then you may want to consider clearing out the stats on a regular (e.g. weekly) basis.
Reviewing the collected wait stats is discussed in more detail in the Reviewing Wait Statistics section.
Collecting Wait Statistics for Analysis
In order to collect wait statistics, on a regular schedule, the first step is to create a table to hold the information, as shown in Listing 2 (as described previously, this script assumes the BaselineData
database exists).
USE [BaselineData]; GO IF NOT EXISTS ( SELECT * FROM [sys].[tables] WHERE [name] = N'WaitStats' AND [type] = N'U' ) CREATE TABLE [dbo].[WaitStats] ( [RowNum] [BIGINT] IDENTITY(1, 1) , [CaptureDate] [DATETIME] , [WaitType] [NVARCHAR](120) , [Wait_S] [DECIMAL](14, 2) , [Resource_S] [DECIMAL](14, 2) , [Signal_S] [DECIMAL](14, 2) , [WaitCount] [BIGINT] , [Percentage] [DECIMAL](4, 2) , [AvgWait_S] [DECIMAL](14, 2) , [AvgRes_S] [DECIMAL](14, 2) , [AvgSig_S] [DECIMAL](14, 2) ); GO CREATE CLUSTERED INDEX CI_WaitStats ON [dbo].[WaitStats] ([RowNum], [CaptureDate]);
The second step is simply to schedule a query to run on a regular basis, which captures the wait information to this table. Listing 3 derives from a query presented in Paul Randal's wait statistics post. This query uses a CTE to capture the raw wait statistics data, and then manipulates the output to include averages, for example, average wait (AvgWait_S
), and average signal wait (AvgSig_S
). I included an additional, optional INSERT
as it helps to separate each set of data collected when reviewing the output.
USE [BaselineData]; GO INSERT INTO dbo.WaitStats ( [WaitType] ) VALUES ( 'Wait Statistics for ' + CAST(GETDATE() AS NVARCHAR(19)) ); INSERT INTO dbo.WaitStats ( [CaptureDate] , [WaitType] , [Wait_S] , [Resource_S] , [Signal_S] , [WaitCount] , [Percentage] , [AvgWait_S] , [AvgRes_S] , [AvgSig_S] ) EXEC ( ' WITH [Waits] AS (SELECT [wait_type], [wait_time_ms] / 1000.0 AS [WaitS], ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS], [signal_wait_time_ms] / 1000.0 AS [SignalS], [waiting_tasks_count] AS [WaitCount], 100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage], ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum] FROM sys.dm_os_wait_stats WHERE [wait_type] NOT IN ( N''CLR_SEMAPHORE'', N''LAZYWRITER_SLEEP'', N''RESOURCE_QUEUE'', N''SQLTRACE_BUFFER_FLUSH'', N''SLEEP_TASK'', N''SLEEP_SYSTEMTASK'', N''WAITFOR'', N''HADR_FILESTREAM_IOMGR_IOCOMPLETION'', N''CHECKPOINT_QUEUE'', N''REQUEST_FOR_DEADLOCK_SEARCH'', N''XE_TIMER_EVENT'', N''XE_DISPATCHER_JOIN'', N''LOGMGR_QUEUE'', N''FT_IFTS_SCHEDULER_IDLE_WAIT'', N''BROKER_TASK_STOP'', N''CLR_MANUAL_EVENT'', N''CLR_AUTO_EVENT'', N''DISPATCHER_QUEUE_SEMAPHORE'', N''TRACEWRITE'', N''XE_DISPATCHER_WAIT'', N''BROKER_TO_FLUSH'', N''BROKER_EVENTHANDLER'', N''FT_IFTSHC_MUTEX'', N''SQLTRACE_INCREMENTAL_FLUSH_SLEEP'', N''DIRTY_PAGE_POLL'') ) SELECT GETDATE(), [W1].[wait_type] AS [WaitType], CAST ([W1].[WaitS] AS DECIMAL(14, 2)) AS [Wait_S], CAST ([W1].[ResourceS] AS DECIMAL(14, 2)) AS [Resource_S], CAST ([W1].[SignalS] AS DECIMAL(14, 2)) AS [Signal_S], [W1].[WaitCount] AS [WaitCount], CAST ([W1].[Percentage] AS DECIMAL(4, 2)) AS [Percentage], CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgWait_S], CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgRes_S], CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgSig_S] FROM [Waits] AS [W1] INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum] GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS], [W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage] HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95;' ); GO
We should capture wait statistics regularly, at least once a week or once a month. We could do so more frequently, perhaps daily, but remember that unless we are clearing the data regularly, they represent an aggregation of waits since the last restart. The longer the waits have been accumulating, the harder it may be to spot smaller changes in wait percentages. For example, let's say the system suffers a short period (one hour) of poor performance, during which the number of, and duration of, a certain wait type increases significantly. This 'spike' may be hard to spot if you're analyzing waits accumulated over a long period (e.g. a month), since the spike might not affect significantly the overall wait percentage for that period.
Reviewing Wait Statistics Data
You'll want to review regularly the waits for each SQL Server instance. If you capture them once a week, then check on the trending once a week. The simple SELECT
in Listing 4 retrieves from the dbo.WaitStats
table all the data captured for the last 30 days
SELECT * FROM [dbo].[WaitStats] WHERE [CaptureDate] > GETDATE() - 30 ORDER BY [RowNum];
If you need to view older data, adjust the number of days as necessary, or remove the predicate entirely. In some cases, it might be ideal to look at only the top wait for each set of data collected, as shown in Listing 5 (again, alter the number of days as needed).
SELECT [w].[CaptureDate] , [w].[WaitType] , [w].[Percentage] , [w].[Wait_S] , [w].[WaitCount] , [w].[AvgWait_S] FROM [dbo].[WaitStats] w JOIN ( SELECT MIN([RowNum]) AS [RowNumber] , [CaptureDate] FROM [dbo].[WaitStats] WHERE [CaptureDate] IS NOT NULL AND [CaptureDate] > GETDATE() - 30 GROUP BY [CaptureDate] ) m ON [w].[RowNum] = [m].[RowNumber] ORDER BY [w].[CaptureDate];
There are many ways in which to examine this data, but your focus, initially, should be to understand the top waits in your system and ensure they're consistent over time. Expect to tweak your capture and monitoring process in the first few weeks of implementation.
As discussed earlier, it's up to the DBAs to decide on a clear, consistent policy on how often to collect and analyze this data, and when to clear out the sys.dm_os_wait_stats
DMV. Here, by way of a starting point, I offer three possible options for clearing, capturing and reviewing this data:
Option 1:
- Never clear wait statistics
- Capture weekly (at the end of any business day)
- Review weekly
Option 2:
- Clear wait statistics on Sunday nights (or after a full weekly backup)
- Capture daily at the end of the business day
- Review daily, checking to see if the percentages for wait types vary throughout the week
Option 3:
- Clear wait statistics nightly (after full or differential backups complete)
- Capture daily, at the end of the business day (optional: capture after any evening or overnight processing)
- Review daily, checking to see how the waits and their percentages vary throughout the week (and throughout the day if capturing more than once a day)
The critical point to remember is that you are capturing this data to achieve a baseline, and to understand "normal" wait patterns on your systems. However, it's common, as you're reviewing this information, to identify existing or potential bottlenecks. This is a good thing. Having this information allows you to investigate an unexpected or high wait type, and determine the possible source of the bottleneck that caused the wait, before it becomes a production problem.
Managing Historical Data
As with all baseline data, it will cease to be relevant, after a certain point, and you can remove it from the BaselineData
database. The query in Listing 6 removes data older than 90 days, but you can adjust this value as appropriate for your environment. The overall size of the dbo.WaitStats
table will depend on the choices you make on how often to capture the data and how long to retain it.
DELETE FROM [dbo].[WaitStats] WHERE [CaptureDate] < GETDATE() – 90;
Alternatively, if you've been following the series you may have already implemented the dbo.usp_PurgeOldData
stored procedure (see Part 2), in which case you can modify it to incorporate this DELETE
statement, as shown in Listing 7.
IF OBJECTPROPERTY(OBJECT_ID(N'usp_PurgeOldData'), 'IsProcedure') = 1 DROP PROCEDURE usp_PurgeOldData; GO CREATE PROCEDURE dbo.usp_PurgeOldData ( @PurgeConfig SMALLINT , @PurgeCounters SMALLINT , @PurgeWaits SMALLINT ) AS BEGIN; IF @PurgeConfig IS NULL OR @PurgeCounters IS NULL BEGIN; RAISERROR(N'Input parameters cannot be NULL', 16, 1); RETURN; END; DELETE FROM [dbo].[ConfigData] WHERE [CaptureDate] < GETDATE() - @PurgeConfig; DELETE FROM [dbo].[ServerConfig] WHERE [CaptureDate] < GETDATE() - @PurgeConfig; DELETE FROM [dbo].[PerfMonData] WHERE [CaptureDate] < GETDATE() - @PurgeCounters; DELETE FROM [dbo].[WaitStats] WHERE [CaptureDate] < GETDATE() - @PurgeWaits; END;
To delete configuration data older than 30 days, performance counter data older than 7 days, and wait statistics older than 90 days, respectively, execute the following statement:
EXEC dbo.usp_PurgeOldData 30, 7, 90
Final Notes
Wait statistics are one of the best places for a Database Administrator to start when tuning a SQL Server environment or troubleshooting a performance issue. While wait statistics alone will not solve a problem, they are a key piece of information that will point you in the right direction, particularly when you have baseline values that you can reference.
The queries provided in this article should serve as a good starting point for any DBA to capture and review the wait statistics. As always, remember to add the dbo.WaitStats
table to your optimization tasks, and add indexes as needed to support new queries and reports.
This article concludes my series on capturing baselines on SQL Server instances. Success with baselines depends on having methods in place to both collect the required data and review it quickly. Start with the information that is most vital to the management of your solutions, and then build on that foundation over time. Remember that the effort you put in initially, to collect this data, will save time in the end. It may also mean the difference between spending all night in front of the server, trying to troubleshoot a performance issue, and spending an hour or two of time analyzing the baseline data to pinpoint and solve the problem, and returning to your own bed for a good night's sleep. Make the time, and make your job easier.
Further Reading
- Books Online reference for sys.dm_os_wait_stats http://technet.microsoft.com/en-us/library/ms179984.aspx
- Wait statistics, or please tell me where it hurtshttp://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/
- A Performance Troubleshooting Methodology for SQL Serverhttp://www.simple-talk.com/sql/performance/a-performance-troubleshooting-methodology-for-sql-server/
- Performance Tuning with SQL Server Dynamic Management Views (Free eBook)http://www.simple-talk.com/books/sql-books/performance-tuning-with-sql-server-dynamic-management-views/
- Baselining and Benchmarking (video)http://technet.microsoft.com/en-us/sqlserver/gg429821.aspx
- Waits and Queues (video)http://technet.microsoft.com/en-us/sqlserver/gg508897.aspx
- SQL Server: Performance Troubleshooting Using Wait Statistics (video)http://www.pluralsight.com/training/Courses/TableOfContents/sqlserver-waits
- SQL Server: Benchmarking and Baselining (video)http://www.pluralsight.com/training/Courses/TableOfContents/sqlserver-benchmarking