A couple of weeks ago I gave you some scripts to collect wait stats from your SQL Server instance over time. Assuming that you downloaded those scripts and started using them right away, you now have two weeks worth of wait stats data collected. Do you know what to do with it?
It’s T-SQL Tuesday time again. The topic this month is “waits”. Just my luck – I wrote an amazing post on waits two weeks ago. Since reposting that one would violate the rules of T-SQL Tuesday, I wrote a follow-up! Hopefully you’re using the scripts from that earlier post and have collected some wait stats data.
Collecting all of that data means nothing if you don’t know how to make use of it. Your boss and your users don’t care – they just want to know why the server is slow now, or why it was slow yesterday. As a follow-up to that previous post, I’m going to show you some simple ways to do that using the data that you’ve been collecting.
Let’s start with some simple analysis – over the past seven days, what has our SQL instance spent the most time waiting on? Run this query in the database where you’re writing the wait stats data:
-- How many days past to report on?
DECLARE @PreviousDays INT = 7;
DECLARE @Command VARCHAR(MAX);

-- Build dynamic UNION statement to combine collection tables
WITH Dates_CTE (DateValue) AS
(
    SELECT DateValue = CAST(DATEADD(DAY, -1, GETDATE()) AS DATE)
    UNION ALL
    SELECT DateValue = CAST(DATEADD(DAY, -1, DateValue) AS DATE)
    FROM Dates_CTE
    WHERE DateValue > DATEADD(DAY, -@PreviousDays, GETDATE())
)
SELECT @Command = STUFF((
    SELECT ';SELECT * FROM WaitStats_Log_' + CONVERT(VARCHAR(25), DateValue, 112)
    FROM Dates_CTE
    FOR XML PATH('')), 1, 1, '');

-- Build dynamic SQL statement to sum the top 10 wait types
SET @Command = '
SELECT TOP 10
    WaitType,
    TotalWaitSeconds = SUM(Wait_Seconds)
FROM (SELECT * FROM WaitStats_Log
      UNION
      ' + REPLACE(@Command, ';', ' UNION ') + ') AS TotalWaits
GROUP BY WaitType
ORDER BY SUM(Wait_Seconds) DESC;';

-- Execute dynamic SQL statement
EXECUTE(@Command);
You should get back data that looks something like this:
What does this tell us? Well, it tells us a couple of things. First, it tells us that this SQL instance spent more time waiting for network I/O than anything else. It also tells us that it spent less than 2% of the week (there are 604,800 seconds in a week) waiting on that network I/O. Unless somebody is complaining, I’m done here – I’m not chasing this any further for 2%.
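The 2% figure is just arithmetic. Assuming the top wait totaled roughly 10,000 seconds over the week (a made-up number for illustration), the math looks like this:

```sql
-- 604,800 seconds in a week; 10,000 is a hypothetical top-wait total
SELECT WaitPercentOfWeek = 10000.0 / 604800 * 100;  -- roughly 1.65%
```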
For the sake of example, however: if somebody were complaining about performance, and these were the results from my SQL instance, I could comfortably say “it’s not a database problem”. Unless my server is experiencing network issues – which would be causing me other problems – these results point to an application problem. When an application requests data from SQL Server, the data is collected (typically by running a query) and then returned to the application. As SQL Server returns the data, it waits for the application to acknowledge that it received the data. This waiting is reported as ASYNC_NETWORK_IO. If the application can’t consume the data fast enough, we’ll see high ASYNC_NETWORK_IO waits.
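If you want a quick snapshot of ASYNC_NETWORK_IO independent of the logging tables, the sys.dm_os_wait_stats DMV – the same source the collection scripts draw from – holds cumulative totals since the last SQL Server restart. A minimal check might look like this:

```sql
-- Cumulative ASYNC_NETWORK_IO waits since the last SQL Server restart
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_seconds = wait_time_ms / 1000.0
FROM sys.dm_os_wait_stats
WHERE wait_type = 'ASYNC_NETWORK_IO';
```

Keep in mind these numbers are cumulative, which is exactly why the periodic collection tables are so much more useful for answering “when was it slow?” questions.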
Ok, that’s great, but why was the server slow between 3:00am and 4:00am this morning? Can you answer that, smart guy? Yep:
SELECT
    WaitType,
    TotalWaitSeconds = SUM(Wait_Seconds)
FROM WaitStats_Log
WHERE CollectionTime >= DATEADD(HOUR, 3, CAST(CAST(GETDATE() AS DATE) AS DATETIME))
  AND CollectionTime < DATEADD(HOUR, 4, CAST(CAST(GETDATE() AS DATE) AS DATETIME))
GROUP BY WaitType
ORDER BY SUM(Wait_Seconds) DESC;
Seems pretty obvious – backups run at 3:00am. I think we can blame the slowness on backups. Nothing to see here, move along.
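If you want to confirm the backup theory from the same collected data, the backup-related wait types (BACKUPIO, BACKUPBUFFER, BACKUPTHREAD) should dominate that hour. As a sketch, using the same table and column names as above:

```sql
-- Backup-related waits during the 3:00am-4:00am window
SELECT
    WaitType,
    TotalWaitSeconds = SUM(Wait_Seconds)
FROM WaitStats_Log
WHERE CollectionTime >= DATEADD(HOUR, 3, CAST(CAST(GETDATE() AS DATE) AS DATETIME))
  AND CollectionTime < DATEADD(HOUR, 4, CAST(CAST(GETDATE() AS DATE) AS DATETIME))
  AND WaitType LIKE 'BACKUP%'
GROUP BY WaitType
ORDER BY SUM(Wait_Seconds) DESC;
```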
Taking a proactive stance – in terms of wait time, what has been the worst hour over the past week? Easy to figure that out since we have supporting data:
-- How many days past to report on?
DECLARE @PreviousDays INT = 7;
DECLARE @Command VARCHAR(MAX);

-- Build dynamic UNION statement to combine collection tables
WITH Dates_CTE (DateValue) AS
(
    SELECT DateValue = CAST(DATEADD(DAY, -1, GETDATE()) AS DATE)
    UNION ALL
    SELECT DateValue = CAST(DATEADD(DAY, -1, DateValue) AS DATE)
    FROM Dates_CTE
    WHERE DateValue > DATEADD(DAY, -@PreviousDays, GETDATE())
)
SELECT @Command = STUFF((
    SELECT ';SELECT * FROM WaitStats_Log_' + CONVERT(VARCHAR(25), DateValue, 112)
    FROM Dates_CTE
    FOR XML PATH('')), 1, 1, '');

-- Build dynamic SQL statement to sum the wait types by hour
SET @Command = '
SELECT TOP 10
    TargetHour = DATEADD(HOUR, DATEDIFF(HOUR, 0, CollectionTime), 0),
    TotalWaitSeconds = SUM(Wait_Seconds)
FROM (SELECT * FROM WaitStats_Log
      UNION
      ' + REPLACE(@Command, ';', ' UNION ') + ') AS TotalWaits
GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, CollectionTime), 0)
ORDER BY SUM(Wait_Seconds) DESC;';

-- Execute dynamic SQL statement
EXECUTE(@Command);
From this, we can see that the 4:00am hour last Monday was the worst hour in the past week. We can also see that 4:00am is routinely the worst hour of the day – might be worth investigating.
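To verify that 4:00am is consistently bad rather than a one-off, you can group by hour of day instead of by specific hour. As a sketch against a single collection table (extending it across the daily tables works the same way as the dynamic UNION above):

```sql
-- Total wait per hour-of-day; a consistently high hour suggests a scheduled job
SELECT
    HourOfDay = DATEPART(HOUR, CollectionTime),
    TotalWaitSeconds = SUM(Wait_Seconds)
FROM WaitStats_Log
GROUP BY DATEPART(HOUR, CollectionTime)
ORDER BY SUM(Wait_Seconds) DESC;
```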
First, let’s see what was going on last Monday between 4:00am and 5:00am. To find out, we’re going to use the other collection table that I showed you in the previous article – the one collecting output from sp_WhoIsActive. Specifically, we’re interested in the collection table from 12-02-2013:
SELECT *
FROM WhoIsActive_Log_20131202
WHERE collection_time >= '12/2/2013 04:00'
  AND collection_time < '12/2/2013 05:00'
ORDER BY CPU DESC;
Sorry, I can’t show you the full query (it’s a secret), but from what you see here we can tell a lot. Between 4:00am and 5:00am on December 2nd, multiple UPDATE statements were executed, each taking longer than 10 minutes to run. The longest took 15 minutes. Let’s check the 4:00am-5:00am hour on a different day:
SELECT *
FROM WhoIsActive_Log_20131208
WHERE collection_time >= '12/8/2013 04:00'
  AND collection_time < '12/8/2013 05:00'
ORDER BY CPU DESC;
Again, a series of UPDATE statements taking 10-15 minutes to complete. Some of these we recognize from the collection from December 2nd. If *I* were analysing this SQL instance, I would suspect a nightly batch process of some kind that runs at 4:00am. At the very least, I have a set of queries that I can look into tuning.
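To build that list of tuning candidates, you can group the sp_WhoIsActive collections by statement text and see which queries recur. sp_WhoIsActive captures the statement in an XML sql_text column, so convert it before grouping; this is a sketch assuming the default output column names:

```sql
-- Which statements show up most often in the 4:00am-5:00am window?
-- Converting to VARCHAR(4000) may truncate very long statements
SELECT
    QueryText = CONVERT(VARCHAR(4000), sql_text),
    Occurrences = COUNT(*)
FROM WhoIsActive_Log_20131202
WHERE collection_time >= '12/2/2013 04:00'
  AND collection_time < '12/2/2013 05:00'
GROUP BY CONVERT(VARCHAR(4000), sql_text)
ORDER BY COUNT(*) DESC;
```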
What you’ve seen here really just scratches the surface of what you can do with wait stats analysis. By collecting wait stats on your server over time, combined with the active processes during that same time, you can easily spot potential problems on your own, or dig deep into user complaints of poor performance.
Set up your collection scripts, start collecting data, and then USE the data that you’re collecting. If you’re not using the data, you’re doing it wrong…