The Accidental DBA - Wait Stats

Question

Post reply

The Accidental DBA - Wait Stats

stephen.plant

SSC Eights!

Points: 811

March 6, 2017 at 3:11 am

#401737

Hello All,

Am after a little bit of advice.

In my ever expanding career as a DBA with my current company, a role I fell in to I am trying to diagnose why a production SQL instance on occasion is reported as slowing down.

This has been an ongoing battle with a consortium member who has their own application that interfaces with our system.

One of the topics I am exploring is Waits, a previous DBD had gathered some stats as -

WaitType	NumWaits	WaitSecs	WaitPct
CXPACKET	8862234	60996.36	87.76
LATCH_EX	3348554	3249.08	4.68
PREEMPTIVE_OLEDBOPS	16390	1114	1.61
WRITELOG	249757	1004.57	1.45
OLEDB	9269	879.97	1.27
SOS_SCHEDULER_YIELD	599550	532.33	0.77
LCK_M_IX	139	519.94	0.75
LCK_M_S	16304	376.91	0.55
PAGEIOLATCH_SH	18825	200.43	0.29
PAGEIOLATCH_EX	18237	144.75	0.21

And reported that -

“The large number of CXPACKET waits, LATCH_EX waits and ACCESS_METHODS_DATASET_PARENT latch waits indicate that parallel query is going wrong.
The wrong way to fix this would be to switch parallelism off or to increase the cost threshold for parallelism further.
Both of these approaches would prevent queries that would benefit from parallel query from being able to use it.”

I have gathered some more up to date waits as

WaitType	NumWaits	WaitSecs	WaitPct
CXPACKET	1157053340	5625472.52	41.11
WRITELOG	504317894	1346463.45	9.84
OLEDB	26333852	1225484.54	8.95
PREEMPTIVE_OLEDBOPS	28305166	1212049.53	8.86
LCK_M_S	25643443	937286.31	6.85
PAGEIOLATCH_SH	41028159	738076.13	5.39
LCK_M_IX	63552	650726.77	4.76
SOS_SCHEDULER_YIELD	679518705	512659.17	3.75
PAGEIOLATCH_EX	24280253	327382.99	2.39
LCK_M_U	24353974	248477.78	1.82
ASYNC_IO_COMPLETION	42007	177588.45	1.3

I am aware of the Wait types but could do with a little guidance.

Thanks in advance.

Stephen

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

John Mitchell-245523 SSC Guru Points: 148809 More actions · Answer 1

Stephen

Yes, the advice from the previous DBA is good. What I would do is speak to users to find our excatly what they're doing when the application is running slow. Find out which queries are the most expensive in terms of elapsed time and see if you can tune them. That won't necessarily be the queries that take the longest to run, because you'll also want to take into account how often they are executed as well. And read Paul Randal's pages on wait stats, especially with regard to the wait types in your list.

Are you capturing wait stats regularly, or is what you posted the raw data from sys,dm_os_wait_stats, which is cumulative since wait stats were last cleared?

John

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 2

I don't like looking at percentage waits, and imo, waits aggregated over more than an hour are useless, I can't tell what might have caused them.

Create a job that writes wait stats to a table, run it every 30 minutes. Use Glenn Berry's latest DMV scripts for the list of waits to ignore. Track over a day or two and then look at the highest waits over periods that you're interested in.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 3

John Mitchell-245523 - Monday, March 6, 2017 3:25 AM
Yes, the advice from the previous DBA is good.

Yup, although since the latest waits have CXPacket but no Latch_EX, there's probably less of a problem with parallel queries now than there was when the initial waits were captured.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

stephen.plant SSC Eights! Points: 811 More actions · Answer 4

Hi Both,

Thanks for thodse replies.

Sadly this is a remote instance that I have no direct access to and have to do everything remotely.

The first wait stats were taken in October 2015 and I made a site visit this February, I think they are all cumulative.

I can't make changes like setting up a job, I had to jump through hoops to get them to put in some server side tracing so I could profile some specific RPC durations over 5 seconds.

I understand than now that LATCH_EX is no longer present our issues with parallelism has reduced, but other things are now causing us issues which I am trying to identify.

Shall take a look a Glenn's DMV to see if it's any later than the bits I've already pinched from the very useful armoury.

Any further tips welcomed.

Stephen

John Mitchell-245523 SSC Guru Points: 148809 More actions · Answer 5

Stephen

The only other thing I would say is that you need to explain to whoever owns the server that if they want you to manage it, you need proper access to it. If they're stopping you from doing stuff like capturing wait stats then you're going to have no control when performance deteriorates even furher.

John

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 6

I think they are all cumulative.

They are. Cumulative since SQL started (most people don't clear wait stats).

Can you get permission to add a job? Or ask whoever has access to run a query extracting wait stats every half an hour and send you the results?
Because it's really hard to say anything useful about stats aggregated over days or weeks. They're probably skewed by maintenance jobs, and hence hard to interpret.

If you want to use the waits to diagnose why the server is slow at particular times, then you need to see what waits were accumulated when the server was slow

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 7

GilaMonster - Monday, March 6, 2017 3:56 AM
I think they are all cumulative.
They are. Cumulative since SQL started (most people don't clear wait stats).
Can you get permission to add a job? Or ask whoever has access to run a query extracting wait stats every half an hour and send you the results?
Because it's really hard to say anything useful about stats aggregated over days or weeks. They're probably skewed by maintenance jobs, and hence hard to interpret.
If you want to use the waits to diagnose why the server is slow at particular times, then you need to see what waits were accumulated when the server was slow

Spot on as usual Gail.

Differential wait stats analysis (and same for file IO stalls) are bread-and-butter tuning mechanisms. And you need shorter intervals to be helpful. Also, I very frequently run these two with just 3 or 5 minutes of delay for the interval, captured during a period that I know is experiencing performance problems.

Have you also looked for top poorly performing queries? Or how many rows they are trying to spool back to the client? Or reasons OUTSIDE of SQL Server why the application could be slow?

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

stephen.plant SSC Eights! Points: 811 More actions · Answer 8

Thanks for the replies, every day is a school day.

I am trying to get some wait stats over a period of time starting with a 1 min delay so far for a starter.

As mentioned, I have no access to this server me in the UK and the system in France with no outside country access so only get to hear about performance issues post incident.

I am gathering some stats on queries this afternoon, also, the client application that hooks into our instance is Java based and uses our procs but exec sp_prepexec them?

Thanks again

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 9

stephen.plant2 - Monday, March 6, 2017 7:12 AM
Thanks for the replies, every day is a school day.
I am trying to get some wait stats over a period of time starting with a 1 min delay so far for a starter.
As mentioned, I have no access to this server me in the UK and the system in France with no outside country access so only get to hear about performance issues post incident.
I am gathering some stats on queries this afternoon, also, the client application that hooks into our instance is Java based and uses our procs but exec sp_prepexec them?
Thanks again

All you mentioned is bad, sorry to say. I am EXCEPTIONALLY good at performance tuning SQL Server - finding and making problems go away. But often even I cannot fix things without some forensic information WHILE THE EVENT IS HAPPENING. It is just silly to be reactive and not be able to access the box you are supposed to be tuning. And Java apps are just horrible. I have never, in 20+ years of consulting on SQL Server, come across one that wasn't just awful. In all fairness I don't get called in for routine ops, and thus pretty much all applications I come across suck. 😎

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

lmarkum SSCertifiable Points: 7989 More actions · Answer 10

Stephen,

Can you provide queries to the application care takers and ask them to run them right after a problem is reported? If so, you can offer them something like sp_BlitzCache from Brent Ozar or a similar query against the cache to get the most expensive queries. If that data is gathered within a few minutes of the reported problem, then you can work on the problem.

stephen.plant SSC Eights! Points: 811 More actions · Answer 11

lmarkum - Monday, March 6, 2017 10:30 PM
Stephen,
Can you provide queries to the application care takers and ask them to run them right after a problem is reported? If so, you can offer them something like sp_BlitzCache from Brent Ozar or a similar query against the cache to get the most expensive queries. If that data is gathered within a few minutes of the reported problem, then you can work on the problem.

It is quite a difficult task to undertake to get our consortium partners to act on this. They definitely would not nor would we run a standard Blitz script on our prod instances.

We are investigating a method to give our first line support based in country to access a more real time analysis method now.

Thanks for all the replies.

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 12

Middle and senior IT management at your company needs to be taken out behind the wood shed and soundly thrashed. Their decisions and failure to act are a root cause of what is going on.

In any case, good luck with it! 🙂

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service