March 6, 2017 at 3:11 am
Hello All,
Am after a little bit of advice.
In my ever expanding career as a DBA with my current company, a role I fell in to I am trying to diagnose why a production SQL instance on occasion is reported as slowing down.
This has been an ongoing battle with a consortium member who has their own application that interfaces with our system.
One of the topics I am exploring is Waits, a previous DBD had gathered some stats as -
WaitType | NumWaits | WaitSecs | WaitPct |
CXPACKET | 8862234 | 60996.36 | 87.76 |
LATCH_EX | 3348554 | 3249.08 | 4.68 |
PREEMPTIVE_OLEDBOPS | 16390 | 1114 | 1.61 |
WRITELOG | 249757 | 1004.57 | 1.45 |
OLEDB | 9269 | 879.97 | 1.27 |
SOS_SCHEDULER_YIELD | 599550 | 532.33 | 0.77 |
LCK_M_IX | 139 | 519.94 | 0.75 |
LCK_M_S | 16304 | 376.91 | 0.55 |
PAGEIOLATCH_SH | 18825 | 200.43 | 0.29 |
PAGEIOLATCH_EX | 18237 | 144.75 | 0.21 |
And reported that -
βThe large number of CXPACKET waits, LATCH_EX waits and ACCESS_METHODS_DATASET_PARENT latch waits indicate that parallel query is going wrong.
The wrong way to fix this would be to switch parallelism off or to increase the cost threshold for parallelism further.
Both of these approaches would prevent queries that would benefit from parallel query from being able to use it.β
I have gathered some more up to date waits as
WaitType | NumWaits | WaitSecs | WaitPct |
CXPACKET | 1157053340 | 5625472.52 | 41.11 |
WRITELOG | 504317894 | 1346463.45 | 9.84 |
OLEDB | 26333852 | 1225484.54 | 8.95 |
PREEMPTIVE_OLEDBOPS | 28305166 | 1212049.53 | 8.86 |
LCK_M_S | 25643443 | 937286.31 | 6.85 |
PAGEIOLATCH_SH | 41028159 | 738076.13 | 5.39 |
LCK_M_IX | 63552 | 650726.77 | 4.76 |
SOS_SCHEDULER_YIELD | 679518705 | 512659.17 | 3.75 |
PAGEIOLATCH_EX | 24280253 | 327382.99 | 2.39 |
LCK_M_U | 24353974 | 248477.78 | 1.82 |
ASYNC_IO_COMPLETION | 42007 | 177588.45 | 1.3 |
I am aware of the Wait types but could do with a little guidance.
Thanks in advance.
Stephen
March 6, 2017 at 3:25 am
Stephen
Yes, the advice from the previous DBA is good. What I would do is speak to users to find our excatly what they're doing when the application is running slow. Find out which queries are the most expensive in terms of elapsed time and see if you can tune them. That won't necessarily be the queries that take the longest to run, because you'll also want to take into account how often they are executed as well. And read Paul Randal's pages on wait stats, especially with regard to the wait types in your list.
Are you capturing wait stats regularly, or is what you posted the raw data from sys,dm_os_wait_stats, which is cumulative since wait stats were last cleared?
John
March 6, 2017 at 3:26 am
I don't like looking at percentage waits, and imo, waits aggregated over more than an hour are useless, I can't tell what might have caused them.
Create a job that writes wait stats to a table, run it every 30 minutes. Use Glenn Berry's latest DMV scripts for the list of waits to ignore. Track over a day or two and then look at the highest waits over periods that you're interested in.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
March 6, 2017 at 3:28 am
John Mitchell-245523 - Monday, March 6, 2017 3:25 AMYes, the advice from the previous DBA is good.
Yup, although since the latest waits have CXPacket but no Latch_EX, there's probably less of a problem with parallel queries now than there was when the initial waits were captured.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
March 6, 2017 at 3:47 am
Hi Both,
Thanks for thodse replies.
Sadly this is a remote instance that I have no direct access to and have to do everything remotely.
The first wait stats were taken in October 2015 and I made a site visit this February, I think they are all cumulative.
I can't make changes like setting up a job, I had to jump through hoops to get them to put in some server side tracing so I could profile some specific RPC durations over 5 seconds.
I understand than now that LATCH_EX is no longer present our issues with parallelism has reduced, but other things are now causing us issues which I am trying to identify.
Shall take a look a Glenn's DMV to see if it's any later than the bits I've already pinched from the very useful armoury.
Any further tips welcomed.
Stephen
March 6, 2017 at 3:54 am
Stephen
The only other thing I would say is that you need to explain to whoever owns the server that if they want you to manage it, you need proper access to it. If they're stopping you from doing stuff like capturing wait stats then you're going to have no control when performance deteriorates even furher.
John
March 6, 2017 at 3:56 am
I think they are all cumulative.
They are. Cumulative since SQL started (most people don't clear wait stats).
Can you get permission to add a job? Or ask whoever has access to run a query extracting wait stats every half an hour and send you the results?
Because it's really hard to say anything useful about stats aggregated over days or weeks. They're probably skewed by maintenance jobs, and hence hard to interpret.
If you want to use the waits to diagnose why the server is slow at particular times, then you need to see what waits were accumulated when the server was slow
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
March 6, 2017 at 6:52 am
GilaMonster - Monday, March 6, 2017 3:56 AMI think they are all cumulative.
They are. Cumulative since SQL started (most people don't clear wait stats).
Can you get permission to add a job? Or ask whoever has access to run a query extracting wait stats every half an hour and send you the results?
Because it's really hard to say anything useful about stats aggregated over days or weeks. They're probably skewed by maintenance jobs, and hence hard to interpret.If you want to use the waits to diagnose why the server is slow at particular times, then you need to see what waits were accumulated when the server was slow
Spot on as usual Gail.
Differential wait stats analysis (and same for file IO stalls) are bread-and-butter tuning mechanisms. And you need shorter intervals to be helpful. Also, I very frequently run these two with just 3 or 5 minutes of delay for the interval, captured during a period that I know is experiencing performance problems.
Have you also looked for top poorly performing queries? Or how many rows they are trying to spool back to the client? Or reasons OUTSIDE of SQL Server why the application could be slow?
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
March 6, 2017 at 7:12 am
Thanks for the replies, every day is a school day.
I am trying to get some wait stats over a period of time starting with a 1 min delay so far for a starter.
As mentioned, I have no access to this server me in the UK and the system in France with no outside country access so only get to hear about performance issues post incident.
I am gathering some stats on queries this afternoon, also, the client application that hooks into our instance is Java based and uses our procs but exec sp_prepexec them?
Thanks again
March 6, 2017 at 7:40 am
stephen.plant2 - Monday, March 6, 2017 7:12 AMThanks for the replies, every day is a school day.I am trying to get some wait stats over a period of time starting with a 1 min delay so far for a starter.
As mentioned, I have no access to this server me in the UK and the system in France with no outside country access so only get to hear about performance issues post incident.
I am gathering some stats on queries this afternoon, also, the client application that hooks into our instance is Java based and uses our procs but exec sp_prepexec them?
Thanks again
All you mentioned is bad, sorry to say. I am EXCEPTIONALLY good at performance tuning SQL Server - finding and making problems go away. But often even I cannot fix things without some forensic information WHILE THE EVENT IS HAPPENING. It is just silly to be reactive and not be able to access the box you are supposed to be tuning. And Java apps are just horrible. I have never, in 20+ years of consulting on SQL Server, come across one that wasn't just awful. In all fairness I don't get called in for routine ops, and thus pretty much all applications I come across suck. π
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
March 6, 2017 at 10:30 pm
Stephen,
Can you provide queries to the application care takers and ask them to run them right after a problem is reported? If so, you can offer them something like sp_BlitzCache from Brent Ozar or a similar query against the cache to get the most expensive queries. If that data is gathered within a few minutes of the reported problem, then you can work on the problem.
March 7, 2017 at 7:56 am
lmarkum - Monday, March 6, 2017 10:30 PMStephen,Can you provide queries to the application care takers and ask them to run them right after a problem is reported? If so, you can offer them something like sp_BlitzCache from Brent Ozar or a similar query against the cache to get the most expensive queries. If that data is gathered within a few minutes of the reported problem, then you can work on the problem.
It is quite a difficult task to undertake to get our consortium partners to act on this. They definitely would not nor would we run a standard Blitz script on our prod instances.
We are investigating a method to give our first line support based in country to access a more real time analysis method now.
Thanks for all the replies.
March 7, 2017 at 8:09 am
Middle and senior IT management at your company needs to be taken out behind the wood shed and soundly thrashed. Their decisions and failure to act are a root cause of what is going on.
In any case, good luck with it! π
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
Viewing 13 posts - 1 through 12 (of 12 total)
You must be logged in to reply to this topic. Login to reply