December 16, 2008 at 8:14 pm
I apologize in advance for the length of the story, but it is the only way I can tell what happened.
On Thursday of last week, our largest customer was running some stress tests for a new infomercial they are getting ready to run. Nothing strenuous: 5 concurrent create customer/create order runs for 50 new customers. During this run, approximately 25% of the attempts deadlocked. 🙁 Now, as my company really really really likes to nest procs (12 deep in some places), I decided that I was going to run a profiler trace.
So I set up a trace with the following events:
Deadlock Graph
Deadlock Chain
RPC:Completed
SP:Completed
SQL:BatchStarting
SQL:BatchCompleted
My columns are pretty standard:
DBName, DBID, EventSequence, TextData, Duration, StartTime/EndTime, ObjectName, ObjectID, ObjectID2, and TransactionID (I might be forgetting one, but it was last week).
Of course I set up the trace on another machine and had it write to a file on my personal computer, so there was as little performance hit as possible on the server.
Had the customer run the process again and BOOM, deadlocks. The trace captured it; all is fine and dandy. Followed the chain and figured out what was going on. Got about 15K rows returned, which seemed like a lot, but I got the data I wanted, and who was I to complain? I found the issue and fixed it.
Of course, just to verify, I set up the same trace and had the customer re-run the process. Good news: no more deadlock issues. Bad news: we are now getting timeout issues, with approximately the same number of rows returned. Time to start optimizing. Took queries that had durations of 120-130 seconds and dropped them to below 30 seconds (in answer to your question: no, we don't have baselines, which I have been arguing for; now I have a little more in my pocket).
Skip forward a couple of days to last night. My manager restored the database from the 32-bit machine it was on and stuck it on a 64-bit machine just for testing. Of course everything went through great on the one run they did last night. Switched them back to the 32-bit machine and everything ran perfectly as well. No timeouts, no deadlocks, longest duration 10-15 seconds. Sweet, progress. Maybe it was just because the production database (which is located on the same server as the staging database 🙁 ) had nothing going through it. At this point they said they were going to do some more testing on the old machine, and I got called to set up a trace and see if we could figure out why things were going bad. I set it up and went to watch some TV at 8 p.m. 11 p.m. rolls around and BOOM, we start getting script errors out the rear end and have to restart the instance. Called it a night.
Today, we started testing on the 64-bit machine. Set up the trace, and everything that ran the night before failed due to timeouts. WTF! We finally realized it was failing due to the trace. Cut the trace down to just SP:Completed and RPC:Completed and everything ran like a street luger on a straight downhill slope.
Now, my questions:
1) Why did the trace cause the timeout issues on both the 32- and 64-bit servers? I know that a profiler trace can cause some performance degradation, but I have never seen/heard of anything this bad.
2) When I set up the trace last night, before restarting the server I checked the files on my hard drive and had almost 400 MB worth of files from about 3 hours of actual tracing. While I can see how this would slow down the network, what I am trying to determine is why it would shut down SQL Server. Specifically, what happens when a profiler trace is collecting so much information that it bottlenecks on the server side? Does the data get stored in a temporary trace file in the instance, or what?
3) What could have caused the script errors that eventually led to the instance shutting down and needing to be restarted?
Let me know if there is anything that you need clarification on.
By the way, without running a trace, they were able to run 15 concurrent create customer/order processes for 1000 orders with everything completing in less than 2 seconds.
Thanks,
Fraggle.
December 16, 2008 at 11:54 pm
Fraggle (12/16/2008)
Now, my questions: 1) Why did the trace cause the timeout issues on both the 32- and 64-bit servers? I know that a profiler trace can cause some performance degradation, but I have never seen/heard of anything this bad.
Because using the profiler GUI across a network is the second slowest way to trace a server. If the server's already having problems, it's going to make matters much worse.
You're not the first to kill a production server with profiler.
For deadlocks, use trace flag 1222 to get the deadlock graph in the error log; no trace necessary. If you're tracing a busy production server, use a server-side trace rather than the Profiler GUI, and write the output to a fast drive on the server that doesn't hold the data or log files.
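Gail's trace flag suggestion looks like this; a minimal sketch, run once on the instance (the -1 makes it global, and adding -T1222 as a startup parameter would make it survive restarts):

```sql
-- Enable trace flag 1222: writes deadlock graph details to the error log.
-- -1 applies it globally, to all sessions.
DBCC TRACEON (1222, -1);

-- Verify which trace flags are currently active.
DBCC TRACESTATUS (-1);
```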
You can use Profiler to generate the script for the trace, then just run it through Management Studio after making the required changes.
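As an illustration of the kind of script Profiler exports, here's a minimal server-side trace sketch capturing only RPC:Completed (event 10) and SQL:BatchCompleted (event 12). The file path is a placeholder; you'd point it at a fast local drive, as Gail says:

```sql
-- Minimal server-side trace: RPC:Completed (10) and SQL:BatchCompleted (12).
-- N'E:\Traces\MyTrace' is a placeholder path; SQL Server appends .trc itself.
DECLARE @TraceID int;
DECLARE @on bit;
DECLARE @maxfilesize bigint;
SET @on = 1;
SET @maxfilesize = 50;  -- MB per file

-- Option 2 = TRACE_FILE_ROLLOVER: start a new file when the max size is hit.
EXEC sp_trace_create @TraceID OUTPUT, 2, N'E:\Traces\MyTrace', @maxfilesize;

-- Add a few standard columns to each event:
-- 1 = TextData, 12 = SPID, 13 = Duration, 14 = StartTime.
EXEC sp_trace_setevent @TraceID, 10, 1, @on;
EXEC sp_trace_setevent @TraceID, 10, 12, @on;
EXEC sp_trace_setevent @TraceID, 10, 13, @on;
EXEC sp_trace_setevent @TraceID, 10, 14, @on;
EXEC sp_trace_setevent @TraceID, 12, 1, @on;
EXEC sp_trace_setevent @TraceID, 12, 12, @on;
EXEC sp_trace_setevent @TraceID, 12, 13, @on;
EXEC sp_trace_setevent @TraceID, 12, 14, @on;

-- 1 = start the trace; later use 0 to stop it and 2 to close and delete it.
EXEC sp_trace_setstatus @TraceID, 1;
SELECT @TraceID;  -- keep this ID so you can stop the trace afterwards
```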
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 17, 2008 at 6:12 am
To add to what Gail says, I'd also suggest that when you're not troubleshooting deadlocks, capturing the Starting events is unnecessary. Usually RPC:Completed and SQL:BatchCompleted cover just about everything you could possibly need. I also didn't hear you mention filtering. You should make sure that you filter out as much as you can, so that you only capture the events you are interested in. You can capture wait states in Performance Monitor to see if external processes are causing slowdowns in the queries you're interested in, and expand the pursuit from there as necessary. Otherwise, keep the data collection as narrow as possible. Once you've done that, you should be able to run a trace on production 24/7 with almost no impact. We do it all the time, just as Gail described.
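The filtering Grant mentions is done with sp_trace_setfilter on a server-side trace; a sketch, assuming a trace already created with sp_trace_create and currently stopped (filters can only be set while the trace status is 0). The trace ID and database name here are placeholders:

```sql
-- Restrict an existing (stopped) trace to a single database.
-- Column 3 = DatabaseID; logical operator 0 = AND; comparison 0 = equal.
DECLARE @TraceID int;
DECLARE @dbid int;
SET @TraceID = 2;                     -- placeholder: your trace's ID
SET @dbid = DB_ID(N'YourDatabase');   -- placeholder database name
EXEC sp_trace_setfilter @TraceID, 3, 0, 0, @dbid;

-- Filter out short calls: column 13 = Duration, comparison 2 = greater than.
-- On SQL 2005, trace Duration is in microseconds, so this is > 1 second.
DECLARE @duration bigint;
SET @duration = 1000000;
EXEC sp_trace_setfilter @TraceID, 13, 0, 2, @duration;
```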
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 17, 2008 at 7:20 am
Thank you both for the reply.
Gail, are there any articles you would recommend on using server-side traces?
Grant, you are correct, I did fail to mention it. I had the filter on for only one of the three databases on the instance. Also, does taking out BatchStarting really make that much of a difference? Is it similar to running an execution plan on a query, in that it just takes longer to execute the batch due to having to generate the BatchStarting row?
Lastly, can either of you explain, or point me to, a more technical understanding of traces in regards to what type of resources they use and what happens when you start outputting so much information that it starts slowing down the server?
Thanks for all the help.
Fraggle
December 17, 2008 at 7:33 am
Fraggle (12/17/2008)
Thank you both for the reply. Gail, are there any articles you would recommend on using server-side traces?
Grant, you are correct, I did fail to mention it. I had the filter on for only one of the three databases on the instance. Also, does taking out BatchStarting really make that much of a difference? Is it similar to running an execution plan on a query, in that it just takes longer to execute the batch due to having to generate the BatchStarting row?
Lastly, can either of you explain, or point me to, a more technical understanding of traces in regards to what type of resources they use and what happens when you start outputting so much information that it starts slowing down the server?
Thanks for all the help.
Fraggle
Reducing the number of events collected reduces the overhead. No, it's not necessarily going to make all the difference in the world, but reducing the load is reducing the load.
But no, the trace events don't, in and of themselves, cause execution to slow. Rather, especially because you were using the GUI, the act of processing the trace events slowed down the server, which slowed down the queries. I think you could run the exact trace you outlined as a server-side trace, as Gail described, and not even notice that it was running. I'm just a bit conservative when it comes to production systems.
As to a precise explanation of everything involved in the trace... I'm going to defer, by and large, to Gail. She's much more knowledgeable about the lower-level workings than I am.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 17, 2008 at 8:26 am
Fraggle (12/17/2008)
Gail, are there any articles you would recommend on using server side traces?
Beyond Books Online, I'm not aware of any offhand. Google may be able to find you some.
Lastly, can either of you explain, or point me to, a more technical understanding of traces in regards to what type of resources they use and what happens when you start outputting so much information that it starts slowing down the server?
Tracing is one of the areas of the internals I don't understand all that well, hence this is going to be a little vague and may be inaccurate in places.
When there's a trace running, every event has to be checked to see if it needs to be added to the trace buffer. That's not a huge overhead; on SQL 2005 there's a trace (the default trace) running permanently, so that part by itself is not a worry.
If the event does have to be traced, the details get added to the trace buffer. If that buffer is full, then the event has to wait for space. Space in the buffer is freed either when the data gets to the trace destination or, if it's allowed to, when SQL discards the contents of the buffer because things are getting too far behind.
As an aside, that's what the checkbox in SQL Profiler ("Server processes trace data") is for. If that's checked, the server's not allowed to drop events.
As you can imagine, if the destination for that buffer is a high-speed one (like a fast local drive), then there are no worries. If, however, the destination is SQL Profiler, then every event has to be sent across the network and acknowledged before it can be discarded from the buffer. Or, for the absolute worst way to trace - SQL Profiler writing to a table - every event has to be sent across the network, then written into a (possibly remote) database, and then acknowledged.
Hence the general recommendation - profiler for dev/test servers, server-side trace for production servers.
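For what it's worth, you can watch this queueing behaviour happen: the sys.traces catalog view (SQL 2005 and later) exposes the buffers and the drop count for each running trace:

```sql
-- List running traces: is_rowset = 1 means a Profiler/SMO (rowset) consumer.
-- dropped_event_count shows how many events the server had to discard.
SELECT id,
       path,                 -- NULL for rowset (Profiler) traces
       is_rowset,
       buffer_count,
       buffer_size,
       event_count,
       dropped_event_count
FROM sys.traces;
```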
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 17, 2008 at 8:39 am
For the very little it's worth, that's pretty much in line with my understanding of the process.
The only point I'm not sure about is the buffer when you're using a server-side trace. It was my understanding (and I couldn't find what it was based on if you paid me) that when running a server-side trace to a file, you don't have to worry about the buffer rejecting transactions. Now, was that because someone decided that, just because you're running to a file, performance will be OK, or because it somehow operates differently? I don't know.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 17, 2008 at 8:46 am
GilaMonster (12/17/2008)
Fraggle (12/17/2008)
Gail, are there any articles you would recommend on using server-side traces?
Beyond Books Online, I'm not aware of any offhand. Google may be able to find you some.
Lastly, can either of you explain, or point me to, a more technical understanding of traces in regards to what type of resources they use and what happens when you start outputting so much information that it starts slowing down the server?
Tracing is one of the areas of the internals I don't understand all that well, hence this is going to be a little vague and may be inaccurate in places.
When there's a trace running, every event has to be checked to see if it needs to be added to the trace buffer. That's not a huge overhead; on SQL 2005 there's a trace (the default trace) running permanently, so that part by itself is not a worry.
If the event does have to be traced, the details get added to the trace buffer. If that buffer is full, then the event has to wait for space. Space in the buffer is freed either when the data gets to the trace destination or, if it's allowed to, when SQL discards the contents of the buffer because things are getting too far behind.
As an aside, that's what the checkbox in SQL Profiler ("Server processes trace data") is for. If that's checked, the server's not allowed to drop events.
As you can imagine, if the destination for that buffer is a high-speed one (like a fast local drive), then there are no worries. If, however, the destination is SQL Profiler, then every event has to be sent across the network and acknowledged before it can be discarded from the buffer. Or, for the absolute worst way to trace - SQL Profiler writing to a table - every event has to be sent across the network, then written into a (possibly remote) database, and then acknowledged.
Hence the general recommendation - profiler for dev/test servers, server-side trace for production servers.
Just a note on the "Server processes trace data" checkbox: this actually adds to the load on the server, because it starts a second, server-side trace in addition to the Profiler GUI trace.
Inside SQL Server 2005 Performance Tuning and Query Optimization has an excellent chapter on Profiler/Trace. I believe that chapter was written by Adam Machanic.
Also, there is some evidence that many filters can degrade server performance as well. Linchi Shea has a blog post where he did performance testing on GUI vs. server-side tracing, and there is a comment on it that mentions this.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
December 17, 2008 at 8:51 am
Grant Fritchey (12/17/2008)
It was my understanding (and I couldn't find what it was based on if you paid me) that when running a server side trace to file, you don't have to worry about the buffer rejecting transactions. Now, was that because someone decided that just because you're running to file performance will be OK or because it somehow operates differently? I don't know.
Offhand, no idea.
Jack Corbett (12/17/2008)
Inside SQL Server 2005 Performance Tuning and Query Optimization has an excellent chapter on Profiler/Trace. I believe that chapter was written by Adam Machanic.
Excellent. I haven't got that far through the book yet.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
December 17, 2008 at 8:59 am
GilaMonster (12/17/2008)
Grant Fritchey (12/17/2008)
It was my understanding (and I couldn't find what it was based on if you paid me) that when running a server-side trace to a file, you don't have to worry about the buffer rejecting transactions. Now, was that because someone decided that, just because you're running to a file, performance will be OK, or because it somehow operates differently? I don't know.
Offhand, no idea.
I don't know either; I just know what BOL states. There are two providers: the rowset provider, which SMO/Profiler uses, and the file provider, which is used on the server side and apparently is more efficient. I don't remember where I read this, but the file provider guarantees no loss by queueing the threads that are writing to the file, so it has the trace queue in SQL Server and then the queue of threads writing to the file, while the rowset provider only has the trace queue.
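You can actually see which provider a running trace is using from sys.traces: a rowset (Profiler/SMO) trace shows a NULL path and a reader_spid for the consuming session, while a file-provider trace shows the target file:

```sql
-- Distinguish the two trace providers on a running instance.
SELECT id,
       CASE WHEN is_rowset = 1 THEN 'rowset (Profiler/SMO)'
            ELSE 'file' END AS provider,
       reader_spid,   -- session reading a rowset trace; NULL for file traces
       path           -- target .trc file; NULL for rowset traces
FROM sys.traces;
```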
Oh, and I am working on some articles on Profiler and tracing. Probably three, with the second or third one being on server-side traces.
Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question
December 17, 2008 at 9:03 am
Jack Corbett (12/17/2008)
GilaMonster (12/17/2008)
Grant Fritchey (12/17/2008)
It was my understanding (and I couldn't find what it was based on if you paid me) that when running a server-side trace to a file, you don't have to worry about the buffer rejecting transactions. Now, was that because someone decided that, just because you're running to a file, performance will be OK, or because it somehow operates differently? I don't know.
Offhand, no idea.
I don't know either; I just know what BOL states. There are two providers: the rowset provider, which SMO/Profiler uses, and the file provider, which is used on the server side and apparently is more efficient. I don't remember where I read this, but the file provider guarantees no loss by queueing the threads that are writing to the file, so it has the trace queue in SQL Server and then the queue of threads writing to the file, while the rowset provider only has the trace queue.
Oh, and I am working on some articles on Profiler and tracing. Probably three, with the second or third one being on server-side traces.
And there it is. If I'd taken the trouble to read BOL... There's a file provider that is different in construct. I knew I'd read that somewhere. Cool. Can't wait for the articles.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 17, 2008 at 9:32 pm
Grant, Gail, and Jack,
Thank you all for the great explanations. It was great to have some type of explanation to give my boss when he asked about it.
Jack, I really look forward to the articles as we do this frequently.
Fraggle
December 18, 2008 at 5:43 am
I was just reading up on this again last night. According to BOL, there is a queue, and then the SMO or file reader takes events from the queue. I know the queue can be overloaded, so I assume it's up to the reading service, SMO or file, to make sure it pulls fast enough to prevent that. From where I sit, there's still a bit more to learn here.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 18, 2008 at 6:23 am
I was doing a bunch of research on this because it was making me crazy that I didn't know the answer. I finally posted what I think is the answer on my blog. I sure hope that's right, because I'm tired of reading about trace & Profiler at this point.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
December 18, 2008 at 6:34 am
Grant Fritchey (12/18/2008)
From where I sit, there's still a bit more to learn here.
I think this is one of the areas of the internals where the more info you get, the more questions you have.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability