Taking a Medical Approach to Performance Troubleshooting
I find it difficult sometimes to explain to someone how to have a take a good approach to performance troubleshooting. I had the stomach flu a few weeks ago, and the same day I visited my doctor, there were many people trying to help someone online with SQL performance issues. A few of us offered sound advice for how to diagnose the issue and find the root cause, but the majority of posters offered up “quick fixes” that may or may not be related to his issue in any way.
I realized at this point, that having a good approach to troubleshooting is a lot like the methodologies used by my doctor. Unfortunately, a lot of people take the approach that Hugh Laurie’s character does on the television show House M.D. They offer up potential solutions without knowing what the core problem is hoping that it fixes it. Imagine if you told your doctor that you had a sore throat and he gave you some assorted pills and said to take the blue ones for a day and if you don't feel better, take the red ones the following day. Hopefully, you'd go to a different doctor, one that would figure out what was wrong with you and treat it appropriately. More on that approach later.
Let’s Get Physical – The Baseline
What’s the first thing they do in the doctor’s office when you see a doctor for the first time? Simple, they check your vital signs. They get your weight, height, blood pressure, pulse, temperature, and so on. Then what do they do when you come back on subsequent events? They check it again. Why do they do this?
The doctor is doing a lot more than just trying to justify the high cost of a doctor visit. They are getting a baseline of your health and then they are validating that baseline to see how your health is changing. In many ways, this baseline and checks system may help them discover health issues that you aren’t even aware of yet. When I went in to see me doctor for the stomach flu, my blood pressure was a little higher than normal but not enough to be concerned about. The slight elevation was probably due to my illness. If the next time I go, it is even higher or remains higher without the illness, then the doctor might get concerned. This is why they advise you to do regular physicals.
A regular physical will tell them what my vital signs look like when I am healthy and there are no extenuating circumstances affecting the numbers. Body temperature is a great statistic to use here. The normal temperature for a person is considered to be 98.6 degrees Fahrenheit (37.0 degrees Celsius). In actuality, the normal temperature is usually in the range of 97 to 100 degrees Fahrenheit (36.1 to 37.8 degrees Celsius) according to The Physician’s factbook. So if I’m normally at the high end of the scale, and I come in with a temperature of 100 degrees, would the doctor say that I have a fever or not? Unless he’s done a baseline, he’d have no way to know if I was normal or running high. He’ll look at past visits and physicals, if there were any, to see how I measure up.
You need to do the same thing with your SQL Servers. So many times, all we hear are complaints that something is slow. How do you fix a problem like “slow”? You need to know what’s performing differently when it is slow as compared to how it normally runs. I don’t have enough fingers and toes to count how many things can cause a SQL Server to run slow. Heck, I’d need more hands just to count the things that could make disk I/O slow.
When somebody comes to me for advice saying that their server is running slow, I’ll ask them what has changed. Usually, they start talking about code changes. What I mean is what measurable statistic is different than normal. If someone says that full scans/second is high, I’ll ask what the normal scan rate is. Generally the answer is, “I don’t know.” If you don’t know what the normal scan rate is, how do you know it is high? Generally, when someone asks about a specific counter, it is because they read somewhere that the counter shouldn’t be above a certain level. For example, there are lots of posts out there will tell you that full scans/second should be 0. That would be nice, but it’s not realistic. I’d be alarmed if the scan rate on one of my servers was 0 because that would mean that no users were able to hit the server.
I knew Gregory House, and you sir are no Gregory House
For those of you that have watched the show, how many times have we heard House and his team decide to give the patient the prescribed treatment for something to see if that is what he or she has. Something like, “The test takes 8 hours, and he’ll be dead in 6. Give him the antibiotics, and if he gets better, we’ll know it was auto-immune.”
Now let’s get back to the person online that many of us were trying to help. Lots of people threw in suggestions like “increase the number of tempDB data files” or “make sure the data and log files are on separate drives.” Most of the suggestions did not actually fit the symptoms he described at all. Basically, they were saying to give the server some antibiotics and see if it gets better. Using this approach, you might get lucky and hit on something before the server dies. But unless you are the greatest diagnostician in the world, like Gregory House, what are your chances of guessing the right fix without knowing the root problem before the server dies? And by “dies,” I mean before you have a dozen managers standing in your office asking when it will be fixed and reminding you to use the new cover sheets when you send them the root cause analysis (before the end of the day).
Repeat As Necessary
Okay, so I didn’t simply choose full scans/second randomly, we had an issue last year where after a code release, our average full scans/second went from 2000 to 5000. I created a bug, and one of the dev. members on the sustained engineering team worked with me to troubleshoot the issue. In the end, we determined that there was no net performance hit from the increase scan count, and it was largely due to frequent, short queries that looked up values from small static tables. The key point to take away here is that we discovered the change in the system because we knew what the normal scan rate was and we detected that it had changed. We even had a nice pretty graph showing that it suddenly increased after a major release instead of slowly ramping up to the new rate.
So this was not a symptom of a performance problem. It was a long-term change in the health stats of the system. Now I know that it is normal and healthy for the system to have a scan rate of 5000 full scans/second. A considerable drop or increase in the scan rate would be a red flag for me, especially if there was not an extenuating circumstance involved like a code release. This series of validating baselines has led us to find lots of code issues before they are noticed by end users. We have brought our average CPU utilization from 65-70% to an average of 15 to 20%.
So just like the doctor does, you should regularly re-run your baseline to look for changes. If something has changed, up or down, then you need to determine why. You need to determine if it is something that needs to be fixed or just a general change in the healthy state of your system.
What are Your Symptoms?
So what does your doctor do next after checking your vitals? He wants to know your symptoms. What hurts? Unfortunately, your SQL Server can’t come right out and tell you that its I/O is slow or that its tempDB allocation files are “cramping.” You have to poke and prod SQL to find the symptoms because throwing “treatments” at a server randomly is reckless and may even mask the real problem and make it appear to be fixed when it’s only been relieved for the time being.
That doesn’t mean that I’ll never do something to temporarily relieve a problem. For example, the online person we were trying to help said that performance works fine for a few weeks and then it starts to run slowly until they reboot it. So rebooting the server temporarily fixes the real problem. If I came into this situation and it is a critical production server that is falling completely over, I’d reboot it in a heartbeat. I’d also know that I’ve got 2 or 3 weeks to get to the root of the problem. This means that I’d be polling DMV’s, running traces, and setting up alerts to make sure I notice when certain events occur.
WWMDD?
I could sum up this approach as WWMDD? Or What Would My Doctor Do? Your doctor would maintain a baseline of your vital statistics and when needed, run diagnostic tests to find the root problem and prescribe treatment specific to the problem. Don't throw random potential fixes at your SQL Server. Do the footwork and know your server. When a problem arises, diagnose the root problem, and treat it appropriately.