Most DBAs grow into their jobs, and part of that process is the baptism by
fire that occurs when things go wrong, especially on the performance side of
things. I started to make this article performance centric, but so many of the steps and
thoughts are common to other problems that I tried to make it a little more
generic. In this article I've listed some tips and thoughts about what to do -
and why - when a sudden performance dip strikes or another serious problem
occurs. It's probably not a perfect list, but the key is to build a process that
keeps you on track.
#1 - It's Usually Blocking
When I get a call that 'everyone is locked up', 'everyone is slow', etc., the
first thing I look for is blocking. There are various ways to check, but the tried and
true technique is to execute 'select * from master..sysprocesses where blocked >
0'. It works on SQL 7 forward and takes about 15 seconds to type in and run, and then
you can decide what action to take. Blocking usually occurs because someone is doing
something bad like bulk loading records during the day. You can find and kill
the blocking spid, but it's worth looking at what the spid is doing first.
If it's been running for three hours you might be better off letting it finish
rather than waiting for it to roll back.
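To put a little more flesh on that one-liner, here's a minimal sketch along the same lines; it lists who is blocked and by whom, then uses DBCC INPUTBUFFER to peek at what the head blocker is running before you decide its fate (the spid number is just an example):

    -- find sessions that are blocked and which spid is blocking them
    select spid, blocked, waittime, lastwaittype, cmd, hostname, loginame
    from master..sysprocesses
    where blocked > 0

    -- once you know the blocking spid (57 here is just an example),
    -- see what it is running before you kill it or let it finish
    dbcc inputbuffer (57)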
#2 - It's Not Blocking, Is It Limited to One Application/Database/Physical
Location?
I'm usually asking this question while I'm headed to a machine to work on
step #1. If it's NOT limited to any of those things, I can probably rule out recent
application deployments, recent proc changes, and network problems to a WAN site. If
it is one physical location, I immediately get an email out to the network team
to have them check all their hardware. Sometimes a router will blip, or maybe
they just rebooted it and forgot to tell everyone. In practice it's usually
not the network, but sometimes it is! If it's just one application and/or
database, I'm immediately thinking 'what has changed?' If you know how to do
wait state analysis, it's a good technique to apply here, but it
takes time and you can often shortcut it with the right questions.
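As a rough sketch of what a quick wait state check can look like, the query below assumes SQL 2005 or later (on older versions DBCC SQLPERF(WAITSTATS) gives a similar picture); the excluded wait types are just a few common idle waits, not a definitive list:

    -- top waits since the counters were last cleared; the excluded wait types
    -- are common idle waits, not an exhaustive list
    select top 10 wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    from sys.dm_os_wait_stats
    where wait_type not in ('LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
        'SLEEP_TASK', 'WAITFOR', 'BROKER_TASK_STOP')
    order by wait_time_ms desc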
#3 - If It's One Application, Can They Reproduce the Issue?
This is key. If they can show you it's always slow when they add a customer,
look up an order, etc., you can quickly zoom in by profiling and filtering to a
single host name. That will let you see exactly what is being executed and look
for something with a long duration - long being subjective, but anything close
to 30 seconds is a good candidate for an immediate look, and anything down to
5 seconds might need to be checked. Ideally you can refer back to a
baseline here so you know what those procedures usually cost. Application
problems can usually be traced back to a change (app code or database), or to SQL
generating a new plan for a query that isn't performing well, or both. In the
worst case, get a developer to step through the code against the production
database, because it's not always a SQL problem. Check whatever error logging
system the application uses and the event log of the machine they are using it
on.
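Profiler filtered to the host name is still my first choice, but on SQL 2005 or later you can also glance at what that one workstation is running right now from a query window. This is just a sketch - the host name is made up, and you'd substitute the one the user gives you:

    -- what is this one workstation executing right now, and how long has it been running?
    select s.host_name, s.login_name, r.start_time, r.status, r.wait_type,
           substring(t.text, 1, 200) as current_sql
    from sys.dm_exec_requests r
    join sys.dm_exec_sessions s on s.session_id = r.session_id
    cross apply sys.dm_exec_sql_text(r.sql_handle) t
    where s.host_name = 'WKSTN042'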
#4 - It's One Application, But They Cannot Reproduce on Demand
These can be miserable to solve. I usually start by running separate
Profiler sessions on a couple of users I can trust to call or email when the
problem happens and to tell me a lot about what they were doing at
the time. I've seen some of these solved by the low tech technique of using a
camcorder to capture everything that happened; more recently I've used
Morae to go back in time
and see what happened. Both of those are more often needed for code issues rather
than SQL issues, but sometimes you're just looking for clues. If you can get the
users to log when it happens, you can try to match it up in Profiler; the hard part
is that the query plans may be fine while the queries are being blocked or bogged down by
something happening elsewhere on the server. So you have to keep a deep focus on
the user AND look at the server holistically.
#5 - Sometimes It's the Data
This might happen because a lot of data has been added and the statistics
no longer match the current state of the data, or just because in some cases there
is a lot more data than the developer expected. I dealt with a problem like this
once where IT users had not flipped a bit in a table that allowed users to cache
the results of an expensive query on the client; as a result they kept running
that query over and over and over, killing performance. We flipped the bit and
magically all was well!
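If you suspect stale statistics, here's a quick sketch of the checks I'd run against a suspect table - the Orders table is hypothetical, and FULLSCAN may be overkill on a very large table:

    -- when were the statistics on each index of this (hypothetical) table last updated?
    select name, stats_date(id, indid) as stats_last_updated
    from sysindexes
    where id = object_id('dbo.Orders')
      and indid > 0

    -- if they're stale relative to how fast the data changes, refresh them
    update statistics dbo.Orders with fullscan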
#6 - Rarely Is It the Hardware
Perhaps most common in this category is a drive in a RAID set failing and no
one noticing, resulting in a performance decrease because parity is being used
to rebuild the missing data on the fly. I've also seen more than one incident
where 'the server is down' because someone doing work in the server room
unplugged or knocked loose the wrong power or network cable. For the most part
the hardware will either work or it won't. It doesn't hurt to have someone check, but
unless the server is actually unavailable this should be further down the list.
#7 - It's Usually Because Someone Changed Something!
I bet we all have had a few of these happen. Having a good change control
process that is logged can really help when you're under stress. Something as
minor as deploying a proc change that added a column or an ORDER BY can have
a huge performance impact if the query is no longer covered. Sometimes the rules
change after a hotfix (less often) or a service pack (more often), and while
hopefully the changes are good, sometimes they aren't. Larger companies may be
applying patches on a regular basis, and after a while you start to ignore them -
until something goes wrong. I'll also add a plug here for rebooting one extra
time after applying a service pack, because I want to be absolutely sure there
are no pending changes waiting to be made live on the next reboot, which might be
months away - who would think of the last service pack as being the issue when it
was applied three months ago?
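When the change log is missing or suspect, a query like this sketch (SQL 2005 or later) at least shows which procedures, views, and functions were modified recently, which is a decent starting point for the 'what changed?' conversation:

    -- objects modified in the last seven days - a starting point for 'what changed?'
    select name, type_desc, create_date, modify_date
    from sys.objects
    where type in ('P', 'V', 'FN', 'IF', 'TF')
      and modify_date > dateadd(day, -7, getdate())
    order by modify_date desc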
#8 - It's Incredibly Important to Communicate
As soon as you know an incident is going to last more than a few minutes,
someone needs to get the word to operations. Work may need to be redirected to
other sites, hourly employees put on break or sent for an hour of required but
not all that important training, or in worse cases sent home for the day. Even
smaller companies may have hundreds of employees or customers affected, and the cost
per hour of downtime can be significant. Tell operations what you know via your
boss, and let them decide. It's also important to immediately communicate problems
to everyone in IT. A lot of times you'll get an immediate reply back from
someone who was bulk loading data, rebooting a switch, etc., and that will
immediately explain the problem. You also will need people to help you chase
down leads so that you can stay focused, and someone to make sure you get
something to eat and drink while you work.
#9 - Know When to Ask for Help
I think the more experienced you are, the easier this is to do. If you're a
brand new DBA you hate to get the credit card from your boss to call PSS, or to ask
for a consultant to come in, because you don't want to look inept. There is absolutely
a risk to that, but on the other hand, once an incident runs beyond a few hours the
cost of either of those pales in comparison. Try setting the stage before the
incident by having one call to PSS pre-approved, and consider establishing a
relationship with a SQL consultant just in case. And it doesn't have to just be
PSS; lots of good, challenging problems get posted in the forums here and on
MSDN every day, and the people who answer them like a good challenge and the
learning it brings. Lean on your peers too, whether within the company or
someone you met at the local user group. With experience you get more
comfortable with the idea that you know what you need to know, you know what you
don't know, and that it is not a sign of weakness to ask for help. Sometimes
that's easier to say than to do!
#10 - Learn Your Lesson
I bet easily 8 out of 10 times, when I'm finally done resolving a performance
problem, I can look back and see that I didn't ask the right question,
didn't really listen to an answer, took off following the first lead rather than
being methodical, or skipped any of a bunch of other things I should have done - in
hindsight. This isn't about beating yourself up or holding yourself to some
incredibly high standard, but rather stepping past our pride enough to really
learn a lesson.
#Extra Credit - The Challenge
It's been a while since one was posted, but Steve and I had a lot of fun
writing about many of the near disasters we've been involved in, often of our own
making. So my challenge to you is to post your own performance issue checklist, or
tell us about the problem you should have solved sooner and why. Yeah, we may
smile a bit when we read them, but I bet more of us smile because we've done the
same thing, or because we know that's one trap we won't fall into. As always, I
hope you've learned something or sparked an idea or two, and I look forward to an
interesting discussion!