Back in July 2002 I wrote Disaster
In The Real World - #2, my real-life adventure about a server going down.
Now, about six months later, I almost got to relive the whole thing again! Read
on to hear how things can go from bad to worse to OK and back to bad again.
Thursday afternoon I was sick with a cold, terrible headache, called it a day
about 3pm. Two minutes from home, Sean
calls my cell phone: the data drive on the reporting server is missing. I turn
around and go back to the office and check in with our network dude (ND) to see
what is going on. The drive dropped for no apparent reason, the server itself is ok,
and he checks the status of the SCSI containers. You know the one that usually says 'OK'? Now
it says 'DEAD'. Great. Another massive restore that will take most of the night.
Knowing that we can do the restore in time to be up the next day, we decide to
work the problem for a while, the difference this time being that we can see the
container even though it's dead; last time the container was just gone.
Three hours later, still on the phone. After various bits of voodoo,
including a time when it stopped even booting to the OS on C, they get the
container to change status to 'Scrubbing', and we decide we need to replace the
controller card. I go home to get some sleep while we wait on the new card. At 11pm
our ND calls me: he has the card, but it will be morning before the drive scrub is
complete, so he will replace the card then. Sounds better than staying up all
night, so we decide to regroup at the office at 5 am.
Next morning we replace the card, drive goes back to scrubbing again. Not
what we planned. We boot to Windows since we're out of time; performance starts out
bad and gets worse. There's no utility installed that would let us check the status
of the scrub or change its priority. Reporting, time-and-attendance stuff, and one
fairly critical app reside on that server, plus the db where we log both errors and
informational messages from our main app. By 9:30 they are screaming, our main app
is crawling. The reason is that we log both text and a screenshot to aid in
diagnostics, and the inserts are taking 45 seconds or longer. After a couple minutes'
thought, I decide to change the password on the error logging login. The component is
built so that if the connection fails, it writes the data out locally and tries to
post it the next time the app starts up. That gives us a pretty good boost. Now
everyone wants to know when the other applications will be usable; we tell them 5pm
to err on the worst-case side, though we're figuring we'll be good by 1pm.
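For what it's worth, the lockout itself is a one-liner on SQL Server 2000; something
along these lines, where 'ErrorLogger' and the passwords are just stand-ins for the
real values:

-- Lock the error-logging component out by changing its login's password.
EXEC sp_password @old = NULL, @new = 'temporary!lockout', @loginame = 'ErrorLogger'

-- Once the disk settles down, change it back and the apps post their
-- locally queued entries the next time they start up.
EXEC sp_password @old = 'temporary!lockout', @new = 'original_password', @loginame = 'ErrorLogger'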
Drive time settles back down to normal about 8:30 am the next day (Sat). I
email that everyone can resume normal operations and turn on some processes I had
disabled. I run a DBCC check on everything, no problems, set the backup to run later
that night, and log off.
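'A DBCC on everything' is nothing fancier than looping over the databases; a minimal
version (sp_MSforeachdb is undocumented, but it's been around forever):

-- Run a consistency check against every database on the instance.
EXEC sp_MSforeachdb 'PRINT ''Checking ?''; DBCC CHECKDB(''?'')'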
Early Monday everything looks ok initially, then we see drive times way up.
Further investigation reveals the backup finished, but it took about five times as
long as usual. Bad. Our agents go on break around 10 am, so we decide to bring the
server down to upgrade the card BIOS and put the management software on the box
so we can check the scrub status. Everything goes well, we replace the other SCSI card
for the tape drive, reboot, and get an 'Operating system not found' error. We double-
check that everything is connected, seated, etc., reboot, same thing. We call
support again. Break lasts 15 mins, we are way over that already. Company
meeting is scheduled for 11:30, we miss that too, our CIO brings us lunch (great
boss!), and we finally get it to boot...by using a floppy! That doesn't fix the problem,
but at least we're back in business, we can finish troubleshooting after 5pm or
on the weekend. This is about 1:30 pm. We set the scrub priority to low, disk
time is cool, time for a break.
So, a happy ending?
Not quite. SQL won't start, says the master can't be recovered. Not what I
want to hear. Thinking wishful thoughts, I get something to drink, give it 10
minutes, try again. Same thing. Ok, I have the previous night's backup on disk, so I
restore the master. 20 of the 200+ dbs are corrupt, apparently the result of a bad
shutdown with the caching controller. Almost all of them are replicated copies that
aren't immediately critical and that I can fix by just dropping the db, putting
it back, and doing a new snapshot. The other five I have to check on to see how badly
restoring from the previous night will hurt us. It turns out ok, so I restore those
five first, then work on the other ones. By 6pm I've got most everything working
again, performance is still good.
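For anyone who hasn't had the pleasure, restoring master is a little different from a
normal restore because the instance has to be started in single-user mode first;
roughly like this, with the backup path obviously a placeholder:

-- Start the instance in single-user mode first (run sqlservr.exe -c -m from the
-- Binn directory, or add -m as a startup parameter), then from the one connection:
RESTORE DATABASE master
FROM DISK = 'D:\SQLBackups\master_full.bak'  -- placeholder path to the previous night's backup
WITH REPLACE
-- SQL Server shuts itself down once master is restored; restart the service
-- normally and then check msdb, model, and the user databases.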
Along the way we decide that one db is too important to risk additional downtime,
so I detach it and copy it to our main server. Because it's a replicated copy, I
debate whether to snapshot it across the 256 link to the publisher, or try to
repoint the subscription to the new server. I'm thinking the latter will work
since it uses an alias on the publisher, but I'm short on time and it's better to be
slow than wrong. I start the snapshot, and users continue to hit the db on the
reporting server while the snapshot is in progress. A couple of hours later the
snapshot finishes, and we do the switchover to point the web app to the new server.
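The detach-and-copy is the easy part; on SQL 2000 it's something along these lines,
with the database name and file paths made up for the example:

-- On the reporting server: detach so the files can be copied.
EXEC sp_detach_db @dbname = 'ImportantDB'

-- Copy ImportantDB.mdf and ImportantDB_log.ldf to the main server, then attach there:
EXEC sp_attach_db @dbname = 'ImportantDB',
    @filename1 = 'E:\Data\ImportantDB.mdf',
    @filename2 = 'E:\Data\ImportantDB_log.ldf'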
7 pm, everything is back to normal, and I set another DBCC to run while I head home.
I check in later, no errors reported. Good!
So what did I learn from all this? After the last incident we had (again) requested
money to support clustering the SQL boxes, and that had been approved for next year,
so there was nothing to do but wait. Hardware is usually pretty stable, and we were
using good stuff (as far as we knew anyway); the biggest complaint I had is that
there should be no level 1 techs working server support!
On the application side, we had built our error logging fairly well, so that if the
server was down the apps that used it wouldn't be down as well, but we had not
envisioned a scenario where we would want to disable it wholesale. We're
considering now how to add that feature and whether to add the option to point
it to an alternate server. Beyond that, it got us thinking about what it would
take to make our databases more portable, where we could easily move them from
one server to another if we needed to. Changing the application is pretty easy;
we could make the change and deploy it in 10-15 minutes. The problem is that many of
our internal apps have cross-database dependencies, so moving one to another
server wouldn't work: we'd have to move them all, or set up a linked server and
change some code to make it all work. That's not doable in a hurry, and I'm not sure
it's doable in a workable/maintainable way at all.
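To show why the linked server route means touching code, here's a rough idea of it,
with every server and object name invented for the example: you create the link, then
every reference to a database that moved has to go from a three-part name to a
four-part name.

-- On the server that keeps the dependent apps, link to wherever the moved db now lives.
EXEC sp_addlinkedserver @server = 'REPORTSRV2', @srvproduct = 'SQL Server'

-- A query that assumed both dbs were local, say
--   SELECT * FROM TimeAttendance.dbo.Punches
-- has to become a four-part name once TimeAttendance moves:
SELECT * FROM REPORTSRV2.TimeAttendance.dbo.Punches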
On the SQL side, I need to can the process of restoring the master in case
someone else ever needs to do it, and I need to experiment with replication to
see if changing the subscriber location can be done without redoing the
snapshot.
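For that replication experiment, the pieces I expect to test on SQL 2000 look roughly
like this, run at the publisher against the publication database; the names are
placeholders, and whether the no-initialization option is safe here is exactly what
the experiment has to prove:

-- Drop the subscription that points at the old reporting server.
EXEC sp_dropsubscription @publication = 'ImportantPub',
    @article = 'all', @subscriber = 'REPORTSRV'

-- Re-add it against the new server without generating or applying a snapshot,
-- on the theory that the data is already there from the detach and copy.
EXEC sp_addsubscription @publication = 'ImportantPub',
    @subscriber = 'REPORTSRV2', @destination_db = 'ImportantDB',
    @sync_type = 'none'  -- no initialization; the part that needs testing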
All in all, it could have been much worse! Steve
Jones took the lead here in writing about things that go wrong; now I'm
ahead, 2 incidents to 1. Nothing to brag about of course, just hoping that
publishing our troubles might help you avoid problems at your place. Comments?
Click the link below!