I bet you've all had what I call a mini disaster, something that entirely
disrupts production for a minute to a day or so. Sometimes it's user error, a
bad patch, user error, hardware failure, or user error! These minis are times
that will often test your character. As much as I hate to admit to the ones
that happen to me, I think it's good to talk about them; maybe it will save
someone else some pain someday.
Some percentage of you have a disaster recovery (DR) plan designed to sustain
the business if really bad things happen, such as the data center being
destroyed by a storm. I'd say those plans are slightly better than nice to
have, and some are better than others, but do they cover the smaller things
that might happen? Or should they?
On to the disaster!
A little background first. Few server manufacturers talk to you about
planning for the additional heat load and electrical demand that adding new
hardware requires. Adding electrical might be easy if you have the capacity,
but it gets expensive if they have to run a new supply or put in a dedicated
panel. Same for AC. Adding a small 1U box probably won't affect the AC, but
adding a 30-disk SAN running on 220V will make a huge difference. Adding more
AC is neither simple nor cheap. Something to think about.
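To put rough numbers on the cooling side, here's a minimal sketch; the wattage
figures are guesses for illustration, not measurements of any particular box,
but the conversion factors (about 3.412 BTU/hr per watt, 12,000 BTU/hr per ton
of cooling) are standard.

    # Rough heat-load arithmetic; the wattages below are illustrative guesses.
    WATTS_TO_BTU_PER_HR = 3.412   # 1 watt of load becomes ~3.412 BTU/hr of heat
    BTU_PER_TON = 12000           # 1 "ton" of cooling handles 12,000 BTU/hr

    def cooling_needed(watts):
        btu_hr = watts * WATTS_TO_BTU_PER_HR
        return btu_hr, btu_hr / BTU_PER_TON

    for name, watts in [("small 1U server", 350), ("30-disk SAN", 1500)]:
        btu_hr, tons = cooling_needed(watts)
        print(f"{name}: ~{btu_hr:,.0f} BTU/hr, ~{tons:.2f} tons of cooling")

Run those numbers before the purchase order goes in, not after the new gear
shows up.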
Losing power can ruin your day. I'll go out on a limb and bet that most
servers are on some type of UPS, although it's rare for one to hold the load
for more than 20 minutes. Far fewer sites have a backup generator. Let's say
the building loses power, the UPS kicks in, and you're still taking web hits
or doing whatever your business does. Is the AC on a UPS? Not likely! What
happens to the temperature
in the server room? It starts climbing, though if you only have 20 minutes of
power it usually won't get hot enough to matter. Most servers have some type of
thermal protection that will shut them down before they self-destruct. I don't
know what the actual internal setting is, but our stuff seems to shut down
when the ambient temperature is between 100 and 110 degrees Fahrenheit.
Wondering how I know?
About every six months our building maintenance people shut down the AC in
the entire building for part of Saturday to do preventative maintenance. On more
than one occasion everything has restarted correctly except our server room
units, which are on a separate circuit or something like that, supposedly to
make them less likely to fail! On another occasion they turned the AC off to
service the fire sprinklers in the building. Usually we catch it, but at least
once I've opened the door to find the temperature over 100 degrees. Typically
servers start to shut down, but it's a guess as to which ones will and which
ones have better cooling and will keep running. Once we had a server that lost
a processor, but after the room cooled we had the BIOS rescan and everything
was fine. A tribute to the
hardware, but not our finest hour.
We try very hard to make sure we check after any power outage or building
maintenance, but just in case we installed a temperature alarm connected to an
analog phone line that will notify several people if the temp exceeds 85 degrees
(the room is usually cooled to 72). I don't have the brand handy, but it cost
us about $250. Cheap insurance. Make sure you plug it into a UPS! Some models
also detect water, which could be useful depending on the environment you're
in. In larger buildings the maintenance staff usually get paged if a system
fails, but it never seems to work for us. In the most recent iteration of this
particular disaster, somehow they did maintenance without telling us, the AC
didn't reset, AND the alarm didn't go off!
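If you want a software backstop on top of the hardware alarm, something like
the sketch below works; the sensor address, threshold, and mail settings are
all made up, and it assumes a networked sensor that answers an HTTP request
with the current temperature as plain text.

    # Temperature-check sketch (hypothetical sensor URL and mail settings).
    # Assumes a networked sensor that returns the Fahrenheit reading as plain
    # text, e.g. "78.4", in response to an HTTP GET.
    import smtplib
    import urllib.request
    from email.message import EmailMessage

    SENSOR_URL = "http://192.168.1.50/temp"   # hypothetical sensor address
    THRESHOLD_F = 85.0                        # alert above this; room runs ~72
    SMTP_HOST = "mail.example.com"            # placeholder mail relay
    RECIPIENT = "oncall@example.com"          # placeholder recipient

    def check_temp():
        with urllib.request.urlopen(SENSOR_URL, timeout=10) as resp:
            temp_f = float(resp.read().decode().strip())
        if temp_f > THRESHOLD_F:
            msg = EmailMessage()
            msg["Subject"] = f"Server room temperature is {temp_f:.1f} F"
            msg["From"] = "temp-monitor@example.com"
            msg["To"] = RECIPIENT
            msg.set_content("AC may be down - check the server room now.")
            with smtplib.SMTP(SMTP_HOST) as smtp:
                smtp.send_message(msg)
        return temp_f

    if __name__ == "__main__":
        check_temp()   # run it from a scheduled task every few minutes

Just remember what the hardware alarm taught us: if the box running the check
sits in the same room on the same power, it can fail right along with
everything else.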
So, the quiz for today: what do you do if you lose AC right now but still have
power? How long can you continue to operate? Who do you call?
Your first priority is to start moving the heat out. Keep a couple of cheap
fans in or near the server room, and don't let people appropriate them. Get
the server room door open and the fans on high.
Next, get on the phone, or get someone else on the phone, to whoever works
on your AC. Make sure they understand it's an urgent problem. Even better, prep
them ahead of time about the importance of responding quickly should this
problem occur.
From there I recommend you take steps to reduce the heat load as much as
possible. Consider shutting down the inactive portion of any clusters. Turn off
non-critical servers; candidates might be print servers, secondary domain
controllers, file servers, switches, etc. It's nice to know ahead of time what
is critical and what is not. You may even need a hierarchy where you turn some
off immediately and turn off others as the temperature continues to rise.
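That hierarchy is easier to act on if it's written down as something you can
run. Here's a minimal sketch assuming Windows servers, the built-in shutdown
command, and an account with shutdown rights on those boxes; the host names
and tiers are made up.

    # Tiered shutdown sketch; host names and tiers below are hypothetical.
    import subprocess

    # Tier 1 goes immediately; tier 2 only if the temperature keeps climbing.
    TIERS = {
        1: ["printsrv01", "dc02", "filesrv02"],
        2: ["filesrv01", "reportsrv01"],
    }

    def shutdown_tier(tier):
        for host in TIERS.get(tier, []):
            # shutdown.exe: /s = shut down, /m = remote machine, /t = delay in
            # seconds, /f = force apps closed, /c = comment on the reason
            subprocess.run(
                ["shutdown", "/s", "/m", f"\\\\{host}", "/t", "30", "/f",
                 "/c", "Server room AC failure - shedding heat load"],
                check=False,
            )

    if __name__ == "__main__":
        shutdown_tier(1)   # kick off tier 2 by hand if the temp keeps rising

Keep a printed copy of the list too; the server holding your documentation may
be one of the ones you just turned off.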
If you've got a spot cooler (a portable AC unit), get it plugged in and
working. If you don't have one, get on the phone and rent one; they start at
about $100 a day. Make sure you know what type of electrical service is
required (these often use 220V), and even check the type of plug. Usually they
are not big enough to handle the whole heat load, which is why you need to do
the steps above first or concurrently. Most of these coolers also have a
condensation bucket that has to be emptied when full, or the unit will shut
down. The last one we used had a 2.5-gallon container, estimated to be good
for 8 hours in Florida with high humidity, or longer if you have partial AC
that is already reducing the humidity.
At this point you've done most of what can be done. Now assess how long you
can maintain operations. Possibly you can keep critical items running
indefinitely. Possibly you've only slowed the heat load and at some point it
will be better to shut things down gracefully rather than risk losing hardware.
If you've got a failover site, think about getting it activated. You also have
to look at the time of day; maybe you can shut down more stuff after 5 pm. If
this is happening during the hottest part of the day, you may gain a little
ground going into the evening hours.
It can happen. Wednesday afternoon the temp alarm went off; we confirmed the
temperature and that the AC was running. It turns out that at some point an
electrical contractor had run quite a bit of wiring over one of our ducts, and
it finally crushed the duct, reducing capacity just enough to let the temp
climb.
Luckily it was something we could fix and the temperature never got into the
danger range.
There are lots of things that can go wrong, of course, but maybe this will get
you thinking about a little extra preparation if this one ever happens to you.