Disaster recovery is both an exciting challenge for many DBAs and a dreaded event that many hope they never experience. When a true disaster befalls your system, there is a tremendous amount of stress as we work to get things running again so clients can access data. Even the best-laid plans will often have a glitch, and while it's a great technical challenge, most of us would rather keep practicing and gaming out possibilities than actually experience a disaster.
The majority of our system disasters are localized to the actual machine handling our workloads. We don't often face outside disasters, like hurricanes, that disrupt and damage other parts of our infrastructure and even our lives. When that happens on top of a down system, it's a level of stress that can affect our health. I've been lucky in that I've experienced both kinds of disasters, but never at the same time. I hope you can say the same thing.
Let's assume some local disaster befalls your system. Hardware, software, it doesn't matter, but you end up with a corrupted database of some sort. This week I'm wondering if you have an idea of what the recovery time would be. How long before clients are up and running? Maybe more important, how long before you have the system rebuilt?
If you have some sort of High Availability (HA) plan, then you might be back up for clients quickly, in minutes or even seconds. The disaster really isn't over, however, since you are now running on less hardware than you planned. Until you can rebuild the downed node, you're still in disaster mode. If you're like most of the organizations that have employed me, you'll also be stressed as your formerly well-designed HA setup is now a single point of failure and you're scrambling to get the main node rebuilt.
We often think of a disaster as the time we're down because of some event. Once clients are being served again, we tend to relax and think the disaster is over. It's not, because until you replace the affected systems and bring them back online, you're more vulnerable to another Murphy's Law incident than you were before.
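If you want to put rough numbers on those two questions, a back-of-the-envelope calculation is a reasonable place to start. The Python sketch below is only an illustration under simple assumptions: restore time scales linearly with backup size, and the rebuild adds a fixed overhead for things like provisioning hardware and reinstalling software. The names `RecoveryEstimate` and `estimate_recovery`, and all of the figures, are hypothetical; plug in numbers from your own restore tests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryEstimate:
    client_downtime_min: float   # minutes until clients are served again
    exposure_window_min: float   # minutes until full redundancy is restored


def estimate_recovery(
    backup_gb: float,
    restore_gb_per_min: float,
    failover_min: Optional[float] = None,
    rebuild_overhead_min: float = 0.0,
) -> RecoveryEstimate:
    """Rough estimate; assumes restore time is linear in backup size."""
    if restore_gb_per_min <= 0:
        raise ValueError("restore_gb_per_min must be positive")

    restore_min = backup_gb / restore_gb_per_min
    rebuild_min = restore_min + rebuild_overhead_min

    if failover_min is None:
        # No HA: clients wait for the whole rebuild.
        return RecoveryEstimate(rebuild_min, rebuild_min)

    # With HA: clients return after failover, but you run without
    # redundancy until the downed node is rebuilt.
    return RecoveryEstimate(failover_min, failover_min + rebuild_min)


# Hypothetical numbers: 500 GB backup, 5 GB/min restore rate,
# 1 minute failover, 2 hours of hardware and OS rebuild overhead.
with_ha = estimate_recovery(500, 5, failover_min=1, rebuild_overhead_min=120)
without_ha = estimate_recovery(500, 5, rebuild_overhead_min=120)
```

With these made-up inputs, the HA case has clients back in a minute, yet the system stays a single point of failure for well over three hours. That second number is the one we tend to forget.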