Backup and Restore

  • Rainbow Six

    I think it's important to practice restoring and rebuilding your systems. You never know when you'll encounter problems and need to restore your data onto another piece of hardware. It's definitely something that people don't practice often, and with documentation not usually being up to par, it's something people should be comfortable performing.

    I saw an article aimed at CIOs that seemed to suggest that IT people aren't testing their systems enough. The statistic in the article was that 89% only test once a year, and that 67% were only minimally confident that their disaster recovery systems would work as planned.

    I think it's pretty good that 89% test their systems. I'm guessing the other 11% do it more than once a year, but I'd be concerned if 89% hadn't tested at all! I think once a year is a pretty good plan given the hectic schedules, heavy workloads, and the lack of buy-in from many executives for a DR plan. It's not like we're the guys from Rainbow Six who could be called on multiple times a year to perform. In all likelihood most of us will never encounter a disaster in our careers.

    And if you do, it won't go as planned. I've had disasters and disaster tests in my career and none of them went as planned. It's like the quote I've seen: "No battle plan ever survives contact with the enemy." We could easily paraphrase this for IT and note that no DR plan will ever survive a disaster. You need people who can adapt, think in the heat of the moment, and get things working when the plan doesn't cover every contingency.

    As DBAs I think most of us tend to practice the DR stuff on a more regular basis as part of our jobs. Many of us are constantly restoring databases for QA and development, getting practice that sysadmins don't usually get. We also have the advantage of being separated from the hardware and we can install SQL Server and restore databases on any Windows server host.

    I think you should practice your DR skills in little bits. Maybe grab the tail of the log and do a point-in-time restore for your next QA cycle (there's a rough sketch below). But practicing a full plan more than once a year isn't practical for most of us.
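    A minimal sketch of that tail-of-the-log, point-in-time restore. All of the database names, logical file names, paths, and the STOPAT time are made up for illustration, and it assumes the database is in the full recovery model with no other log backups to restore in between:

      -- Take a log backup to capture the latest activity. In a real disaster you'd
      -- add NORECOVERY (and NO_TRUNCATE if the data files are gone) to grab the true tail.
      BACKUP LOG Sales TO DISK = N'X:\Backups\Sales_tail.trn';

      -- Restore the last full backup as a new copy for QA, leaving it ready for log restores.
      RESTORE DATABASE Sales_QA
         FROM DISK = N'X:\Backups\Sales_full.bak'
         WITH MOVE N'Sales' TO N'D:\Data\Sales_QA.mdf',
              MOVE N'Sales_log' TO N'L:\Logs\Sales_QA.ldf',
              NORECOVERY, REPLACE;

      -- Roll the log forward to a point in time and bring the copy online.
      RESTORE LOG Sales_QA
         FROM DISK = N'X:\Backups\Sales_tail.trn'
         WITH STOPAT = '2007-06-01 14:30:00', RECOVERY;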

  • Steve, for the sake of argument, I'll agree that "No battle plan ever survives contact with the enemy." However, most IT shops do NOT have a valid DR plan to start with. You're completely misinterpreting that 89% number!! The CIO Insight article says, "a whopping 89 percent test their disaster recovery/failover systems only once a year or not at all" -- they don't break out the number for those who don't at all from those who test once a year. [Note: if they're not even testing their failover clusters, that means they're not even applying security hotfixes.]

    I believe those annual testers are "live testing" their plans under fire... Premier Support gets these calls every day. Even the annual testers are NOT accounting for the typically high turnover rate in the industry, and even among departments in larger companies. Nor are they accounting for the lack of communication between the hardware monkeys and the data monkeys... If your once-a-year test didn't include testing your SLA or OLA for hardware replacement from your internal monkeys or external vendors, it wasn't a valid test.

    Now back to the subject of The Plan. The PMI mantra that "having no plan is a plan to fail" is one of the truest maxims in DR. No plan is the current state of the industry, and it's truly sad. Apparently the war stories of IT folks who dutifully back up all their data and are then horrified to find at disaster time that all of those untested backups are invalid, broken, or corrupted have done nothing to convince the majority of "senior" DBAs to write and PRINT on dead trees a DR plan before the disaster, and to require the most "junior" DBA on the team to implement it quarterly. The cost of not planning business continuity is huge.

    Minimum quarterly. There's no reason a responsible shop can't automate the restore test for every single backup. Log shipping will automate this test for you and leave you with a warm spare to boot!! I sleep better at night knowing that my phone will ring with an alert from the NOC if one of my log shipping secondaries gets a corrupt log backup delivered... Database mirroring isn't much harder to set up, either. The cost of cluster-certified hardware has come waaay down in the past five years, and rumor has it that self-certified Longhorn clusters will make failover clustering even more economical (not to mention more robust, with new quorum technology).
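    If log shipping or mirroring isn't an option, even a home-grown nightly job is better than never restoring anything. A minimal sketch, with every database name and path below being a placeholder:

      -- Restore the most recent full backup as a throwaway copy on a test instance.
      RESTORE DATABASE Sales_RestoreTest
         FROM DISK = N'\\backupserver\sqlbackups\Sales_full.bak'
         WITH MOVE N'Sales' TO N'D:\RestoreTest\Sales_RestoreTest.mdf',
              MOVE N'Sales_log' TO N'D:\RestoreTest\Sales_RestoreTest.ldf',
              REPLACE, STATS = 10;

      -- Prove the restored copy is actually usable, not just restorable.
      DBCC CHECKDB (Sales_RestoreTest) WITH NO_INFOMSGS;

      -- Clean up so the job can run again tomorrow.
      DROP DATABASE Sales_RestoreTest;

    If the restore or the CHECKDB fails, the job fails and somebody gets paged. That's the point.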

    It doesn't take very much time looking at MTTF numbers for hard drives, RAID controllers, processors, etc., and doing a little mental math about the number of spindles in your array or processor cores in your clusters before you realize that YOU are likely to see a failure on your watch, unless you're working in the teensiest shop or on a trivial deployment. Unless you are changing out hardware like most people change their undies, it's going to happen. A failure is not necessarily a disaster if you're prepared.

    I guess the number of people who still pay the tax on being bad at math (a.k.a. the lotto) probably explains why folks are willing to gamble with their company's data and their own jobs...

    Just be sure you've got a backup of your resume...  Heh.

    At the two companies I have worked at that had a failover server, we tested at least once every three months, plus whenever we had patching, of course. The every-three-months test was a full-scale failover of the entire server room, including testing the battery backups and even shutting those down so we could test bringing the entire company back online locally. It wasn't fun, that's for sure, but even after the eighth time of doing it, we still found weaknesses in our plan.

    I'm also known for asking the network group for random tapes just to test them. The latest one I did was an offsite tape that was 15 months old.

  • I did miss the "not at all", which is bad. You definitely need to test systems, though honestly I'm more concerned about the Windows or application level than the SQL level.

    Here's the reason: SQL is a fairly static environment in most places. We might make schema changes, but the server itself doesn't change that much, it's easy to install SQL and reattach databases, and the restore process is simple.
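    For what it's worth, the reattach really is about one statement once the data and log files have been copied to the new server. The database name and file paths here are just examples:

      -- Attach a copy of the data and log files on the new server.
      CREATE DATABASE Sales
         ON (FILENAME = N'D:\Data\Sales.mdf'),
            (FILENAME = N'L:\Logs\Sales_log.ldf')
         FOR ATTACH;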

    It just works and it's the reason I picked backup/restore as my favorite feature. It works.

    The exception to that might be restoring master. Some people definitely need some practice on that one, but once you get master restored, everything else tends to work.
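    From memory, the rough sequence for master looks like the sketch below. The instance name and backup path are just examples, and it's worth rehearsing somewhere that isn't production:

      -- 1. Stop the instance, then start it in single-user mode, for example:
      --       net stop MSSQLSERVER
      --       net start MSSQLSERVER /m
      -- 2. Connect with sqlcmd (or SSMS) and restore master:
      RESTORE DATABASE master
         FROM DISK = N'X:\Backups\master_full.bak'
         WITH REPLACE;
      -- 3. The instance shuts itself down once master is restored; restart the
      --    service normally, then restore msdb and the user databases as needed.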

    Now the Windows side could be different, and that could have more of an impact on your applications.

    Having a warm server is a great idea, but it's not always possible in practice for a variety of reasons. These days, though, with disk space being so cheap, you should at least be pulling your top one or two databases to another server just in case.

  • From past experience I would say that budget constraints make a proper test hard to carry out in many shops.

    You may be able to prove that a backup can be restored, but there is a lot more to DR than simply restoring from a backup.
