Disaster recovery

  • I've heard people talking about disaster recovery drills. Just curious to know what really happens in a disaster recovery drill?

  • You go through the process of rebuilding a server, as if a tornado (or whatever you are simulating) had destroyed a critical piece of infrastructure.

    You stop when the new server is online and working.

    Take notes on what worked, what bit you in the arse, and how long it took.

    Repeat until the process is optimal and can be done in your sleep... or, even better, by the most junior DBA.

  • It really depends on the size of the company and your disaster recovery plan. My former employer was rather large. We had an offsite facility, several hundred miles away, that was our DR site. Once a year, several people from IT would troop down there and rebuild a set of servers from scratch and backups until they were online and working. It took anywhere from one to four days, depending on how many things broke.

    But you can start small. Practice recovering your databases.
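
    For example, a minimal practice restore might look like the sketch below. The database name, backup path, and logical file names are placeholders; check your own with RESTORE FILELISTONLY first.

        -- Restore under a new name so the real database is never touched.
        -- All names and paths here are hypothetical.
        RESTORE DATABASE SalesDB_RestoreTest
        FROM DISK = N'D:\Backups\SalesDB.bak'
        WITH MOVE N'SalesDB' TO N'D:\Data\SalesDB_RestoreTest.mdf',
             MOVE N'SalesDB_log' TO N'D:\Logs\SalesDB_RestoreTest.ldf',
             RECOVERY, STATS = 10;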

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • The 'it depends' answer comes into play here. For critical systems, BEFORE we put them into production we go through a disaster test. We have another data center for DR, and we pick a day and pretend there is a disaster before the new system gets approval for production. We get the alternate system up and running at the disaster site and make sure the user base signs off that all the pieces and parts work. This is required before we are allowed to put it into production.

    For other systems, from time to time we do a test of getting the system functional at the disaster site. We don't cut live users over to it, but we have some functional people try out the alternate system to make sure it works. For SQL Server systems we tend to use DNS alias names for all database servers, so we can have an alternate system up and running, all connected. Then, in a disaster scenario, we can re-point the alias to the disaster site.
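
    As a quick sanity check after re-pointing, connecting through the alias and running something like this shows which physical instance actually answered (just a sketch; no specific alias name is assumed):

        -- Connected via the DNS alias: confirm which server responded.
        SELECT @@SERVERNAME                  AS instance_name,
               SERVERPROPERTY('MachineName') AS machine_name;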

    Also, from time to time I restore backups of the production databases to the SQL Servers that are already in place at the disaster site, just to make sure the backups are good.
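
    A rough sketch of that kind of backup check, with made-up paths and database name:

        -- Quick readability check of the backup file at the DR site.
        RESTORE VERIFYONLY FROM DISK = N'E:\DRBackups\SalesDB_full.bak';

        -- VERIFYONLY only proves the file is readable; an actual restore
        -- followed by DBCC CHECKDB is the stronger test.
        RESTORE DATABASE SalesDB
        FROM DISK = N'E:\DRBackups\SalesDB_full.bak'
        WITH REPLACE, RECOVERY, STATS = 10;

        DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS;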

    A lot of our application servers are VMs, so the disaster site can be brought up quickly from a VM image.

  • We did a few things at various companies.

    At one, once a year we sent a team offsite (random admins/DBAs/devs) and gave them backup tapes, installation media for the software, and a blank set of servers. They were responsible for recovering our main system there within a day. If it failed, we reviewed what went wrong the next day and rescheduled for a month later.

    At another, we would have someone create a disaster once a quarter. We would use a QA or dev server with a copy of a randomly chosen application, and someone would simulate a disaster: pull a drive, delete data, etc. We had to recover it.

    Practice restoring on a regular basis. This can be your normal "restore to dev" practice, but randomly test different ways to do it yourself. Make sure you have the skills to recover to a point in time, recover an object, etc.
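
    For the point-in-time piece, a minimal sketch looks like this, assuming one full backup plus one log backup taken after the bad change (file names and the STOPAT time are made up):

        -- Restore the full backup without recovery so log backups can be applied.
        RESTORE DATABASE SalesDB
        FROM DISK = N'D:\Backups\SalesDB_full.bak'
        WITH NORECOVERY, REPLACE;

        -- Roll the log forward, stopping just before the bad moment.
        RESTORE LOG SalesDB
        FROM DISK = N'D:\Backups\SalesDB_log.trn'
        WITH STOPAT = '2011-06-15 14:30:00', RECOVERY;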
