Fire Drills

  • Good post as usual.

    I presume it's very different at different companies. Where I work we restore several times a month on different servers to test things, and we also have a job that restores a database each night onto an offline server, so the restores are certainly working nicely. However, we don't use the emergency backups a lot, though it happens on occasion, mostly for historical research in a few areas that demand it.
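
    A minimal T-SQL sketch of what such a nightly restore job might run, assuming a hypothetical database named Sales and hypothetical logical file names and paths (the poster's actual job will differ):

      -- Hedged sketch: restore last night's full backup onto a non-production
      -- server, overwriting the previous night's copy, so a broken backup or
      -- a broken restore surfaces the next morning. All names/paths are hypothetical.
      RESTORE DATABASE Sales_RestoreTest
      FROM DISK = N'\\backupserver\backups\Sales_Full.bak'
      WITH MOVE 'Sales'     TO N'E:\RestoreTest\Sales_RestoreTest.mdf',
           MOVE 'Sales_log' TO N'F:\RestoreTest\Sales_RestoreTest_log.ldf',
           REPLACE,      -- overwrite the copy restored the previous night
           STATS = 10;   -- progress message every 10 percent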

  • Couldn't agree less with the baseball, Steve (being a Brit, baseball doesn't float my boat ;)), but couldn't agree more about periodic restores.

    It's drilled into us all as IT workers that the most important job we have is to safely back up our data, yet it's seen by most, me included, as a tedious job. In my company, we make use of restores quite regularly; the last time I did a production db restore was about a week ago. More importantly, though, we tend to spread the skills around fairly extensively, so that if one of the first line guys wants a file restored on a network drive, they get to do it themselves. That's because the last thing you want when things get exciting is for recovery of data to have become a specialist skill.

    There's an awful lot in IT that you can achieve by learning the theory, but DR is, in my opinion, the most glaring exception to this. If something's broken, it's rare you'll be able to restore to an identical setup; you might need to cannibalise another server, lump two different apps onto one box, relocate your hardware, do a selective restore or make any number of other on-the-fly changes to get the business working again. In short, in a DR situation, the immediate solution will likely be a dirty one, and it requires practical familiarity with the systems you're running and, most importantly, knowledge of what kludges you can get away with safely. And that doesn't come out of a text book. Practice, practice, practice.

    I think I've said before on this forum that this is the area I believe to be the single biggest justification for the salaries we're paid. It's about having a skill you hope you'll never have to use.

    Semper in excretia, suus solum profundum variat

  • I completely agree on the need to test the restore process by starting with getting the tape (or whatever) from offsite storage.

    This tests a lot of things you need in a real DR...

    1) The tape can be found at the offsite location

    2) The tape can be returned to your DC within the SLA time

    3) The tape's contents can be copied to disk within the SLA times

    4) Your disk backup really got on to the tape

    5) You can do a successful restore from the backup

    6) You can hand over the applications for use within the SLA times

    Often the first few times you try this process, maybe 4 or more out of the above 6 are not achieved. Sometimes the admin processes get changed to meet the SLA times, sometimes the SLA gets changed, and sometimes (sadly) people remain in denial.

    Even when things normally work well, it is still frightening how often a tape backup fails and either nothing or only an incomplete backup gets sent offsite that day. In Windows land this seems to be accepted. In Mainframe land this was unacceptable 30 years ago, so there is no inherent reason why the tape backup process should be anything other than 100% successful 100% of the time. (Sorry about the rant.)
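
    One practical way to catch the "nothing went offsite today" case is a scheduled check against msdb for databases whose most recent full backup is too old. A rough sketch; the 24-hour threshold is just an illustrative assumption:

      -- Flag databases with no full backup in the last 24 hours (or ever).
      SELECT d.name,
             MAX(b.backup_finish_date) AS last_full_backup
      FROM   sys.databases d
      LEFT JOIN msdb.dbo.backupset b
             ON b.database_name = d.name
            AND b.type = 'D'            -- 'D' = full database backup
      WHERE  d.name <> 'tempdb'
      GROUP BY d.name
      HAVING MAX(b.backup_finish_date) IS NULL
          OR MAX(b.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());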

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • In my experience the fallibility of tape storage systems has more to do with poor manufacturing quality (and it gets worse by the year). I've lost count of the times we've been commissioning new servers and had to send back not just one, but two or even three tape drives! And yes, we only use quality brand kit.

    At best I'd rate tape storage reliability at 80%, which is dreadful ... for us, in practice, this means that we need resilient transaction log backup and management practices.
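
    As a rough illustration of what "resilient transaction log backup practices" can mean in T-SQL terms: frequent, checksummed log backups to disk, so a flaky tape costs minutes of data rather than hours. The database name and paths below are hypothetical:

      -- Hedged sketch: a log backup step you might schedule every few minutes via SQL Agent.
      DECLARE @stamp nvarchar(20), @file nvarchar(260);
      SET @stamp = REPLACE(REPLACE(REPLACE(
                       CONVERT(nvarchar(19), GETDATE(), 120), '-', ''), ' ', '_'), ':', '');
      SET @file  = N'G:\LogBackups\Sales_log_' + @stamp + N'.trn';

      BACKUP LOG Sales
      TO DISK = @file
      WITH CHECKSUM,    -- verify page checksums as the backup is written
           STATS = 25;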

    R

  • I've worked the full spectrum of the computer industry and have seen all kinds of DR schemes, including some locations with nothing more than printouts on paper (back in the mid-'80s). I also agree that tapes have too high a failure rate. The best place I worked had lots of extra ca$h and was able to purchase optical storage, which I never saw fail, but I think everything can fail given enough time. When I worked at a nuclear power plant, the NRC required a test every 18 months where we restored to an offsite location and got our systems running remotely.

    When I worked at an American textile plant, money was tight. We purchased one tape drive to back up our SQL Server (it was also our user data server and application server). We were able to cheaply duplicate the hardware (without the tape drive, because that doubled the cost). Each server was RAID-5, and then we essentially mirrored the primary to the secondary using transactions. The idea was that if one server was lost, we would not lose more than an hour of data and we could move our users to the other server. The plant could not run for more than 32 hours without the applications. We ran a few tests, but inevitably NAFTA, 9/11 and the economy caused the plant to go under. It seemed we had good plans, but the plant got auctioned off and most of it ended up in India.

    So, I think each and every case for disaster recovery should be evaluated for its needs. The big thing is to use common sense in planning before trying to throw hardware and backup plans at the issue. As in most computer applications I've worked on, planning (including test planning) should take about 65% of the project time. The implementation phase (hardware, software, and setting up the backups to plan) should take around 10% of the time. I think 20% should be running the test plan and correcting things that were unforeseen. Lastly, there needs to be a procedure for future growth; I usually allot 5% of the project time for this. It needs to account for growth of the company and its needs for disaster recovery. Also, a regular testing program should be implemented (yearly, monthly, weekly, etc.).

    -- Al

  • The fun part is that this kind of test can be fully automated, so you can run an in-depth test of restoring your backup plus in-depth validation testing. Between an MSBUILD project, DataDude, CruiseControl, some time building validation tests, and some spare hardware, you can do a lot, and just get notified when it didn't work. Set the goofy thing to fire once a day, even if it's on a VM you then throw away when you're done.
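
    A rough sketch of the T-SQL core such an automated job could fire on the throwaway server or VM; the build/scheduling/notification plumbing (MSBUILD, CruiseControl and so on) wraps around it. All names and paths here are hypothetical:

      -- Restore the latest backup, run an integrity check, then throw the copy away.
      -- If any step fails, the wrapping job fails and the notification goes out.
      RESTORE DATABASE Sales_DrTest
      FROM DISK = N'\\backupserver\backups\Sales_Full.bak'
      WITH MOVE 'Sales'     TO N'E:\DrTest\Sales_DrTest.mdf',
           MOVE 'Sales_log' TO N'E:\DrTest\Sales_DrTest_log.ldf',
           REPLACE;

      DBCC CHECKDB (Sales_DrTest) WITH NO_INFOMSGS;   -- in-depth validation pass

      DROP DATABASE Sales_DrTest;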

    And no - this doesn't entirely replace the actual full-blown DR test, but it should go a long way towards showing that your recovery plans are solid and that you care enough to make SURE they're solid.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user... All right - what was my emergency again?

  • It's a good practice to refresh the test environment from a backup of production. The disaster recovery procedure gets a thorough review, and the test environment reflects the state of the production environment just prior to the backup.

  • Ahh Major, if you get over to Denver in the summer we'll get you out there for a Billy Crystal moment and see how you like baseball 😛

    http://news.bbc.co.uk/2/hi/entertainment/7289502.stm

    http://nbcsports.msnbc.com/id/23566549/

    I agree that you can automate a lot of this, but you have to be careful about how you do it. We used to have some servers restore to each other and then drop the restored databases after being sure we could connect, run a SELECT COUNT(*), and then disconnect. However, you have to have the disk space (sometimes hard to come by), and you could cause some fragmentation here, so you want lots of spare space. You also have to have resources (CPU cycles on the restore machine), and it can't interfere with production. Not everyone can do this.

    If you have a smaller environment, maybe < 20 servers, then I'd hope you could bring back all databases to another spare server, maybe even one at a time, to test the restores. I hate restoring to the same system since I think it's only a partial test. It really needs to come back from tape, and I'd encourage you to identify the top 2-5 databases and pull them off tape once a month. Get a process set up to do this, even partially automated (a reminder pops up, you load the tape, click "go" and the rest is automated, including a note to put the tape away).
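
    Before committing to the full restore of a tape pulled back from offsite, a couple of quick sanity checks can confirm the media is readable. A sketch, with a hypothetical device name and the tape device syntax of the SQL Server versions of that era:

      -- What backup sets are actually on the media that came back?
      RESTORE HEADERONLY FROM TAPE = '\\.\tape0';

      -- Can the first backup set be read end to end?
      RESTORE VERIFYONLY FROM TAPE = '\\.\tape0' WITH FILE = 1;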

    Restoring to a test / dev server brings different issues. You don't want to interrupt development or testing, so you need extra space for a test db. You have to be careful of production data these days, so giving developers access to production data might be a problem. Depends on your situation and environment. Ideally you'd restore to a standby server in case the real one blows up, but you can probably only restore a few databases to those servers.

    It's not a simple thing, but it is something you can test: at least once a year, take a day or two and do a full application restore of the critical servers. You might be surprised what happens.

  • The Major brings up a good point - DR is more than just doing a successful restore & verify to another server - you need to be able to restore the complete environment if the server box goes bad. Things like web services and ODBC connections from the applications expect a specific configuration and they'll be offline until it's all put back together.

    I can't understand why tape is still so popular. Regardless of technology advances, it is still a mechanical device with moving parts that will eventually fail. Sometimes sooner rather than later.

  • Well, we used to restore all of our main databases nightly, but that hasn't been done in a while. Sadly, we sometimes need to restore one of our databases to another system to recover data lost through user error. Our code currently issues delete commands against the database for an awful lot of data that is "no longer needed", or so our users think. They sometimes don't realize the full impact of that decision until they can't find some of that data at a later point. Times like that I really wish we were using LiteSpeed because it offers table-level restores. I rarely need the whole database, but that's my only choice with every other tool.
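
    Without a table-level restore tool, the usual fallback is to restore the backup side by side under a different name and copy just the missing rows back. A hedged sketch, with entirely hypothetical database, table and column names:

      -- Restore the production backup as a separate recovery copy.
      RESTORE DATABASE Sales_Recovered
      FROM DISK = N'\\backupserver\backups\Sales_Full.bak'
      WITH MOVE 'Sales'     TO N'E:\Recover\Sales_Recovered.mdf',
           MOVE 'Sales_log' TO N'E:\Recover\Sales_Recovered_log.ldf',
           STATS = 10;

      -- Copy back only the rows the users deleted.
      INSERT INTO Sales.dbo.OrderHistory (OrderID, CustomerID, OrderDate, Total)
      SELECT r.OrderID, r.CustomerID, r.OrderDate, r.Total
      FROM   Sales_Recovered.dbo.OrderHistory AS r
      WHERE  NOT EXISTS (SELECT 1
                         FROM   Sales.dbo.OrderHistory AS o
                         WHERE  o.OrderID = r.OrderID);

      DROP DATABASE Sales_Recovered;   -- clean up once the data is back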

    Going forward, we're building in some more options in our code to soft-delete records (i.e. hide them). I hope that will also give us a little more freedom to start those nightly restores and DBCC's against the restored database. They were quite useful for several things, including testing our ability to restore.

    That being said, I've only once had a problem restoring a SQL Database from file (not tape) and that was because the restore was incomplete and failed due to running out of disk space. It was a great way to create a majorly corrupted SQL 6.5 database, though. 😀 I've definitely had issues from tape, but happily don't need those files as often. I think the worst was either SQL 6.5 SP2 or SP4 where the ability to restore from a tape drive had some big issues. They were fixed in a hotfix, but it was painful to the customers who actually needed to restore their data.

    There is another thread here talking about a backup without indexes. A compressed backup that contained just the data (with all log records committed) plus scripts for everything else would also be nice. Too many times I have problems restoring customer data due to disk space, only to find that the customer is not doing good maintenance and the unused log space is bigger than my whole laptop drive.
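
    For the oversized-log complaint, T-SQL can at least show how much of the log is empty and, on the source system, shrink it before the copy is taken (a restore recreates the log at its recorded size, which is exactly what eats the disk). The database and logical file names below are hypothetical:

      -- How big is each log file, and how much of it is actually used?
      DBCC SQLPERF(LOGSPACE);

      -- On the customer's source system, reclaim the unused log space
      -- before taking the backup that will be shipped out.
      USE CustomerDb;
      DBCC SHRINKFILE (CustomerDb_log, 1024);   -- target size in MB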

    Regarding baseball, I've used the hitter analogy a bunch when talking to sales types. A thousand calls get 300 appointments. Three hundred appointments yield 100 proposals. One hundred proposals turn into 30 engagements. The commission on 30 engagements should let you pick out whatever kind of car makes you happy, even for somebody like Jeremy Clarkson (wink to the Brits).

    On our side of the industry .300 would be terrible. In order shipping, if I only got 30% of the orders right, I would not be here long. We talk about five nines. Given the volume of transactions, and the value thereof, even that is not nearly good enough.

    ATB, Charles Kincaid

  • I do test restores at least weekly. Usually more often than that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
