Our Major Outage and Recovery

  • Our primary production database became corrupt a few weeks ago. Here's how it happened.

    1. I got an alert email at 2 PM on a Friday that backups were not occurring. (Not my choice, but we had been using Commvault for all backups.) My suspicion is that network card binding order and/or firewall changes blocked Commvault's access to perform the backups.

    2. After many hours of trying to get our systems team to diagnose things, they failed the SQL cluster over to the passive node the next morning (in order to check the active node). This occurred during reindexing (no, they didn't ask me if anything was running).

    3. The database came back up in suspect mode.

    4. Senior DBAs and even Microsoft Support were brought in, but there was no good way to fix the database, so a restore from Commvault backups was started. Ultimately we could only restore up to 6 AM Friday morning -- I never really understood why we couldn't get back to 1 PM Friday, when the log backups stopped.

    5. A likely cause was that even though the various mapped drives from the SQL cluster to the NetApp SAN were cluster/SQL dependencies, the mount points under one of those mapped drives, on which several data files resided, were not dependencies.

    Many restarts/failovers had occurred before without the mount points being explicit dependencies, though, so I'm not sure about this hypothesis. We do reindexing online -- SQL 2005 Enterprise 64-bit.

    Now that the senior DBAs have fired Commvault (it still backs up the native SQL backups), I'm wondering about their current process, which runs the full and differential backups into separate folders named by the day of the week -- the backups themselves always get the same name, with no date or timestamp in it. When the full backup runs on Sunday evening with "init", what happens if that process fails, since it's overwriting the only previous full backup on disk?
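
    As far as I can tell, the Sunday job boils down to something like this -- the database name and path here are my own placeholders, not the actual job:

        -- Fixed file name, overwritten in place every Sunday
        BACKUP DATABASE SalesDB
            TO DISK = 'E:\Backups\Sunday\SalesDB_FULL.bak'
            WITH INIT;  -- INIT overwrites the existing backup sets in the file

    Since INIT starts overwriting the file right away, a failure partway through would seem to leave neither a new full backup nor the old one.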

  • Then you have no good backup.



    Shameless self-promotion - read my blog http://sirsql.net

  • Well, we do have the backup of the backups on Commvault, but this seems like a flawed job design: if the new full backup process fails, neither the new backup nor the old one being overwritten is usable.

    I've always used maintenance plans, which write out each backup with a timestamp in the name and a cleanup task set to remove any backup (trn, dif, or bak) over a certain number of days old. However, the hand-written jobs we have now write to multiple BAK files at the same time, which is an improvement in speed and I/O.

    A cleanup task could still be used with this current arrangement, but it would require more disk space to be allocated.
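
    For what it's worth, combining the two ideas would look roughly like this -- striped files like the current jobs, but with a timestamp in the names (database name and paths are illustrative):

        -- Build a yyyymmdd_hhmmss stamp (SQL 2005 syntax: DECLARE, then SET)
        DECLARE @ts varchar(20), @f1 varchar(260), @f2 varchar(260);
        SET @ts = CONVERT(varchar(8), GETDATE(), 112) + '_'
                + REPLACE(CONVERT(varchar(8), GETDATE(), 108), ':', '');
        SET @f1 = 'E:\Backups\SalesDB_FULL_' + @ts + '_1of2.bak';
        SET @f2 = 'E:\Backups\SalesDB_FULL_' + @ts + '_2of2.bak';

        -- Striped across two files for the speed/IO benefit; unique names
        -- mean nothing existing gets overwritten
        BACKUP DATABASE SalesDB
            TO DISK = @f1, DISK = @f2
            WITH CHECKSUM;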

  • It would be nice to get more feedback on this.

  • Yes, it's flawed. NEVER delete a backup unless you are 200% certain you have a way to recover according to your RTO & RPO.

    I have the luxury of being able to do a full restore + CheckDB of all my backups (including logs!).

    Never leave home without it!

  • Best practice is to plan the backup strategy and retention period around how far back data must be recoverable in a contingency situation, and this has to be agreed with the business.

  • If an 8-hour full backup job fails partway through, leaving neither the new backup nor the previous one (with the same file names) usable, then you have to write out the new BAK files with different names and only delete the old backup files after the new one is complete.

    Note I said "if" -- I still haven't seen anything in Microsoft's documentation dealing specifically with this. Is it possible that when the new backup job fails, it leaves the old files intact? Unless I see more evidence one way or the other, I'll have to test this by killing the backup job in a test area and seeing what I'm left with.
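
    The safe pattern I'm describing would be something along these lines (database name and file names are hypothetical):

        -- 1. Back up to a NEW file; the previous full stays untouched on disk
        BACKUP DATABASE SalesDB
            TO DISK = 'E:\Backups\SalesDB_FULL_20120117.bak'
            WITH CHECKSUM;

        -- 2. Confirm the new file is readable before touching the old one
        RESTORE VERIFYONLY
            FROM DISK = 'E:\Backups\SalesDB_FULL_20120117.bak'
            WITH CHECKSUM;

        -- 3. Only after both steps succeed, delete the previous full; if
        --    step 1 or 2 fails, the old file is still there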

  • My guess is that you're done for. Let me know once you've tested.

  • Yes, just tested this by attempting a backup of a database onto an existing BAK file of about 20 GB. After 20 seconds I killed the backup command, and it left an unusable 2 KB BAK file on disk.
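
    For anyone who wants to repeat it, the test was essentially this (the scratch database, path, and session id are made up):

        -- Session 1: start overwriting the existing ~20 GB file
        BACKUP DATABASE TestDB
            TO DISK = 'E:\Backups\TestDB.bak'
            WITH INIT;

        -- Session 2: find the backup's session and kill it mid-flight
        SELECT session_id FROM sys.dm_exec_requests
        WHERE command = 'BACKUP DATABASE';
        KILL 53;  -- substitute the session_id returned above

        -- Verifying the leftover file then fails, confirming it's unusable
        RESTORE VERIFYONLY FROM DISK = 'E:\Backups\TestDB.bak';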

  • Sounds about right; I couldn't think of any other possible outcome, since the overwrite has to start right near the beginning.

    That's the exact reason why I datetime-stamp my files and delete only AFTER I've confirmed the new backup is valid. I always have 3 fulls left after the delete is done, and I keep enough logs to go back months. I have this massive luxury because our ERP DB is really small (25 GB).

  • Ninja's_RGR'us (1/17/2012)


    Sounds about right; I couldn't think of any other possible outcome, since the overwrite has to start right near the beginning.

    It could only work any other way if the backup process in SQL Server took the backup to a temporary file and only overwrote the destination on a successful backup -- but that would obviously double the storage requirements for the backup, which is why it doesn't do it!

  • Yes, so this leaves us with an eight-hour window where, if the backup job fails, our only good backup is in Commvault. If there has been no outage, you start your full backup again. If there has been an outage, you restore from Commvault.

    My view is that we need enough disk space to keep the previous full on disk until the new one is complete. I tell my managers disk space is just a cost of doing business. If you don't want a good disaster recovery posture, or you don't want new business, then don't buy disk space. If you do want those things, then either buy the space and change your business model (price your product so you can afford it) or go open a hamburger stand.

  • We had a similar situation with one of our clients, but the difference was that the database had already become corrupted when the full backup was performed. Subsequently the last good full backup was deleted, and we could not use it to restore.

    So after multiple hours of work we were able to retrieve most of the data. But now I have put a new plan in place.

    First run a CheckDB; if it succeeds, take a full backup to a new file with a timestamp in the name. Lastly, delete the older full and log backups.

    I do not use the same maintenance plan to take log backups and then delete older log backups. The log backup job just takes backups.

    The full backup plan will delete full and log backups only if there was a successful full backup and the backup file has been verified.
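
    In rough T-SQL terms the full-backup job looks like this (object names and paths are placeholders, not my actual jobs):

        -- Step 1: integrity check; the job stops here on failure
        DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS;

        -- Step 2: full backup to a new, timestamped file
        DECLARE @file varchar(260);
        SET @file = 'E:\Backups\SalesDB_FULL_'
                  + CONVERT(varchar(8), GETDATE(), 112) + '.bak';
        BACKUP DATABASE SalesDB TO DISK = @file WITH CHECKSUM;

        -- Step 3: verify the new backup before anything is deleted
        RESTORE VERIFYONLY FROM DISK = @file WITH CHECKSUM;

        -- Step 4: only after steps 1-3 succeed, delete older full and log backups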

    Blog: http://saveadba.blogspot.com/

  • I also run checkdb before taking the next full backup.

    One more way to save your arse is to NOT delete the backups completely. I keep a copy of the full from the first of the month plus all the logs since then. It takes a lot longer to restore, but it's better than losing the data.

    A cheap SAN, or any RAID disk array, can do that job. Even a cheap external drive.

    If all else fails it might take 48 hours to get back online, but YOU WILL get back online.
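
    The restore is just the standard chain -- the month's full with NORECOVERY, every log backup since, then recover (file names here are illustrative):

        RESTORE DATABASE ErpDB
            FROM DISK = 'F:\Archive\ErpDB_FULL_20120101.bak'
            WITH NORECOVERY;

        RESTORE LOG ErpDB
            FROM DISK = 'F:\Archive\ErpDB_LOG_20120101_0100.trn'
            WITH NORECOVERY;
        -- ...apply each subsequent log backup in sequence...

        RESTORE DATABASE ErpDB WITH RECOVERY;  -- bring it online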
