This past week was the same as nearly every other week over the last five years in one respect: I received several emails from people around the world asking for help with corrupt databases, where their recent backups also contained the corruption. The three people who contacted me last week all had backups from before the corruption occurred, but did not want to lose data by restoring back that far in the past.
I think that it is inexcusable to be in the situation where the database and all recent backups are corrupt – it’s almost as irresponsible as having no backups at all. I really do not see how companies these days, who rely so heavily on data, can be so blasé about the integrity of their backups.
Even if the person responsible for dealing with SQL Server at the company is not a DBA, there has been enough publicity over the years about viruses, computers crashing, and other causes of data loss that anyone involved in IT in any capacity simply has to realize that protecting data is important.
However, I can understand how people make the mistake of assuming that, if the backup completes, it is definitely going to restore and is definitely going to be corruption-free. There are two common assumptions that underlie this mistake: the first is that SQL Server will not allow a backup of a corrupt database to be created, and the second is that the backup will not become corrupt after it has been taken.
Unfortunately both of these assumptions are incorrect. Yes, for a database created on SQL Server 2005 and later versions, when no-one has disabled page checksums, SQL Server will detect when the I/O subsystem has caused data file corruption. But backups do not check page checksums by default! You must use the WITH CHECKSUM option to enable this checking (and the generation of a checksum over the entire backup file) – and everyone should be using this option for all backups they perform.
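As a minimal sketch (the database name and backup path here are placeholders), enabling this on a full backup looks like the following:

```sql
-- Full backup with backup checksums enabled: page checksums are verified as
-- pages are read, and a checksum is generated over the whole backup file.
-- Database name and path are illustrative only.
BACKUP DATABASE [SalesDB]
TO DISK = N'D:\Backups\SalesDB_Full.bak'
WITH CHECKSUM, INIT, STATS = 10;
```

By default the backup stops if a bad page checksum is encountered (STOP_ON_ERROR); you can specify CONTINUE_AFTER_ERROR if you need to force a backup of an already-damaged database.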
Log backups don’t check your data file integrity (with the exception of a log backup after a minimally-logged operation) so you need to take data backups in order to do that. The more frequently you can take full backups, the more likely it is that your restore sequence will succeed. This is because it will be less likely to be tripped up by corruption in one of the differential or log backups that you’ll need to restore so as to get from the full backup point up to the most recent time. The more backup files that are involved, the more likely one of them will be damaged. You’ll have to balance this with the performance overhead (in terms of extra I/Os, potential for log file growth, and CPU usage if using compression) of performing a full or differential backup, plus the storage cost associated with having more, larger backups around.
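To illustrate the trade-off, a hypothetical schedule might combine all three backup types, each using WITH CHECKSUM (the names, paths, and frequencies here are placeholders, not a recommendation):

```sql
-- Weekly full backup
BACKUP DATABASE [SalesDB]
TO DISK = N'D:\Backups\SalesDB_Full.bak'
WITH CHECKSUM, INIT;

-- Nightly differential backup (changes since the last full backup)
BACKUP DATABASE [SalesDB]
TO DISK = N'D:\Backups\SalesDB_Diff.bak'
WITH DIFFERENTIAL, CHECKSUM, INIT;

-- Frequent log backups (FULL or BULK_LOGGED recovery model);
-- in practice each file name would include a timestamp
BACKUP LOG [SalesDB]
TO DISK = N'D:\Backups\SalesDB_Log.trn'
WITH CHECKSUM;
```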
I think I’d be safe in stating that most people do not validate the integrity of their backups. The omission of this step is what causes so many people to end up with corruption in the database and in the recent backups too. Without performing a restore, you don’t know for sure whether the backup you just took is valid or not. Even if you used WITH CHECKSUM, the backup file could have been written to a damaged I/O subsystem and already be corrupt. Or it could be damaged at any time afterwards by a faulty I/O subsystem. I like to call I/O subsystems ‘equal-opportunity corruptors’ – they’ll happily damage anyone’s file, no matter how important.
It isn’t good enough merely to perform a RESTORE VERIFYONLY, as it really just examines the backup header for validity. You have to use the WITH CHECKSUM option on the RESTORE statement to make it re-check the page checksums of the database pages in the backup and the checksum of the entire backup file.
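Assuming the same placeholder backup file as above, the verification statement would look something like this:

```sql
-- Re-checks the page checksums inside the backup and the backup's own checksum,
-- but does NOT prove that the database inside is logically consistent.
RESTORE VERIFYONLY
FROM DISK = N'D:\Backups\SalesDB_Full.bak'
WITH CHECKSUM;
```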
But even that isn’t good enough. Some data file pages could have been corrupted in memory and then written to disk with a valid page checksum. The only thing that will detect this is DBCC CHECKDB. This means that the only way to know whether the backup is valid and contains a valid database is to restore it somewhere and run DBCC CHECKDB on the restored database. If it succeeds, it tells you that the backup is valid, the database in the backup is valid, and that at the time the backup was taken there was no corruption in the production database.
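Here is a sketch of that process, assuming hypothetical logical file names (SalesDB and SalesDB_log) and a spare drive on a test server:

```sql
-- Restore a copy under a different name so the production database is untouched;
-- paths and logical file names are placeholders.
RESTORE DATABASE [SalesDB_Verify]
FROM DISK = N'D:\Backups\SalesDB_Full.bak'
WITH CHECKSUM, RECOVERY,
    MOVE N'SalesDB'     TO N'E:\VerifyArea\SalesDB_Verify.mdf',
    MOVE N'SalesDB_log' TO N'E:\VerifyArea\SalesDB_Verify_log.ldf';

-- Full logical and allocation consistency check of the restored copy
DBCC CHECKDB (N'SalesDB_Verify') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```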
The problem with this method is that it requires extra storage space to restore the database, which is likely to be a problem with multi-TB databases. Back when I was at Microsoft I invented and patented a method of opening the database in a backup without having to restore it. Although Microsoft never implemented the idea, other companies now have tools that allow similar operations to be performed, so space shortage is no longer a barrier to being able to run DBCC CHECKDB on your backups.
The other step that many people neglect is to perform a regular disaster recovery test where the full restore sequence is performed, to ensure that all the backups are valid and that whatever scripting mechanism you use to generate the restore statements actually works. This also serves as a handy way to check that you can still perform a restore within the downtime service level agreement – remember that restores take longer as the size of the database grows.
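A restore sequence test might look like the following sketch (the file names are placeholders); the point is that every backup in the chain is restored WITH CHECKSUM and WITH NORECOVERY until the final one:

```sql
-- 1) Most recent full backup
RESTORE DATABASE [SalesDB]
FROM DISK = N'D:\Backups\SalesDB_Full.bak'
WITH CHECKSUM, NORECOVERY, REPLACE;

-- 2) Most recent differential backup
RESTORE DATABASE [SalesDB]
FROM DISK = N'D:\Backups\SalesDB_Diff.bak'
WITH CHECKSUM, NORECOVERY;

-- 3) All log backups since that differential, in order
RESTORE LOG [SalesDB]
FROM DISK = N'D:\Backups\SalesDB_Log_0830.trn'
WITH CHECKSUM, NORECOVERY;

RESTORE LOG [SalesDB]
FROM DISK = N'D:\Backups\SalesDB_Log_0900.trn'
WITH CHECKSUM, RECOVERY;  -- final restore brings the database online
```

Timing this end-to-end on a server of similar capacity to production is also what tells you whether you can still meet your downtime service level agreement.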
To summarize, to increase backup integrity and the likelihood of being able to perform a successful restore, do the following:
- Use the WITH CHECKSUM option on your backups
- Take backups as frequently as you can
- Perform validation of your backups as soon as possible after taking them
- Regularly test restoring your database(s) from scratch
Even with all these steps though, you may still be unlucky and hit the situation where the database and backups are corrupt. This is why you cannot rely on backups alone for guaranteeing minimal data loss – you also need to have a synchronously-mirrored, redundant copy of the database in case the backups don’t work. Database mirroring, AlwaysOn Availability Groups, and various methods of I/O subsystem replication/mirroring can do this for you. Defense in depth – just as with security.
But that’s a whole other article… until next time, please validate your backups!