Having just read Santveer Singh’s article Which databases are being backed up (another look), I thought the following might add something to it. Our situation is slightly more complex. We develop software applications that make use of replicated databases. Our users are distributed across Europe, and in some cases consist of small offices with no real system administration personnel. Consequently we typically manage their servers for them remotely. Part of this is the backups, and with the number of servers involved this is a bit of a nightmare. I came up with the following solution which hopefully might be of use to others as well. In addition to checking back up status we also use it to monitor disk space for critical disks, and replication status, though for conciseness I’ll ignore that part for now.
We use maintenance plans to manage our backups. Each system database is backed up once daily, as is each application database. Transaction log backups for application databases are taken every 15 minutes. Backups are done to a separate live disk on the machine, and then backed up to tape as part of the sites daily tape backup. The maintenance plan is set to keep a single full backup and 24 hours of transaction log backups (in the event that anything earlier is needed we can ask the site to pull the relevant tape, but this has not been needed to date). The system I’ve put in place tells us if the right number of backups exist, if any backups have failed, if any verifies have failed, or if any of the deletes have failed – the failed delete is particularly important as the databases are large, and failures to delete could rapidly fill a disk.
My system is simple enough. It uses msdb.dbo.sysdbmaintplan_history as the source of its information. This table is queried hourly to extract the statistics we require, and the results placed in a table in the database SysAdmin (a database we use on all our servers for holding system management related data and procedures). This table is replicated hourly to one of our local servers. We could have avoided replication since one of the maintenance plan options is to send history information to a remote server. We decided against this for the following reasons:
- We had replication in place any way
- It was something that could be missed when setting up the maintenance plan
- The current approach generates less network traffic
- We had no idea how reliable the alternative would be
Once the information is in place on our local server we’re free to extract any statistics we require from it. We currently do this in two ways. There is a daily task scheduled on the server that extracts information on problems and mails a report to designated users. Also I have written an MMC snap-in that displays the current status of all sites and highlights any problems.
The various tables, views and procedures to implement this are shown below.
BackupList is used to specify which databases should be checked on each server. The server column should match the value returned by @@SERVERNAME for the server in question.
CREATE TABLE [dbo].[BackupList] ( [guid] uniqueidentifier ROWGUIDCOL NOT NULL , [ID] [int] IDENTITY (1, 1) NOT NULL , [Server] [varchar] (50) NOT NULL , [DatabaseName] [varchar] (50) NOT NULL ) ON [PRIMARY]
BackupLog is used to store the results of our queries of the maintenance plan histories.
CREATE TABLE [dbo].[BackupLog] ( [guid] uniqueidentifier ROWGUIDCOL NOT NULL , [TestDate] [datetime] NOT NULL , -- Default GetDate() – records time of last test [DatabaseName] [varchar] (50) NOT NULL , [Server] [varchar] (50) NOT NULL , [IsSystemDB] [bit] NOT NULL , -- Set if the database is a system database [BackupCount] [int] NOT NULL , -- Number of full backups in last 24 hours [BackupFails] [int] NOT NULL , -- number of failed backups in last 24 hours [TransCount] [int] NOT NULL , -- number of transaction logs in last 24 hours [TransFails] [int] NOT NULL , -- number of failed transaction logs in last 24 hours [VerifyFails] [int] NOT NULL , -- number of failed verifies in last 24 hours [DeleteFails] [int] NOT NULL -- number of failed deletes in last 24 hours< ) ON [PRIMARY]
Code
The following procedure performs the actual checks for a specific database
CREATE PROC dbo.adm_CheckBackups @dbname varchar(50), -- database to check @IsSystem bit=0 -- indicates a system database AS DECLARE @tlog_fails int, @tlog_count int, @verify_fails int DECLARE @backup_fails int, @backup_count int, @delete_fails int -- Return count of failed transaction log backups in the last 24 hours SELECT @tlog_fails=COUNT(*) FROM msdb.dbo.sysdbmaintplan_history WHERE database_name LIKE @dbname AND activity LIKE 'Backup transaction log' AND start_time > DATEADD(hour, -24, getdate()) AND succeeded=0 -- Return count of transaction log backups in the last 24 hours, -- whether they succeeded or not SELECT @tlog_count=COUNT(*) FROM msdb.dbo.sysdbmaintplan_history WHERE database_name LIKE @dbname AND activity LIKE 'Backup transaction log' AND start_time > DATEADD(hour, -24, getdate()) -- Return count of failed verifies in the last 24 hours SELECT @verify_fails=COUNT(*) FROM msdb.dbo.sysdbmaintplan_history WHERE database_name LIKE @dbname AND activity LIKE 'Verify Backup' AND start_time > DATEADD(hour, -24, getdate()) AND succeeded=0 -- Return count of failed full backups in the last 24 hours SELECT @backup_fails=COUNT(*) FROM msdb.dbo.sysdbmaintplan_history WHERE database_name LIKE @dbname AND activity LIKE 'Backup Database' AND start_time > DATEADD(hour, -24, getdate()) AND succeeded=0 -- Return count of full backups in the last 24 hours, whether they succeeded or failed SELECT @backup_count=COUNT(*) FROM msdb.dbo.sysdbmaintplan_history WHERE database_name LIKE @dbname AND activity LIKE 'Backup Database' AND start_time > DATEADD(hour, -24, getdate()) -- Return count of failed deletes in the last 24 hours SELECT @delete_fails=COUNT(*) FROM msdb.dbo.sysdbmaintplan_history WHERE database_name LIKE @dbname AND activity LIKE 'Delete old%' AND start_time > DATEADD(hour, -24, getdate()) AND succeeded=0 BEGIN TRANSACTION -- Clear the previous results for this database on this server DELETE FROM BackupLog WHERE Server=@@SERVERNAME AND DatabaseName=@dbname -- Create a new record with the current information INSERT BackupLog(DatabaseName, Server, IsSystemDB, TransCount, TransFails, VerifyFails, BackupCount, BackupFails, DeleteFails) SELECT @dbname, @@SERVERNAME, @IsSystem, @tlog_count, @tlog_fails, @verify_fails, @backup_count, @backup_fails, @delete_fails -- If there are any problems put things back as they were IF @@ERROR<>0 ROLLBACK TRANSACTION ELSE COMMIT TRANSACTION
The next procedure calls the check procedure for each applicable database – it is this procedure that is called hourly as an sql server agent job:
CREATE PROCEDURE adm_CheckAllBackups AS -- First do the user databases specified in the control table DECLARE cr CURSOR READ_ONLY FOR SELECT DatabaseName FROM BackupList WHERE Server LIKE @@SERVERNAME DECLARE @dbname varchar(50) OPEN cr FETCH NEXT FROM cr INTO @dbname WHILE (@@fetch_status <> -1) BEGIN IF (@@fetch_status <> -2) BEGIN EXEC adm_CheckBackups @dbname, 0 END FETCH NEXT FROM cr INTO @dbname END CLOSE cr DEALLOCATE cr -- finally do the system databases – these are done automatically, -- and do not need to be specified in the BackupList table EXEC adm_CheckBackups 'master',1 EXEC adm_CheckBackups 'model',1 EXEC adm_CheckBackups 'msdb',1 -- The distribution database will not exist on all servers, so a -- check for its existence is performed first IF EXISTS (SELECT * FROM master..SysDatabases WHERE name LIKE 'distribution') EXEC adm_CheckBackups 'distribution',1
Almost done now – the following view is used to support the summary which is emailed to staff. Notice that the first error checked for is if an update hasn’t been received. This would indicate either a replication failure or a failure to run the statistics gathering job on the remote server.
CREATE VIEW vBackupReport AS select 1 AS Severity, 'Update not received' AS Error , Server, DatabaseName, TestDate from backuplog where testdate<DATEADD(day, -1, getdate()) union select 5 AS Severity, 'System Backup not taken' AS Error , Server, DatabaseName, TestDate from backuplog where backupcount=0 and issystemdb=1 union select 5 AS Severity, 'System Backup failed' AS Error , Server, DatabaseName, TestDate from backuplog where backupfails>0 and issystemdb=1 union select 9 AS Severity, 'Application Backup failed' AS Error , Server, DatabaseName, TestDate from backuplog where backupfails>0 and issystemdb=0 union select 9 AS Severity, 'Application Backup not taken' AS Error , Server, DatabaseName, TestDate from backuplog where backupcount=0 and issystemdb=0 union select 6 AS Severity, 'Application Backup (transaction) failed' AS Error , Server, DatabaseName, TestDate from backuplog where transfails>0 and issystemdb=0 union select 6 AS Severity, 'Application Backup (transaction) not taken' AS Error , Server, DatabaseName, TestDate from backuplog where transcount<90 and issystemdb=0 union select 4 AS Severity, 'Backup Verify fails' AS Error , Server, DatabaseName, TestDate from backuplog where verifyfails>0 union select 2 AS Severity, 'Backup Delete fails' AS Error , Server, DatabaseName, TestDate from backuplog where deletefails>0
The output from this view is formatted a bit and send by email to a distribution list on our site.
Conclusion
Hopefully the above might prove useful to someone. It’s all fairly simple stuff, but it has made checking of remote system states far easier for us, and has given us a much greater degree of confidence that every site is working as it should be.