I have been a judge for the Exceptional DBA of the Year Awards ever since its inception, and one of the questions asked on the application form is “What is the biggest mistake you have made as a DBA, and how did you deal with it?” This question produces some interesting answers, many of which provide important lessons for all DBAs. I have collected some of those responses from past years (not from the current 2011 Exceptional DBA of the Year Awards entries), which I think you will find educational. I have edited these entries for readability and to remove any potentially confidential information. Hopefully you will learn from these DBAs’ mistakes, and not make the same ones yourself.
- I revoked my own sysadmin permissions.
- Instead of logging out of the SQL Server, I shut it down, right in the middle of the day when it was being heavily used.
- I was intending to drop a database from a test server, but didn’t realize I was actually on the production server, and as you have probably guessed, I deleted a production database.
- I set up a SQL Server Agent job to delete a single record each night, but I forgot to include the WHERE clause, so all the data in the table was deleted when the job ran (a safer delete pattern is sketched after this list).
- I had noticed that a SQL Server Agent job running an SSIS package had failed. Even though I was not familiar with what the SSIS package did, I manually re-ran the job, and it worked this time. Unfortunately, the SSIS package dropped a critical production database.
- I didn’t bother backing up some seemingly unimportant databases. When one of these databases got corrupted, there was no backup. I had to recreate the database and manually repopulate it myself.
- I neglected to pay attention to how fast our databases were growing, and we ran out of disk space with no warning.
- I ran a script given to me by our auditors to generate some reports, which caused the database to go into suspect mode. Now I don’t run any scripts unless I review them first.
- I accidentally restored an older backup over a production database. I had intended to restore the backup over a test database.
- I set up log shipping backwards.
- I restarted the SQL Server service in the middle of a schema change, which resulted in a production table being dropped.
- A user accidentally deleted some rows in a table, and I was asked to restore the database from the most recent backup. Unfortunately, I had never tested any of the backups, and none of them would restore (a simple backup-verification sketch follows this list).
- I restarted a SQL Server instance. When it came back online, one of the databases was in suspect mode. Not knowing what to do, I detached it and then tried to reattach it, which made matters worse. Now I know that detaching a suspect database is about the worst thing you can do (the sketch after this list shows the usual alternative).
- A large server was experiencing some disk-related errors, so it was decided that to fix the problem we needed to essentially rebuild the server. Of course, before doing so, I backed up all of the databases on the local machine. We then performed the hardware repairs, reformatted all the arrays, and reinstalled the OS and SQL Server. It was only then that I realized that I had forgotten to move the backups off the server we had just reformatted to another location. Fortunately, older backups were available, but not the most recent ones.
- My biggest mistake happened when I needed to take down a server to upgrade the disks that held the SQL Server data files. The plan was simple: take SQL Server down, move all the files off the RAID 1 array that held the .mdf files, replace the disks in the array (I think we were going from 73 GB drives to 146 GB drives), let the RAID 1 mirror rebuild, then put the files back, and go home. Very quickly we hit a snag (which later turned out to be a firmware bug): the new, larger disks weren’t recognized by the RAID controller. We then put the old drives back, and they weren’t recognized either. My biggest mistake was that I hadn’t backed up the databases or detached them. I’d simply moved the files after shutting SQL Server down, so when SQL Server was started it would be looking for data files that simply weren’t there! The only option was to get the server fixed, and luckily we had four-hour support on it. However, it was rapidly approaching midnight and I was due to attend my brother’s wedding the following morning! While we waited for the engineer and the replacement RAID controller card to arrive, I quickly returned home to pack enough things in a bag so that I could go straight to the wedding once the work was complete. Four hours later we had the server back up and running, and watching that partition initialize was excruciating! I managed to get home and catch a couple of hours of sleep, even though I was still buzzing from adrenaline and caffeine! The lessons from this were hard learned, and I have never since planned ANY upgrade, be it hardware or software, major or minor, without an escape plan after every step.
- While upgrading a large database from SQL Server 2000 to SQL Server 2005, one that held important financial data, it was decided by the business users that a large portion of the data in the database would be archived and then deleted from the production database. The end users worked with the vendor to get a script to handle this operation, and in testing, the process took over 30 hours, which was way beyond our maintenance window. So after some review, and discussion with the vendor’s support and development staff, I realized that it would be faster to create a new copy of the table, insert the rows to be kept into it, drop the old table, rename the new one, and rebuild all the constraints and indexes. This took the process down to about half an hour in testing. The problem came when I loaded the script on production to run the purge process. In testing, I had performed the operation as an explicit transaction, and initially I had left the SET IDENTITY_INSERT statements out of the batch by mistake. That wasn’t a problem in development, because I had wrapped the code in an explicit transaction; when the inserts failed because I hadn’t set IDENTITY_INSERT ON, I simply issued a rollback, waited, and then fixed the problem. For some reason, I never saved the corrected script. When it came time for the scheduled production downtime to perform the upgrade, I loaded the script, ran it, and the table disappeared in mere seconds. The first thing I did when I realized what I had just done was to notify my manager of the problem and let him know that the system wasn’t going to be available as quickly as expected, allowing him to decide how best to notify the end users if necessary. Then I took a walk around the building to think about what was going to be required to fix this in the shortest time possible. That two-minute walk gave me the time to relax for a moment and plan what I was going to do, rather than trying to shoot from the hip. The good thing was that I had backups, they had been validated numerous times during testing of the upgrade, and the database was fully logged, so I was able to back up the transaction log from production and restore the database side-by-side, applying log backups to roll the database forward to the point in time after the downtime began and before I executed the script. Then a few changes to the original script allowed me to move the data that was supposed to be retained from the restored copy into the production database, and the system came back online with zero data loss (the restore sequence is sketched after this list).
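Several of these one-line disasters, including the nightly delete job above, come down to a missing WHERE clause. Below is a minimal sketch of a more defensive pattern; the table and key names are invented for illustration. The delete runs inside an explicit transaction and only commits if exactly the expected number of rows was affected.

```sql
-- Hypothetical table and key, used for illustration only.
BEGIN TRANSACTION;

DELETE FROM dbo.StagingQueue
WHERE QueueID = 42;              -- the single record the job is meant to remove

IF @@ROWCOUNT = 1
    COMMIT TRANSACTION;          -- exactly one row affected, as expected
ELSE
BEGIN
    ROLLBACK TRANSACTION;        -- anything else is suspicious: undo it and fail the job
    RAISERROR('Unexpected row count in nightly delete job.', 16, 1);
END
```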
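The untested-backup story is one of the most common of all, and the fix is cheap. As a sketch (the file paths, database name, and logical file names below are made up), a backup can be sanity-checked with RESTORE VERIFYONLY and, far more importantly, proven by actually restoring it under a different name on a test server:

```sql
-- Quick check that the backup file is complete and readable.
-- WITH CHECKSUM only works if the backup itself was taken WITH CHECKSUM.
RESTORE VERIFYONLY
FROM DISK = N'X:\Backups\Sales_Full.bak'
WITH CHECKSUM;

-- The real test: restore it as a separate database somewhere harmless.
RESTORE DATABASE Sales_RestoreTest
FROM DISK = N'X:\Backups\Sales_Full.bak'
WITH MOVE 'Sales'     TO N'X:\TestData\Sales_RestoreTest.mdf',
     MOVE 'Sales_log' TO N'X:\TestData\Sales_RestoreTest_log.ldf',
     RECOVERY;
```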
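As for the suspect database that was detached, the reason that is such a bad move is that SQL Server will not reattach a database that was not cleanly shut down, so you can end up with files you cannot get back online at all. Restoring from a backup is always the preferred fix; when no usable backup exists, the usual last resort looks something like the sketch below (the database name is hypothetical, and REPAIR_ALLOW_DATA_LOSS can do exactly what its name says):

```sql
-- Last-resort repair of a suspect database when no usable backup exists.
ALTER DATABASE Sales SET EMERGENCY;                            -- make it accessible for inspection
ALTER DATABASE Sales SET SINGLE_USER WITH ROLLBACK IMMEDIATE;  -- repair requires single-user mode

DBCC CHECKDB (Sales, REPAIR_ALLOW_DATA_LOSS);                  -- may discard corrupted data

ALTER DATABASE Sales SET MULTI_USER;
```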
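The last story ends with zero data loss because everything needed for a point-in-time recovery was in place: validated backups, a fully logged database, and the ability to take a log backup after the mistake. The sketch below is not the DBA’s actual script; the database, file, path, and timestamp values are invented to show the general shape of a side-by-side restore rolled forward to just before the purge script ran:

```sql
-- 1. Capture everything logged on production since the last log backup.
BACKUP LOG Finance TO DISK = N'X:\Backups\Finance_tail.trn';

-- 2. Restore the last full backup as a new, side-by-side database,
--    leaving it ready to accept log restores.
RESTORE DATABASE Finance_Recover
FROM DISK = N'X:\Backups\Finance_Full.bak'
WITH MOVE 'Finance'     TO N'X:\Data\Finance_Recover.mdf',
     MOVE 'Finance_log' TO N'X:\Data\Finance_Recover_log.ldf',
     NORECOVERY;

-- 3. Apply the log backups in sequence, stopping just before the purge
--    script was executed (any intermediate log backups go first).
RESTORE LOG Finance_Recover
FROM DISK = N'X:\Backups\Finance_tail.trn'
WITH STOPAT = '2011-05-14 22:45:00', RECOVERY;

-- 4. Copy the rows that should have been kept from Finance_Recover
--    back into the production database.
```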
If you have any mistakes you would like to share, anonymously or not, please do so below.