When most of us think about doing a backup, we think of a SQL backup to disk or
tape - or for the more adventurous, using the capabilities of a third party
backup solution to do the same thing. Some combination of full, differential,
and log backups (you ARE doing log backups, aren't you?) gives us the ability to
recover if something bad happens.
Bad usually falls into two categories. The first is hardware. Something
happens to the server that forces you to restore or maybe even rebuild. The
second and more common is user error. Good applications usually do a pretty good
job of limiting the amount of damage a user can do (one common way is by just
marking items as deleted, not actually ever deleting anything) but it can still
happen. It can also happen at the developer level, something that makes it
through whatever flavor of QA you have in place. Where I work we have an 'in
between' scenario. We have a team that does nothing but process data received in
a zillion different formats and load it into our system; they also do
maintenance on existing data. That maintenance might consist of updating 100k
rows because we got wrong values from the client initially. We try to
standardize those operations, but obviously they represent another opportunity
for error.
Still, a good backup plan covers all of that, right?
Ok, you're right, it does. But let's think about what it means and how it
works. Let's say one of these power users runs an update that updates the wrong
column, sets the wrong value, or leaves off the WHERE clause. Something bad.
They call you, but the query has already completed. You can't actually bring the
database down because you've got other employees working against it (billable
time you know!), so you need to:
- Restore your last full backup (how long does it take to restore?) to a new
database (do you have the space?)
- Restore all the transaction logs up to 30 seconds before the mistake
happened (more time - see the sketch after this list)
- Join the original table to the new table in the restored copy and figure out
what action or actions will fix the problem (you probably have to do this,
not the person who caused the problem)
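For reference, here's a rough sketch of what those restore steps look like in
T-SQL - the database name, file paths, and STOPAT time below are all made up for
illustration:

-- Restore the last full backup to a new database (MOVE keeps the files
-- from colliding with the live database)
RESTORE DATABASE SalesCopy
    FROM DISK = 'D:\Backups\Sales_Full.bak'
    WITH MOVE 'Sales_Data' TO 'D:\Data\SalesCopy_Data.mdf',
         MOVE 'Sales_Log' TO 'D:\Data\SalesCopy_Log.ldf',
         NORECOVERY

-- Apply each log backup in sequence, stopping just before the mistake
RESTORE LOG SalesCopy
    FROM DISK = 'D:\Backups\Sales_Log_1.trn'
    WITH NORECOVERY

RESTORE LOG SalesCopy
    FROM DISK = 'D:\Backups\Sales_Log_2.trn'
    WITH STOPAT = '2006-05-15 10:29:30', RECOVERY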
If your database is any significant size this is going to take a while. I'd
bet half an hour minimum from the time you learn about the problem to having the
data corrected. If the db is large, it could be longer. You're tied up, users
potentially can't work or are working with bad/incomplete/incorrect data, and
you're burning disk IO to get the restore done.
So... options? You could give the power users restore permissions and let
them do it. Just kidding. It's one thing to look at letting a user initiate a
backup, quite another to let them do a restore on a production server.
If you're doing log shipping with a decent amount of latency, you can bring
the spare out of standby, set up a linked server to it, and query across servers
to get the information you need. Potentially this is faster than doing the
restore, but you need to practice it a couple of times - if you don't stop the log shipping
before the bad data goes across...well, back to plan A!
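If you go that route, the linked server piece is only a couple of statements -
the server, database, and table names here are placeholders, not anything from a
real system:

-- One-time setup: point a linked server at the log shipped standby
EXEC sp_addlinkedserver @server = 'STANDBY1', @srvproduct = 'SQL Server'

-- Compare the live table to the pre-mistake copy on the standby
SELECT live.OrderID, live.Amount, standby.Amount AS OldAmount
FROM Sales.dbo.Orders AS live
JOIN STANDBY1.Sales.dbo.Orders AS standby ON standby.OrderID = live.OrderID
WHERE live.Amount <> standby.Amount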
If you're replicating to a reporting server you might be able to get the data
from there as well. Again, it depends on your latency. Transactional replication
in a 100Mb switched environment typically lags by only a few seconds, so it's
worth considering slowing it down to give you that option. If you're using a
snapshot you may not be able to get 100% of the data you need, depending on the
frequency of the snapshot and how long after the snapshot the mistake happened.
If you're using
transactional, go stop the distribution agent right away!
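The distribution agent runs as a SQL Agent job, so if you know the job name (or
look it up in Replication Monitor) you can stop it straight from Query Analyzer -
the job name below is invented just to show the call:

-- Stop the distribution agent before the bad transaction reaches the subscriber
EXEC msdb.dbo.sp_stop_job @job_name = 'SERVER1-Sales-SalesPub-REPORTSERVER-1'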
Logging changes to a history table (auditing) is a very valid technique.
You've got the data you need right there; it's just a matter of determining which
rows to join to and what to fix. Including hostname and username in the auditing
table can make this a lot easier. The tradeoff is the extra disk IO generated by
the triggers on each table and the extra disk space needed to store the history.
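A bare-bones version of that kind of trigger might look like the following - the
table, columns, and trigger name are just an example, not a recommendation for
your schema:

-- History table holds the old values plus who/where/when
CREATE TABLE dbo.Orders_History (
    OrderID int NOT NULL,
    Amount money NULL,
    ChangedBy sysname NOT NULL DEFAULT SUSER_SNAME(),
    HostName nvarchar(128) NOT NULL DEFAULT HOST_NAME(),
    ChangedOn datetime NOT NULL DEFAULT GETDATE()
)
GO

-- Capture the 'before' image of every updated or deleted row
CREATE TRIGGER trg_Orders_Audit ON dbo.Orders
FOR UPDATE, DELETE
AS
    INSERT dbo.Orders_History (OrderID, Amount)
    SELECT OrderID, Amount FROM deleted
GO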
Log Explorer from Lumigent lets you back out the change at the transaction
log level - assuming you own the tool and you're not truncating the log!
In most cases ALL of the above methods will require you to get involved to
fix the problem. Depending on your environment you may feel comfortable giving a
few key users access to the auditing tables if you went that route, or access to
the reporting server if you're using replication.
This last idea may surprise you and it's not workable in all cases. Have the
users make a copy of the table (or of the chunk of rows they are about to change
if the table is huge) in MS Access. A 'make table' operation in Access is
reasonably quick (comparable to a 'select into') and gives the user the ability to
join to a local 'backup' table if they want to verify the changes afterwards or
if they need to back them out. Similar, but not as simple to implement, ideas
would be to show them how to create temp tables or to run a server-side process
using DTS/BCP that would make a copy.
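The server-side equivalent of the Access 'make table' is just a 'select into'
run before the change - the table and column names below are made up, and you'd
drop the copy once you're satisfied:

-- Snapshot the rows about to be touched
SELECT *
INTO dbo.Orders_Backup_20060515
FROM dbo.Orders
WHERE ClientID = 1234

-- ...run the update...

-- Verify (or back out) by joining the copy back to the live table
SELECT o.OrderID, o.Amount, b.Amount AS OldAmount
FROM dbo.Orders AS o
JOIN dbo.Orders_Backup_20060515 AS b ON b.OrderID = o.OrderID
WHERE o.Amount <> b.Amount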
As a DBA my primary concern is protecting the data and making sure the data
is available to the people that need it. Backups do that. Beyond that, it makes
sense to implement controls wherever possible to restrict the amount of damage
that can be done - requiring all data access to be done via stored procedures, for
example. Yet I know...and you should too...that mistakes will happen. If you
rely solely on a backup you're going to work harder and take longer to fix the
mistake than if you've thought through one or more of the above options and have
them ready to go. You also have to recognize that being the only one who can fix
the mistake makes you the bottleneck. As long as you can be responsive, that's
ok. On the other hand, having someone that can fix 'minor' mistakes without
calling you...is that a bad thing?
Agree? Disagree? Add your comments below and tell me what I got right...or
wrong!