Every time you build a table you need to decide right then if and when data
should be removed from it, either to be archived to offline storage or to just
be deleted. How many of your applications have one or more tables that are just
for logging - maybe something a process records in case something breaks, or an
auditing table to track changes? Or tables that keep a real history of something
done, but that history just isn't worth much a year or two later?
I've seen this happen several times over the past year, but one incident
lately really brought it in to focus. I was doing some maintenance on a very old
process app that emailed various groups any time certain changes were made in
our project management system. This is really a workflow notification process.
Each time it runs it would calculate which changes had occurred, email those
changes, log them to a table, and update the table so the notification
didn't get repeated. An example might be that a project went from in progress to
final - to indicate to accounting that they need to close out the invoicing
process. Even if they don't get the email, how bad is it? The data still has the
correct status and it's reasonably easy to see what should be where.
So what good is the log table? If the process breaks in mid process (ah, the
reason for the above mentioned maintenance), you can just restart it since
notifications that were completed won't get processed again. No reason to look
at the log table. Possibly if you made a change to the logic and wanted to
validate the output you could use it for a few days to double check? Beyond
that, would I ever care what email got sent to who for something relatively
trivial like this?
Remember, I'm not saying that all log tables are junk. They definitely are
not! But in this case it was really a debugging tool (also known as CYA). So how
much history is worth keeping to handle things like 'hey, I didn't get notified
on that last week'?
Thirty days? A year? Anyone want to guess how much history we have? It starts
in 1996!
It's horrible, truly horrible. We're reindexing, backing up, storing data
that has NO value to us. We're spending server IO and cpu time inserting the
records. Yes, we do that stuff as part of our daily maintenance but it adds up.
How many more of these do I have?
It's not just log tables either. Do you really need those contacts you added
from a seminar back in 1998 that never panned out? Or the list of all changes
made to the orders table from 1997, years after the books have been reconciled
and closed?
If you decide when you build it, it's easy. Six years later? I have to make
sure no one has decided to report on the data, use it to trigger/augment some
other process, etc. I can't just truncate the table!
I'm still working out the best way to handle these issues, so far I've been
implementing one 'CleanUp' job per database that has multiple steps containing
things like 'delete from history where dateadded < getdate()-7'. Not
everything is that simple of course, sometimes you have to handle getting it on
tape or CD first. Another issue is that you don't always know what makes sense
for archiving when you build it. My recommendation is to either review all your
tables once a year or set a reminder (in a SQL job!) to notify the DBA in a year
that an archive plan needs to be implemented. An alternative technique is to
manage the task from within the various applications. Sometimes that seems
right, sometimes not.
So what about it? Got an archive plan for your tables? Have a story worse
than mine? Let's talk about it! Post your comments in the attached discussion
forum and we'll see what happens!