Should you ever delete data? In a production environment, do the benefits of deleting old data outweigh the possible risks?
Data quality is important. Whether you refer to it as data integrity, permanent retention, or simply maintaining a complete audit trail, it can be effectively argued that deleting data from a production database diminishes the big picture of your data. After all, any data that is worthy of storing, backing up, optimizing, and mining is worth storing permanently. Deleting data affects the ability to thoroughly research historical activity, and can impact reports and aggregations on the remaining data. Storing only the rolled-up data, such as end-of-year financial reports are often not sufficient, because auditors or financial personnel may need to drill down to the lowest level of detail. Other information, including certain healthcare data, is best kept forever (and in some cases, is legislated so) to ensure a proper legal record should it be necessary for judicial or civil proceedings.
The need to routinely delete data was far more critical when storage was more expensive, in terms of dollars and system time. Purchasing disks for storage has never been cheaper, and with modern 15000 RPM drives and solid state disks, data access times continue to improve. Removing data simply for the sake of saving bytes on a platter is not as critical as it was just a few years ago. Data can be retained indefinitely, in the original store or in a separate archive (another table or a different database altogether).
To be clear, I’m not taking on DBAs who use the DELETE functionality to eliminate data. A proper data retention policy would involve all levels of an organization, from the CXOs to the technical staff and end users. And a competent retention policy doesn’t have to mandate that data remains in the RDBMS – information can be stored in the database, database backups, the filesystem, magnetic tape or optical disk, or a combination of several of these. The specifics of permanent data storage should be dictated by how frequently or quickly the data would need to be accessed.
There are times when deleting data is expected and even commonplace. When staging data in temp tables or table variables, one would expect deletion of data during that processing. Any process that writes data out to an archive store would naturally need to delete data from the original location, though this could better be considered a move rather than a delete. Sensitive data which would never be reported on or reused is expected for the protection of customers or clients – the deleting of credit card numbers after a charge is successfully posted would fall into this category.
Unfortunately, this decision does not reside with database administrators alone, or even with their employing organizations. Some vendor applications will routinely delete older, less-often used data as part of a purge to better performance or decrease storage requirements. I recently experienced this with a healthcare vendor during a conversion from their product to a newer system. It was discovered during the planning phase of the conversion project that this vendor’s system was hard-coded to purge the detail data from old accounts. Although we were able to reconstruct some of the data using other means, the ability to thoroughly report on that historical data has been permanently and irreversibly diminished.
The bottom line is that you should ask yourself whether you could ever need the data you are deleting. You shouldn’t just ask whether it is likely that you will need the data again – approaching from this angle will eventually come back to bite you. A more appropriate question would be whether you can imagine any scenario, however unlikely, that would require you to reference the data in the future. Eventually your boss/the board/the CFO/the auditors will come calling, and you’ll be glad you have your safety net.