If there’s one phrase that makes me laugh, whenever I hear it from developers or application owners, it’s "My system doesn’t have bad data!"
Chances are, it does. Chances are there are orphaned rows, incorrect values in columns, out of range values or just general garbage that sneaks in to any system once it gets used.
Customers who have the name ‘Dummy **DO NOT USE**’ with thousands of dollars of purchases, gender columns with values like ‘asdfdasl’ or ‘Yes’ in them, deliveries which were dispatched in the future and arrived before the fall of Rome, or purchases of products which just aren’t to be found in the product catalogue. I have seen all of these in production systems in the last couple of years.
Chances are that most production systems have some degree of bad data in them. But how does this happen? How do we end up with gender columns which contain 57 different values? Or shipment records which have no information about where the shipment was sent from?
I’m going to borrow and misuse a quote from the physicist Murray Gell-Mann:
“Everything not forbidden is mandatory.” (http://en.wikipedia.org/wiki/Totalitarian_principle)
If we do not explicitly prohibit bad data from entering the database, eventually some will get in there. Whether it be from bugs or missing validation in the application, someone connecting Excel to ‘just do some quick analysis’ or from the web page that the CEO’s nephew wrote up quickly last weekend.
Why do we allow this? Why do we not create the required constraints to prevent this kind of garbage from entering the database? Is it too much work? Are we too busy to do a complete design job? Is it ‘unagile’ to create lookup tables, foreign keys and check constraints? Or do we just enjoy so much cleaning up the bad data once it is in our systems?
I've laughed over some of these examples of bad data, after the event, but when you run across them in a database from which you're trying to get meaningful data, they aren't quite as amusing. If you'd like to share your weirdest or most worrying examples of "dirty data" in a production database, I'd love to hear them.
Gail Shaw (Guest Editor).