The answer: It Depends.
One of the more common problems I encounter when managing data quality, especially in an ETL process, is the handling of null values, blanks, and zeroes. When I put on my preaching shoes to talk about bad data, this is one the areas I have to spend a lot of time covering because it is so often overlooked as a stumbling block for data quality. Improperly interpreting what is represented by these values can lead to inconsistent results, dirty data, and incorrect answers to business questions.
Data Modeling
When managing text data, more often than not an empty string is treated the same way as a null value. For example, if a record in a Person table has a valid, non-blank value in the FirstName and LastName field but a non-null, empty string value in the MiddleName field, it’s usually assumed that the person has no middle name, or that the middle name is unknown. However, what if the data in question is the database for the official birth registrar for the municipality? A record with an empty string for the middle name could show that the person’s official birth record indicates that they legally do not have a middle name, while a null value in that field reflects an unknown value for the middle name (possibly to accommodate old records transcribed from incomplete handwritten birth certificates). In this case, an empty string value for MiddleName represents a completely different situation than a null in the same field.
It’s even trickier with numeric data. Zeroes and nulls are often treated as one and the same, but in many business cases, a null value is intended to be treated differently than a zero. Let’s say I’m working for a sales organization, and I want to calculate some metrics on some of our sales leads. If I’m arbitrarily interpreting a null value as a zero in the HouseholdIncome field, I’m going to get some incorrect results when calculating, for example, the average household income of our potential leads (since zeros are factored into averages, while nulls typically are not). In this case, a zero value indicates that the reported household income for that lead is zero, while a null value could indicate that the person refused to provide any answer for that field.
The ETL Factor
Complicating both of these is the exchange of data in ETL. Assuming the data is correct in the source system, it is possible – in fact, I would say it is likely – that some part of the ETL process can misinterpret what nulls, empty strings, and zeroes are supposed to represent. Part of this is a lack of standard nomenclature. How does one represent a null in a character field? If an empty string is used do indicate a null, then how does one indicate an empty string when that value should be separate and distinct from a null? On occasion, I’ve seen the literal word NULL used in text data to represent a null value. Find yourself a phone book (do they still print those?) of sufficient size, and you’ll find that there are individuals with the last name of Null. It can’t always be assumed that the string literal Null can be interpreted as a null value.
There’s no silver bullet for the null/empty string/zero quandary. But there are ways to mitigate the possibility of a misinterpretation causing pains in your database or ETL process:
- Know your data. I don’t mean that you just need to know its technical structure – you need to really know what it represents to your employer or client.
- Know how your data is being used. Talk to your users and find out how they create, curate, and report on the data. Ask them about any known outliers in the data. This will reveal a great deal about what the data means (see the previous bullet).
- Pay special attention to ETL, reporting, and analytics. These areas are hotspots where a misinterpretation of a potentially null value can be especially problematic.
- Have a solid, documented data mapping strategy. Given that many of the problems in this realm are rooted in moving data, having a well-documented and visible source-to-target mapping can help to clarify how these special values should be interpreted during data movement and transformation.
- Know where the quicksand is. Sometimes, an empty string and a null are synonymous by design. Solve the problems that need to be solved, and don’t get stuck working on those that don’t.
The null/empty string/zero problem isn’t going away anytime soon, but simply being aware that it’s out there can help avoid misinterpretations leading to bad data.
The post Null, empty string, or zero? appeared first on Tim Mitchell.