Steve Jones is traveling today and we have a guest editorial from Phil Factor.
Database people have a natural suspicion of the quality of data until they’ve proven to their satisfaction that it is OK. If the data is bad, then the inferences from it will be wrong. I use the term ‘data’ in the broad sense. I’ve always assumed, though, that a photocopy or a scanned image is likely to faithfully represent the original document, or at least give some evidence of failure if something went wrong with the process: Not so, it seems.
It seems that a scanner/photocopier that uses industry-standard patch-based lossy compression for images can, under very specific circumstances, get numbers wrong when reproducing a column of figures. David Kriesel, who brought this to general attention recently, discovered by accident that a Xerox WorkCentre 7535 scanner/photocopier can randomly replace a patch of the scanned image with another patch that looks very similar. He even found an example where ‘17,42’ and ‘21,11’ were both replaced by ‘14,13’. To a machine, a 6 is very similar to an 8 or a 5, but that wouldn’t hurt unless the height of the digits matched the patch height used in the algorithm. It seems that wherever a scanning or copying machine uses the JBIG2 compression algorithm in lossy mode, there is a danger that it will replace whole blocks of an image with other blocks whose pattern or color is similar.
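To see how a patch-based lossy scheme can swap one digit for another, here is a minimal toy sketch in Python. It is not the JBIG2 algorithm and not what the Xerox firmware does; the 3x5 glyph bitmaps, the `max_diff` threshold and all function names are assumptions invented for illustration. The idea it shows is only the general one described above: each patch is compared with patches already stored, and if one is "similar enough", the stored patch is reused in place of the real one.

```python
from typing import Dict, List, Tuple

# A patch is a tiny bitmap: a tuple of rows, '#' = ink, '.' = paper.
Patch = Tuple[str, ...]

# Hypothetical 3x5 glyphs, made up for this illustration.
EIGHT: Patch = ("###", "#.#", "###", "#.#", "###")
SIX: Patch = ("###", "#..", "###", "#.#", "###")
ONE: Patch = (".#.", "##.", ".#.", ".#.", "###")

NAMES: Dict[Patch, str] = {EIGHT: "8", SIX: "6", ONE: "1"}


def hamming(a: Patch, b: Patch) -> int:
    """Count the pixels that differ between two same-sized patches."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def compress(patches: List[Patch], max_diff: int) -> Tuple[List[Patch], List[int]]:
    """Store each patch once; reuse a stored patch if it is within max_diff pixels."""
    dictionary: List[Patch] = []
    indices: List[int] = []
    for patch in patches:
        for i, symbol in enumerate(dictionary):
            if hamming(patch, symbol) <= max_diff:
                indices.append(i)  # lossy step: the real patch is discarded
                break
        else:
            dictionary.append(patch)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress(dictionary: List[Patch], indices: List[int]) -> List[Patch]:
    """Rebuild the page from the stored patches."""
    return [dictionary[i] for i in indices]


def read_back(patches: List[Patch]) -> str:
    """Turn patches back into digits so the result is easy to inspect."""
    return "".join(NAMES[p] for p in patches)


scanned = [EIGHT, SIX, ONE]
print(read_back(decompress(*compress(scanned, max_diff=0))))  # 861
print(read_back(decompress(*compress(scanned, max_diff=1))))  # 881
```

With a threshold of zero the page comes back intact. With a threshold of a single pixel, the 6 is judged close enough to the 8 already in the dictionary, so the output contains a crisp, perfectly legible 8 where the original had a 6, and nothing in the result hints that anything went wrong.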
This isn’t a new technology. JBIG2 has been around since 2000 and is used on many devices. The Xerox WorkCentre 7535 scanner/photocopier on which the effect was discovered is not a new product, and is no longer sold as new, but there are plenty around. The behavior isn’t even a bug: it is actually mentioned in the original manual for the Xerox WorkCentre, in the sections on fax, workflow scanning and email. You can alter the ‘Quality / File Size’ setting when scanning, which lets you trade scan image quality against file size. In the ‘normal’ setting (which is not, evidently, the default setting), the device ‘produces small files by using advanced compression techniques. Image quality is acceptable but some quality degradation and character substitution errors may occur with some originals’ (p107, 129, 179 Xerox® WorkCentre™ 5735/ 5740/ 5745/ 5755/ 5765/ 5775/ 5790 User Guide 2010).
From this, it is apparent that the engineers were aware of the potential risk. A warning evidently also exists in the web interface.
Since the original post, a number of other people have checked and confirmed the problem. It would be odd indeed if this had suddenly started happening with a technology that has been in use for over a decade, and only on one particular machine that has been popular for years and that mentions the risk in its documentation. DBAs are natural pessimists, and familiar with the consequences of data corruption, so they are likely to wonder what other errors in figures have happened in the past, maybe to financial records or medical doses. Where companies have used these devices to scan all their correspondence in a ‘lossy’ format, and then disposed of the original documents, there must be some worries now if they’re involved in any litigation that requires documentary evidence. Where people have scanned using JBIG2 and then OCR’d numeric data from the image for the past few years, it may be time to double-check against the original data if they still have it.
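For anyone who does still hold the original figures, the double-check itself can be very simple. The following Python sketch assumes you already have the original values and the values OCR’d from the scan as two lists of strings in the same row order; the function name and the data layout are assumptions for illustration, and the sample values are the ones from the example quoted earlier.

```python
from typing import List, Tuple


def find_mismatches(original: List[str], scanned: List[str]) -> List[Tuple[int, str, str]]:
    """Return (row number, original value, scanned value) for every row that differs."""
    if len(original) != len(scanned):
        raise ValueError("row counts differ, so the columns cannot be compared line by line")
    return [
        (row, orig, scan)
        for row, (orig, scan) in enumerate(zip(original, scanned), start=1)
        if orig.strip() != scan.strip()
    ]


# Values taken from the example quoted earlier in this editorial.
original_column = ["14,13", "21,11", "17,42"]
scanned_column = ["14,13", "14,13", "14,13"]

for row, orig, scan in find_mismatches(original_column, scanned_column):
    print(f"row {row}: original {orig!r}, scan {scan!r}")
```

A comparison like this only helps where the source data survives; it cannot detect a substitution in a document whose original has been destroyed.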