Today we have a guest editorial as Steve is out of town.
There is an old quip that goes something like this: "It's not that he doesn't know anything, it's that he knows so much that isn't so." This pretty accurately sums up what so often passes for common wisdom in the database world. Over the next few posts I want to address some phrases that, if you've been around databases for more than a couple of years, you have heard tossed around as if they were axioms, but which on serious reflection are nonsense.
Today I'm going to explore "unstructured data." The term is an oxymoron. There is no such thing and never can be. Truly unstructured data is literally nonsense: a bunch of 1's and 0's in no particular order that means nothing. It's white noise. If it's unstructured, it's not data, and if it's data, it cannot possibly be unstructured.
The common examples of unstructured data are things like books, documents, videos, etc. All those formats are HIGHLY structured and must remain so if they are to retain their ability to convey information. In fact, when people say "unstructured data," what they really ought to say is "complex structured data." What they intend to convey is the idea that since the data contained in the book or video is so complex and organized internally, we don't need the database to manage that structure. The same is true of nearly every data element, even the very simple data types: the database does not manage what is inside the value.
Let's take an integer data type; it doesn't get much simpler than that. The DBMS has no way of "knowing" whether the data stored in an integer column is correct (should it be 123 or 321?). The data has to be assumed to be accurate and consistent within itself. The DBMS can only ensure that the data it stores is plausible, i.e., that it is in fact an integer and that it meets any constraint criteria that have been defined for it.
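To make the point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `stock`, the column `quantity`, and the helper `try_insert` are made up for illustration. Because SQLite is loosely typed by default, the sketch assumes an explicit CHECK on `typeof()` to get the "is it really an integer" test that many other DBMSs apply automatically.

```python
import sqlite3

def try_insert(conn, value):
    """Return True if the DBMS accepts the value, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO stock (quantity) VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
# Hypothetical table: quantity must be an integer and must not be negative.
conn.execute(
    "CREATE TABLE stock ("
    " quantity INTEGER NOT NULL"
    " CHECK (typeof(quantity) = 'integer' AND quantity >= 0))"
)

# 123 and 321 are both plausible, so both are accepted; the DBMS cannot
# know which one is the correct count. -5 and 'abc' are implausible
# under the declared constraints and are rejected.
results = {value: try_insert(conn, value) for value in (123, 321, -5, "abc")}
print(results)
```

The database happily stores both 123 and 321. Plausibility is all it can check; correctness is somebody else's job.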
Sometimes when people talk about unstructured data they refer to the idea that they don't know what kind of data they are dealing with and thus can't define a schema in which it can be effectively held. We often hear of the benefits of "schemaless" databases in the brave new world of "big data" or NoSQL. It's true that some NoSQL database technologies (notably document stores and key-value databases) do not require one to define all the data in the documents stored in them. This is touted as being highly flexible, and it is to some extent. However, if there has been no effort to define what the data is, good luck querying it back out.
The fact of the matter is that a schema gives context and meaning to data. A truly schemaless database does approach unstructuredness, but the closer you get to that, the more difficult it becomes to make any sense of the data. Unless the intent is simply to store data, a schema is required. Whether that schema is an integral part of the DBMS, as in relational, hierarchical, and graph databases, or whether it is merely implied, as in document, key-value, and column-family databases, it must still exist for the data to be useful.
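A small sketch shows where the implied schema ends up living. It assumes nothing about any particular NoSQL product: a plain Python list of dicts stands in for a document collection, and the field names and the `adults` function are invented for the example.

```python
# Stand-in for a "schemaless" document collection: nothing is declared up front.
documents = [
    {"name": "Ada", "age": 36},
    {"full_name": "Grace", "age": "45"},  # different key, age stored as text
    {"name": "Linus"},                    # age missing entirely
]

def adults(docs):
    """Return names of people aged 18 or over.

    The query only works for documents that follow an unwritten schema:
    a 'name' key, and an 'age' key holding an integer.
    """
    return [
        doc["name"]
        for doc in docs
        if isinstance(doc.get("age"), int) and doc["age"] >= 18 and "name" in doc
    ]

print(adults(documents))  # ['Ada']
```

Only one of the three documents comes back, even though at least two describe adults. The store accepted everything; the schema simply moved out of the database and into the assumptions of every piece of code that reads from it.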
One of the stated advantages of the implied-schema technologies is the ability to change the implied schema at any time without affecting the data that has been previously stored. However, a moment's reflection should sound alarm bells. Yes, you can, but the real question is: should you? It's often stated that data is the lifeblood of business. Would it be awesome if we could change our blood type at a whim? What if someone else could do it for you without your knowledge? It may be kind of cool, and there are certain circumstances in which it might be a lifesaver, but it could also be lethal if the rest of your body, which depends on that blood, isn't changed at the same time. Or if the blood were changed to oil. Precisely the same thing can happen to your data in these implied-schema databases.
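Here is a hedged sketch of the blood-type problem, again using nothing but the Python standard library. The documents, the `price` field, and the `total_price` reader are hypothetical; the point is only that a writer can change the implied schema while a reader elsewhere still depends on the old one.

```python
import json

# Documents written under the original implied schema: price is a plain number.
old_docs = [json.dumps({"sku": "A1", "price": 9.99})]

# Later, a writer changes the implied schema without telling anyone:
# price becomes a nested object.
new_docs = [json.dumps({"sku": "B2", "price": {"amount_cents": 1250, "currency": "USD"}})]

def total_price(raw_docs):
    """Reader written against the old implied schema (price is a number)."""
    return sum(json.loads(raw)["price"] for raw in raw_docs)

print(total_price(old_docs))  # works on the old data

try:
    total_price(old_docs + new_docs)
    broke = False
except TypeError as exc:
    # The store accepted both shapes; the failure surfaces only in the reader.
    broke = True
    print("reader failed:", exc)
```

This is the lucky case, because the reader fails loudly. A quieter change, such as switching the same numeric field from dollars to cents, would raise no error at all and would simply produce wrong totals.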
As professionals, we have to be able and willing to think critically and be aware of the tradeoffs involved rather than blindly rushing toward the next big thing. We should not allow ourselves to unthinkingly parrot marketing slogans. There are genuine arguments to be made for using one technology or another, but "unstructured data" isn't one of them. It's a term that should not be used. It does convey some meaning due to common usage, but it also conveys much that just isn't so.