Author's Note
When the seed of this article first sprouted in my mind, XML was the biggest buzzword of the day.
- XML would lead us all into a brave new world where systems would be easily integrated by virtue of those magical little tags, and if XML was good for that just think how wonderful it would be if all data were in an XML format.
- XML database management systems would surely be the next evolutionary step that would finally eliminate the dreaded "impedance mismatch."
- Every self-respecting software vendor just HAD to be using XML everywhere!
Anything reported to be that wonderful naturally arouses my suspicions, so I set off to find out what the fuss was all about. After many months of research and testing I came to the conclusion that not only is XML not "the answer" to any universal data exchange/management problem, it is largely a stupid answer to all the wrong questions!
More than that, it became clear that XML is in fact a giant leap in the wrong direction, perpetrated by those who have no understanding of data management fundamentals, upon those who have no understanding of data management fundamentals.
The original article was initially an internal document that I sent to management in an attempt to curtail the practice of our programmers stuffing XML into large varchar fields in the database. I'm happy to report that my effort was largely successful. Our company policy was changed so that programmers could use XML for communication, but it would not be stored in a database as such. Unfortunately, many of our vendors had remade their products in the image of XML, and without exception the problems we faced, from increased storage space to bloated log files to hugely inefficient queries, just reinforced my views about XML.
At my manager's suggestion, I cleaned up the document and submitted it for publication to SQL Server Central. I never thought it would be published because it was so contrarian; some would say inflammatory. To my surprise, it was published, and an even greater surprise was the large number of thoughtful and supportive comments I received. (I expected to get nothing but flamed.)
Since that time XML has ceased to be the hottest industry buzzword, but unfortunately it has taken strong root in almost every aspect of data management. I, and many others who are much smarter than I am, still strongly maintain that XML solves NONE of the problems that its proponents claim and that, if one bothers to do more than a cursory examination, XML actually causes more problems than it could possibly ever hope to solve.
When I found out that the article was scheduled to be republished, I asked Steve if I could make some revisions to it. I re-checked the links and I encourage the reader to read them, as they form much of the foundation for the article's conclusions (several of them have been updated recently). What follows is the original article with a few minor revisions which I hope clarify some points and amplify others.
The conclusions, however, remain the same.
Introduction
Despite the breathless marketing claims being made by all the major vendors and the natural desire to keep skills up-to-date, it would be prudent to examine exactly what advantages are offered by XML and compare them to the costs before jumping headlong into the XML pool.
The idea of XML is simple enough; basically just add tags to a data file. These tags are sometimes referred to as metadata. XML is inherently and strongly hierarchical. The main benefits are touted as being:
- Self describing data
- Facilitation of cross-platform data sharing or "loose coupling" of applications
- Ease of modeling "unstructured" data
Self-describing
At first, the idea of "self-describing data" sounds great, but let's look at it in detail. A classic example of the self-describing nature of XML is given as follows:
<Product>
Shirt
<Color>Red</Color>
<Size>L</Size>
<Style>Hawaiian</Style>
<InStock>Y</InStock>
</Product>
One possible equivalent text document could be as follows:
Red,L,Hawaiian,Y
Anyone can look at the XML document and infer the meaning of each item; not so for the equivalent simple text document. But is this truly an advantage? After all, it's not people we want to read the data files, it's a machine. Which is more efficient for a machine to read or generate? Which makes better use of limited network bandwidth? The XML file is more than six times the size of the plain text file. In my experience XML files will tend to be around 4-5 times the size of an equivalent delimited file. Due to the bloated nature of XML, hardware vendors are actually offering accelerators to compensate. Worse yet, there are more and more non-standard XML parsers being written to "optimize" XML, thus completely destroying any illusion of "compatibility." (See http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2896005,00.html)
Some have argued that compression tends to nullify the bloat factor of XML. It shouldn't take too much contemplation to figure out that this is pure nonsense. If you start off with a file that is 4 times the size of an alternative equivalent and then compress it, you will be lucky to get down to the same size as the non-XML file. And what if you compress the non-XML file as well? Besides, this argument assumes that compression is "free." Compression does save bandwidth and disk space, but it costs processing time.
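The size comparison is easy to check for yourself. What follows is only a rough sketch in Python, assuming UTF-8 encoding and one newline between the lines of the XML example. The build_batch helper and its made-up colors, sizes and styles are purely illustrative and are not taken from any real system.

```python
import zlib

# The two documents from the example above, reproduced as written.
xml_doc = "\n".join([
    "<Product>",
    "Shirt",
    "<Color>Red</Color>",
    "<Size>L</Size>",
    "<Style>Hawaiian</Style>",
    "<InStock>Y</InStock>",
    "</Product>",
])
csv_doc = "Red,L,Hawaiian,Y"

xml_size = len(xml_doc.encode("utf-8"))
csv_size = len(csv_doc.encode("utf-8"))
print(xml_size, csv_size, round(xml_size / csv_size, 1))  # 105 16 6.6


def build_batch(n):
    """Build n hypothetical product records in both formats (made-up values)."""
    colors = ["Red", "Blue", "Green", "Black", "White"]
    sizes = ["S", "M", "L", "XL"]
    styles = ["Hawaiian", "Oxford", "Polo"]
    xml_rows, csv_rows = [], []
    for i in range(n):
        color = colors[i % len(colors)]
        size = sizes[i % len(sizes)]
        style = styles[i % len(styles)]
        in_stock = "Y" if i % 2 == 0 else "N"
        xml_rows.append(
            f"<Product>Shirt<Color>{color}</Color><Size>{size}</Size>"
            f"<Style>{style}</Style><InStock>{in_stock}</InStock></Product>"
        )
        csv_rows.append(f"{color},{size},{style},{in_stock}")
    return "\n".join(xml_rows).encode("utf-8"), "\n".join(csv_rows).encode("utf-8")


# Compress BOTH formats, so the comparison is like for like.
xml_batch, csv_batch = build_batch(1000)
for label, payload in (("xml", xml_batch), ("csv", csv_batch)):
    packed = zlib.compress(payload)
    print(label, "raw:", len(payload), "compressed:", len(packed))
```

The compressed figures depend entirely on the data. This synthetic batch is far more repetitive than real data, so treat its numbers as a way to run the experiment on your own files and not as a benchmark. The sketch also does not measure the processing time that compression costs.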
Communication facilitation
The self-documenting nature of XML is often cited as facilitating cross-application communication because, as humans, we can look at an XML file and make reasonable guesses as to the data's meaning based on hints provided by the tags. Also, the format of the file can change without affecting that communication because it is all based on tags rather than position. However, if the tags change, or don't match exactly in the first place, the communication will be broken. Remember that, at least for now, computers are very bad at guessing.
Regardless of the specific means used, communication across systems requires two things:
- Agreement on what will be sent (what the data means), and
- Agreement on a format.
Using XML does not alter this requirement and, despite claims to the contrary, it doesn't make it any easier. In order to effect communication between systems with a text file, both the sender and receiver must agree in advance on what data elements will be sent (by extension, this mandates that the meaning of each attribute is defined), and on the position of each attribute in the file. When using XML, each element must be defined and the corresponding tags must be agreed upon. Note that tags in and of themselves are NOT sufficient to truly describe the data and its meaning, which of necessity includes the business rules that govern the data's use. They could only become sufficient if a universal standard were created to define the appropriate tag for every possible thing that might be described in an XML document, and that standard were rigorously adhered to. (See http://www.well.com/~doctorow/metacrap.htm) That XML tags are supposed to be "self-describing" has led many to wrongly assume that their particular tags would correctly convey the exact meaning of the data. At best, tags alone convey an approximate meaning, and approximate is not good enough. In fact it has been noted that XML tags are metadata only if you don't understand what metadata really is. (See http://www.tdan.com/i024hy01.htm)
No matter the method of data transmission, the work of correctly identifying data and its meaning is the same. The only thing XML "brings to the table" in that regard is a large amount of overhead on your systems.
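A short sketch makes the point about agreement concrete. It uses Python's standard xml.etree.ElementTree module. The read_product function and the Colour/Color mismatch are hypothetical, invented only to show what happens when sender and receiver have not agreed on the tags.

```python
import xml.etree.ElementTree as ET


def read_product(xml_text):
    """Hypothetical receiver: looks up only the tags it was told to expect."""
    root = ET.fromstring(xml_text)
    return {
        "color": root.findtext("Color"),  # None when the tag is absent
        "size": root.findtext("Size"),
    }


# Reordering the elements is harmless, because lookup is by tag name.
reordered = "<Product><Size>L</Size><Color>Red</Color></Product>"
print(read_product(reordered))  # {'color': 'Red', 'size': 'L'}

# A sender that spells one tag differently silently breaks the exchange.
mismatched = "<Product><Colour>Red</Colour><Size>L</Size></Product>"
result = read_product(mismatched)
print(result)  # {'color': None, 'size': 'L'}
```

The receiver cannot guess that Colour means Color. The agreement on names has to exist before the first document is sent, exactly as the agreement on positions must exist for a delimited file.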
Unstructured data
The very idea of unstructured or semi-structured data is an oxymoron. Without a framework in which the data is created, modified and used, data is just so much gibberish. At the risk of being redundant, data is only meaningful within the context of the business rules in which it is created and modified. This point cannot possibly be overemphasized. A very simple example to illustrate the point follows: the data '983779009-9937' is undecipherable without a rule that tells me that it is actually a valid part number. Another example often thrown about by XML proponents is that of a book. A book consists of sections, chapters, paragraphs, words, and letters all placed in a particular order, so don't tell me that a book is unstructured!
Again, what benefit does XML confer? None. The data still must be modeled if the meaning is to be preserved, but XML is inherently hierarchical and imposes that nature on the data. In fact it has been noted that XML is merely a return to the hierarchical databases of the past. The problem is that not all data is actually hierarchical in nature. The relational model of data is not inherently hierarchical but it is certainly capable of preserving hierarchies that actually do exist. Hierarchies are not neutral, so a hierarchy that works well for one application, or one way of viewing the data, could be totally wrong for another, further eroding data independence. (See http://www.geocities.com/tablizer/sets1.htm)
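The remark that hierarchies are not neutral can be illustrated with a small sketch. It uses SQLite through Python's sqlite3 module purely as a convenient stand-in. The product table, its columns and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, color TEXT, style TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [
        ("Shirt", "Red", "Hawaiian"),
        ("Shirt", "Blue", "Hawaiian"),
        ("Shirt", "Red", "Oxford"),
    ],
)

# One application wants the data viewed by style...
by_style = conn.execute(
    "SELECT style, COUNT(*) FROM product GROUP BY style ORDER BY style"
).fetchall()
print(by_style)  # [('Hawaiian', 2), ('Oxford', 1)]

# ...another wants it viewed by color. Neither view is privileged.
by_color = conn.execute(
    "SELECT color, COUNT(*) FROM product GROUP BY color ORDER BY color"
).fetchall()
print(by_color)  # [('Blue', 1), ('Red', 2)]
```

An XML document that nested products under Style elements would serve the first query naturally and force the second to walk and regroup the entire tree. The flat table favors neither.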
Herein lies the real problem. No matter how bloated and inefficient XML may be for data transport, it is downright scary when it is used for data management. Hierarchical databases went the way of the dinosaur decades ago, and for good reason; they are inflexible and notoriously difficult to manage.
I can understand why many object-oriented programmers tend to like XML. Both OO and XML are hierarchical and if you are used to thinking in terms of trees and inheritance, sets can seem downright alien. This is one of the fundamental problems with the OO paradigm and it's about time that data management professionals educate themselves about the fundamentals. Set theory and predicate logic (the foundations of the relational model of data) have been proven superior to hierarchical DBMSs, which are based on graph theory. Why is it that data integrity always seems to take a back seat whenever some programmer cries about the perceived "impedance mismatch" between OO and relational data? Why is it that the "fault" is automatically assumed to lie with the database technology? (See http://www.geocities.com/tablizer/oopbad.htm)
What I am seeing is a push from many development teams to store raw XML in the database as a large varchar or text column. This turns the database into nothing more than a simple staging ground for XML. This, of course, violates one of the first principles of database design: atomicity, or one column, one value (speaking very loosely). How can a DBMS enforce any kind of integrity on a single column containing raw XML? How do I know that the various XML strings stored in a given table are even related? Indexing and optimization using such a scheme is impossible. (Note: When this article was first published MS was just beginning to hint that "Yukon" would include an XML data type. Now SQL Server 2005 is at the door promising to support "XML Goodness" natively in the database. Regardless of how well MS implements the XML data type in terms of compression and indexing, the basic fact is that XML is suited only for hierarchical data and is not suited for general data management.)
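The integrity question can be made concrete with a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for any SQL DBMS and does not represent SQL Server specifically. The table names, columns and allowed values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Design A: raw XML stuffed into a single text column.
conn.execute("CREATE TABLE product_xml (doc TEXT)")

# Design B: one column per attribute, with declared constraints.
conn.execute(
    """
    CREATE TABLE product (
        color    TEXT NOT NULL,
        size     TEXT NOT NULL CHECK (size IN ('S', 'M', 'L', 'XL')),
        in_stock TEXT NOT NULL CHECK (in_stock IN ('Y', 'N'))
    )
    """
)

# The DBMS sees only a string here, so nonsense values are accepted.
bad_xml = "<Product><Size>Gigantic</Size><InStock>Maybe</InStock></Product>"
conn.execute("INSERT INTO product_xml VALUES (?)", (bad_xml,))

# The same nonsense is rejected when each value has its own column.
try:
    conn.execute("INSERT INTO product VALUES (?, ?, ?)", ("Red", "Gigantic", "Maybe"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

xml_rows = conn.execute("SELECT COUNT(*) FROM product_xml").fetchone()[0]
good_rows = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
print(xml_rows, good_rows, rejected)  # 1 0 True
```

With the XML design, every rule about sizes and stock flags has to be re-implemented in each application that touches the column. With the columnar design the rule is declared once and enforced by the DBMS.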
Vendors
Why are the major hardware and software vendors so excited about XML if it is so bad? There are several possibilities:
1. Ignorance. Let's face it; marketing departments drive product development, and marketing departments like nothing more than for their products to be fully buzzword compliant. Ignorance on the part of the consumer community (you and me) allows us to get swayed by all the marketing-speak.
2. Stupidity. The technical "experts" are often ignorant as well, only they have no excuse, so I call it stupidity. I spent several hours at the last SQL PASS Summit trying to find someone on the SQL Server product team who could provide a single good reason to use XML. By the end of the conversation there were at least five "experts" around the table, all unable to make their arguments hold up to the scrutiny of reason. Some of the answers they gave were shockingly stupid. One of these "experts" stated that the biggest benefit of XML is to allow programmers to easily adapt a database to changing needs by "loading" columns with multiple attributes of which the database is unaware! At that point I knew that I was wasting my time. After two hours, I left with a growing sense of alarm about the future direction of SQL Server. I'm sure they were glad to see me go so they could get back to their fantasy world of XML nirvana. Instead of taking steps to more fully implement the relational model, they and other vendors are chasing their tails trying to implement a failed idea from decades past.
3. Greed. I recently read an article extolling the virtues of XML. In it the author claimed that companies are finding "XML enriches their information capabilities, it also results in the need for major systems upgrades." (See http://www.dbta.com/frontpage_archives/9-03.html) Interestingly, the author does not define or quantify just how XML "enriches" anyone but the software and hardware vendors.
However you choose to look at it, the major vendors do not necessarily have your best interests at heart. I'm sure that when XML is finally recognized for the bad idea that it is, they will gladly help you clean up the mess...for a price.
Conclusion
Do not be fooled by the fuzzy language and glitzy marketing-speak.
As data management professionals you have a responsibility to safeguard your company's data, and you can't possibly do that effectively if you don't know, or ignore, the fundamentals. Pick up An Introduction to Database Systems by Chris Date and Practical Issues in Database Management by Fabian Pascal and get yourself solidly grounded in sound data management principles.
The alternative? Spend your time riding the merry-go-round chasing after the latest industry fad, which happens to be last year's fad and so on... throwing money at vendors and consultants with each turn of the wheel.