Note: This article was originally written in 1999, and not much has changed.
So, what the heck is XML, anyway?
The amount of hype pushing XML (eXtensible Markup Language) is amazing,
even for the software industry. Bowstreet claims that its Business Web
Factory "Integrates business processes (your own and your partners')
seamlessly into customized business webs using XML". IBM has an XML
evangelist. Microsoft "supports" XML in Internet Explorer 5.
I knew it was bad when my wife got a taxi receipt with the following
message on it (I've saved it as a souvenir):
IF YOU DON'T HAVE NATIVE XML, YOU'RE BEING TAKEN FOR A RIDE.
TAMINO--SOFTWARE AG
XML isn't a programming language. XML doesn't have anything to do with
the web, in and of itself. It isn't really even a markup language.
Technically speaking, it is a meta-language, or a language language. It
is a set of guidelines for creating a markup language, or, more
specifically, guidelines for creating your own set of tags. According to
www.xml.com, "A markup language is a mechanism to identify structures
in a document. The XML specification defines a standard way to add
markup to documents."
Practically speaking, it is a data modeling language. It is a decent
(though not problem-free) way for organizations to specify their data
and object structures in a platform-independent manner. Done correctly,
an organization could use XML to easily transfer data between systems
from different software and hardware vendors.
It's just very, very, very hard to do it correctly.
A bit of background. The grandfather of markup languages is SGML
(Standard Generalized Markup Language). SGML has been around for quite
a while, and has been used successfully. However, its complexity
creates some programming difficulties.
Meanwhile, the HyperText Markup Language (HTML) was enjoying
widespread use, mainly as a way to lay out pages in a browser. Even
though it was an application of SGML, it suffered from the opposite
problem: it was too limited for what people wanted to do. So, XML was
born as a compromise.
XML is closer to SGML than to HTML in that it is extremely generic. XML
doesn't have tags itself; it just tells you how to make tags. A
slightly humorous example of this appeared in a PDF developer mailing
list that I subscribe to. One person asked if there were any tools to
convert PDF to XML. After explaining the apples/oranges thing, another
subscriber created a well-formed XML document:
<?xml version="1.0"?> <pdf> (the binary .PDF file) </pdf>
The above is a legitimate conversion of PDF to XML. It is, however,
completely and utterly useless.
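One caveat on the joke: raw binary bytes would not get past an XML parser, since many byte values are not legal XML characters. Here is a minimal sketch of what such a "conversion" might look like in practice, assuming the bytes are base64-encoded first. The element name pdf comes from the mailing-list example; the function names and the sample bytes are made up for illustration.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_pdf_in_xml(pdf_bytes):
    """Wrap arbitrary bytes in a single <pdf> element as base64 text."""
    root = ET.Element("pdf")
    root.text = base64.b64encode(pdf_bytes).decode("ascii")
    return '<?xml version="1.0"?>' + ET.tostring(root, encoding="unicode")


def unwrap_pdf_from_xml(document):
    """Parse the document and recover the original bytes."""
    root = ET.fromstring(document)
    return base64.b64decode(root.text or "")


# Placeholder bytes standing in for a real PDF file (an assumption).
fake_pdf = b"%PDF-1.3\n\x00\x01\x02 binary junk \xff\xfe"
document = wrap_pdf_in_xml(fake_pdf)

root = ET.fromstring(document)
print(root.tag)          # the only structure the document exposes
print(len(list(root)))   # number of child elements under <pdf>
```

The document parses and the bytes survive the round trip, but the markup says nothing about pages, text, or fonts. All of the information is still locked inside one opaque blob.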
To use XML well, you need to create a Document Type Definition (DTD).
The DTD defines the tags that make up the document, as well as how each
tag relates to the others. It is where the data model is created, and
where most of the difficulties of using XML to transfer data lie.
In many ways, the data model is the guts of any application. The reasons
why the software you are using exists can be found in the data model.
Modeling how one piece of data relates to another is akin to determining
what frame a car will have. The car's width determines how heavy it
will be, which determines the engine and transmission it will have,
and so on.
To give a real life example, I work for Novartis Pharmaceuticals,
developing software for clinical trials. In clinical trials, which are
really massive, outrageously expensive experiments, we start out with a
compound. A compound is a chemical that may or may not have medicinal
use. Aspirin is a compound, as are tamoxifen, penicillin, and so on.
You can use compounds for several reasons, or indications. For example,
you can use aspirin to get rid of a headache, or you can use aspirin to
prevent heart attacks. A compound plus an indication is called a
project.
Now, to test whether aspirin prevents heart attacks, an experiment is
designed, which we call a trial. There can be several ways to test
whether aspirin prevents heart attacks.
In data modeling speak, compounds, projects, and trials are entities.
These entities have attributes. A compound has a chemical name, a trial
has a visit schedule. At Novartis, a project has an attribute called the
galenical aspect. This refers to the method of delivering the compound:
a shot, a pill, in toothpaste, whatever. In our model, a project must
have one and only one delivery method. Let's say, for the sake of argument,
that Merck allows a project to have more than one compound delivery
method.
Now, let's say that Novartis decides to publish a clinical trial DTD
which allows a project to have only one delivery method, and gets its
DTD accepted as the industry standard. Implementation of Novartis's
standard could then result in Merck having to change its clinical trial
budgeting process, as different trials are moved into different projects
because one gives the patients aspirin in pill form, and the other gives
the patients suppositories.
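To make the disagreement concrete, here is a minimal sketch. The element and attribute names (project, delivery_method, trial) and both content models are invented for illustration; they are not taken from any real Novartis or Merck DTD. Python's standard library parsers do not validate against a DTD, so the two content models appear only as comments and the regrouping is done by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD content models, shown for comparison only:
#   one and only one delivery method per project:
#     <!ELEMENT project (delivery_method, trial+)>
#   delivery method recorded per trial, any mix within a project:
#     <!ELEMENT project (trial+)>
#     <!ATTLIST trial delivery_method CDATA #REQUIRED>

# A project as the looser model would allow it (made-up data).
MIXED_PROJECT = """
<project compound="aspirin" indication="heart attack prevention">
  <trial id="T1" delivery_method="pill"/>
  <trial id="T2" delivery_method="suppository"/>
  <trial id="T3" delivery_method="pill"/>
</project>
"""


def split_into_single_method_projects(project_xml):
    """Regroup trials so that each project has exactly one delivery method."""
    source = ET.fromstring(project_xml)
    by_method = {}
    for trial in source.findall("trial"):
        method = trial.get("delivery_method")
        if method not in by_method:
            # Start a new project that copies the compound and indication.
            target = ET.Element("project", dict(source.attrib))
            ET.SubElement(target, "delivery_method").text = method
            by_method[method] = target
        ET.SubElement(by_method[method], "trial", {"id": trial.get("id")})
    return list(by_method.values())


projects = split_into_single_method_projects(MIXED_PROJECT)
for project in projects:
    print(ET.tostring(project, encoding="unicode"))
```

One project becomes two, and anything keyed on the project (budgets, in this story) has to be reworked to match. The code is the easy part.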
That is a trivial example. Just wait until you get to things like
Adverse Events (side effects), or patient demographics, or medical
history; you are guaranteed to find some fundamentally irreconcilable
points of view, neither of which is wrong. What information is a trial
investigator required to get from a patient? Not a trivial issue.
OK, so let's say that you've managed to get in-house or external
competitors to cooperate, got all those thorny issues settled, and are
ready to use XML to transfer data. How are you going to look at it? It's
not really practical to use a text editor to view the XML, as it is all
marked up. It would be like looking at HTML source, only more so. You've
got to write some XSL (eXtensible Stylesheet Language) code, which is a
programming language for an XML=>HTML converter. That'll take you some
time, both in the writing, and the execution. It'll really be fun to
look at if you have a slow connection.
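For a sense of the work involved, here is a rough sketch of the kind of XML-to-HTML transformation an XSL stylesheet would describe. It is written in Python rather than XSL, as a stand-in, because Python's standard library has no XSLT processor. The project markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up project markup; element and attribute names are assumptions.
PROJECT_XML = """
<project compound="aspirin">
  <trial id="T1"><title>Pill, daily dose</title></trial>
  <trial id="T2"><title>Suppository, weekly dose</title></trial>
</project>
"""


def project_to_html(project_xml):
    """Render a project and its trials as a small HTML table."""
    project = ET.fromstring(project_xml)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    heading = ET.SubElement(body, "h1")
    heading.text = "Trials for " + project.get("compound", "unknown")
    table = ET.SubElement(body, "table")
    for trial in project.findall("trial"):
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "td").text = trial.get("id")
        ET.SubElement(row, "td").text = trial.findtext("title", default="")
    return ET.tostring(html, encoding="unicode", method="html")


page = project_to_html(PROJECT_XML)
print(page)
```

Every new tag in the DTD means another rule like the ones in the loop, and every document has to pass through the converter before anyone can read it.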
Now that you've defined it, and now that you can look at it, you're
probably going to want to do something with it (if all you want to do
is look at it, you could have kept it in HTML to start with). Perhaps
you'll want to convert your existing processes to handle XML, or
create whole new ones from scratch.
Whatever.
Not that XML itself is a perfect way to transfer data. While it's
generally good enough, there are some flaws that show up in the real
world.
The first is that it structures data as a hierarchy. In the example
above, a compound would have one or more projects, which would have one
or more trials, etc. However, almost all client-server and thin-client
applications use relational databases, where data is stored in
tables with rows and columns. They're not the same shape. To store XML
data in a relational database, it'll have to be converted back and
forth. Not a big problem, but yet another obstacle to overcome.
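The conversion can be sketched roughly as follows, using Python's built-in sqlite3 module as a stand-in for whatever relational database is actually in use. The markup, table names, and column names are all invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up compound/project/trial hierarchy.
COMPOUND_XML = """
<compound name="aspirin">
  <project indication="headache">
    <trial id="T1"/>
  </project>
  <project indication="heart attack prevention">
    <trial id="T2"/>
    <trial id="T3"/>
  </project>
</compound>
"""


def load_into_tables(compound_xml, conn):
    """Flatten the tree into three tables linked by foreign keys."""
    conn.executescript(
        """
        CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE project (id INTEGER PRIMARY KEY,
                              compound_id INTEGER REFERENCES compound(id),
                              indication TEXT);
        CREATE TABLE trial (id TEXT PRIMARY KEY,
                            project_id INTEGER REFERENCES project(id));
        """
    )
    compound = ET.fromstring(compound_xml)
    cur = conn.execute(
        "INSERT INTO compound (name) VALUES (?)", (compound.get("name"),)
    )
    compound_id = cur.lastrowid
    for project in compound.findall("project"):
        cur = conn.execute(
            "INSERT INTO project (compound_id, indication) VALUES (?, ?)",
            (compound_id, project.get("indication")),
        )
        project_id = cur.lastrowid
        for trial in project.findall("trial"):
            conn.execute(
                "INSERT INTO trial (id, project_id) VALUES (?, ?)",
                (trial.get("id"), project_id),
            )


conn = sqlite3.connect(":memory:")
load_into_tables(COMPOUND_XML, conn)

# Nesting in the XML becomes joins in the database.
rows = conn.execute(
    """
    SELECT c.name, p.indication, t.id
    FROM trial t
    JOIN project p ON p.id = t.project_id
    JOIN compound c ON c.id = p.compound_id
    ORDER BY t.id
    """
).fetchall()
print(rows)
```

Each level of nesting turns into its own table plus a foreign key, and going the other way means rebuilding the tree from joins.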
A more serious problem is the file size. Again, according to
www.xml.com, "Terseness in XML markup is of minimal importance." Well,
maybe in a George Gilder dream, but in the real world, file size does
matter. A lot. Bandwidth is not free, and XML is definitely not terse.
For example, the SAS Institute is considering using XML to store their
datasets. However, the FDA sets a 25 Meg file size limit on datasets in
a new drug submission. Currently, usually only lab datasets are affected
by this limit. If XML is used, many more datasets would hit this barrier
and have to be split up, causing much angst and gnashing of teeth.
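A quick way to get a feel for the overhead is to write the same records out twice and compare byte counts. The sketch below uses CSV as the compact baseline, which is an assumption made for convenience; it is not the SAS dataset format, and the lab records and field names are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up lab records; the field names are illustrative only.
records = [
    {
        "patient": str(1000 + i),
        "test": "ALT",
        "value": str(20 + i % 15),
        "unit": "U/L",
    }
    for i in range(1000)
]


def as_csv(rows):
    """Serialize the records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def as_xml(rows):
    """Serialize the records as XML, one element per field."""
    root = ET.Element("dataset")
    for row in rows:
        record = ET.SubElement(root, "record")
        for name, value in row.items():
            ET.SubElement(record, name).text = value
    return ET.tostring(root, encoding="utf-8")


csv_size = len(as_csv(records))
xml_size = len(as_xml(records))
print(csv_size, xml_size, round(xml_size / csv_size, 1))
```

With short values like these, the opening and closing tags outweigh the data they wrap, so the XML version comes out several times larger. Real datasets with longer values would show a smaller ratio, but the tags never come for free.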
Data transfer standards are fine, and XML is as good a choice as any.
Just don't think that it won't take its pound of flesh, just like every
other technology known. My guess is that it will be one of those
technologies of the future that always remain so.