Tuesday, March 3, 2009

How can you screw up XML? Part II

XML is overly wordy. It has lots of required text.

I think the theory is - if there really is one - that longer words make the computer-code more easily understood. That was certainly true when we were using FORTRAN with 6 character variable names and Basic with 2 characters - but there's a limit. That limit is probably around 32 characters - 1/3 of a screen line - at which point you need some white space and punctuation.

XML doesn't provide for white space, just punctuation. As (semi-)intelligent creatures, we need white space to clump symbols into recognizable things. It seems to be how we parse. Computers don't distinguish, so . . . They can read XML, but we can't. [maybe it's really a plot by the machines and W3 are a bunch of Cylons?]

So maybe the extra stuff is supposed to make XML more reliable? To get there, we have to talk about two slightly different subjects: Bandwidth and Information Theory.

Bandwidth measures how fast we can transmit messages. The bigger the bandwidth, the faster the electronic wiggles are and so the more Bits we can represent in a given chunk of time - say a second. Big Bandwidth is Good.

Effective Bandwidth is the fraction of the real Bandwidth you get to use for your stuff - the content you want to see, transmit, or use - read streaming video from hulu.com. The more non-Content characters in the Protocol [read XML] used to encode your 'stuff', the lower your Effective Bandwidth. [You pay for Real Bandwidth, but you Get Effective Bandwidth - it's kind of like sales tax or Net Income after Income Tax]
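To put a number on that, here's a minimal sketch of the arithmetic. The function name `effective_bandwidth` and the link speed and byte counts are made up for illustration - they're assumptions, not measurements.

```python
def effective_bandwidth(raw_bits_per_second: float,
                        content_bytes: int,
                        total_bytes: int) -> float:
    """Return the part of the raw bandwidth that actually carries content."""
    if total_bytes <= 0 or not 0 <= content_bytes <= total_bytes:
        raise ValueError("need 0 <= content_bytes <= total_bytes and total_bytes > 0")
    # Fraction of each message that is content, applied to the raw rate.
    return raw_bits_per_second * content_bytes / total_bytes


# Hypothetical numbers: a 10 Mbit/s link, 200 bytes of content
# wrapped in a 1000-byte message.
print(effective_bandwidth(10_000_000, 200, 1000))  # 2000000.0
```

In this made-up case you pay for 10 Mbit/s and get 2 Mbit/s of 'stuff'.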

Information Theory studies how to transmit Information in the presence of Noise. It turns out that you can always get your message across accurately - most of the time - if you use a fancy enough code. A Code takes a simple message and adds a lot of extra bits which allow the receiver to tell if the message was messed up [received a 0 where there should have been a 1] and to reconstruct it. When you have more noise, you have to add more reconstruction bits. That makes the message longer. So Information Theory says - if you want reliable communication, you have to allocate some of your Bandwidth to these reconstruction bits - called Redundancy - so that your Effective Bandwidth is lower than your actual Bandwidth.
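The simplest Code I know of is the repetition code: send every bit three times and let the receiver take a majority vote. Here's a rough sketch - the names `encode` and `decode` are mine, and real systems use much cleverer codes than this one.

```python
def encode(bits, copies=3):
    """Repeat every bit `copies` times (the added redundancy)."""
    return [b for b in bits for _ in range(copies)]


def decode(received, copies=3):
    """Recover each bit by majority vote over its group of copies."""
    decoded = []
    for i in range(0, len(received), copies):
        group = received[i:i + copies]
        decoded.append(1 if sum(group) > len(group) // 2 else 0)
    return decoded


message = [1, 0, 1, 1]
sent = encode(message)          # 12 bits on the wire for 4 bits of message

corrupted = list(sent)
corrupted[4] ^= 1               # noise flips one bit in transit

print(decode(corrupted) == message)   # True: the flipped bit is voted out
print(len(message) / len(sent))       # fraction of the bits that are message
```

Only one bit in three is message, so the reliability costs two thirds of the Bandwidth.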

Now let's apply this to XML.

XML adds extra stuff to create a rigid structure for your message. This has Nothing to do with Information Theory because XML is transferred over TCP - which is a Lossless Protocol - meaning that, if the message gets there At All, it's guaranteed to be OK. There's No Noise on TCP. [The TCP protocol has already eaten up the Bandwidth required by Information Theory to get your 'stuff' to you]

The XML extra stuff is there so computer programs can easily parse the message and use its Content once it gets there.

XML's 'extra stuff' shares the Same Bandwidth as the Content inside the XML message.

I ask you: what's more important: the Content or the 'extra stuff'?

Personally, I think the 'extra stuff' should be as small and efficient as possible so we can use as much of the Bandwidth as possible for Content.

W3 must think that the 'extra stuff' is more important than content because they make their protocols as bulky as they can.

Don't believe me? Go to www.w3.org and read some of their specs - try namespaces or the RDF spec or just about anything. All Structure with minimal space for Content.

Why do we put up with this?

How can you screw up XML?

XML is about the simplest thing around. How can you screw it up?

I'm writing some metadata-extraction-from-image code and started looking at Adobe's XMP.

XMP is written using the W3's RDF spec - [almost wrote RDF framework, but that would then be Resource Description Framework framework (RDF2), which might not be the same thing].

RDF defines a 'Framework' for writing machine parseable statements of the form:

Tim has a bike.

RDF calls:

  • Tim the subject

  • has the predicate

  • bike the object
It makes you write URIs for both the subject and predicate - and have translation syntax between real words and the URIs.
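As a rough sketch of what that does to 'Tim has a bike', here's the statement as a plain subject/predicate/object triple. The `Triple` type and both example.org URIs are hypothetical - I made them up for illustration, they're not from any RDF vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs standing in for the real words 'Tim' and 'has'.
statement = Triple(
    subject="http://example.org/people/Tim",
    predicate="http://example.org/terms/has",
    obj="bike",
)

print(statement.subject, statement.predicate, statement.obj)
```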

Here's an example from the RDF Primer. It says: 'http://www.example.org was created on August 16, 1999'
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>August 16, 1999</exterms:creation-date>
  </rdf:Description>
</rdf:RDF>
The RDF is 5.9 times LONGER than the English Language Sentence. That means about 83% of the text is overhead. Or, to put it another way, a BANDWIDTH UTILIZATION of about 17%.
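If you want to check that kind of arithmetic yourself, here's a minimal sketch. It assumes the RDF text is laid out as in the RDF Primer example, including the `exterms` namespace declaration. The exact ratio depends on how much white space you count, so don't expect it to land on exactly 5.9. The helper name `overhead_report` is mine.

```python
SENTENCE = "http://www.example.org was created on August 16, 1999"

# Assumed layout of the RDF Primer example; indentation affects the count.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>August 16, 1999</exterms:creation-date>
  </rdf:Description>
</rdf:RDF>
"""


def overhead_report(content: str, encoded: str) -> dict:
    """Compare the length of the plain content with its encoded form."""
    utilization = len(content) / len(encoded)   # share that is content
    return {
        "ratio": len(encoded) / len(content),   # how many times longer
        "utilization": utilization,
        "overhead": 1 - utilization,            # share that is 'extra stuff'
    }


report = overhead_report(SENTENCE, RDF_XML)
print(f"{report['ratio']:.1f}x longer, "
      f"{report['utilization']:.0%} content, "
      f"{report['overhead']:.0%} overhead")
```

The relationship is simple: something N times longer has a utilization of 1/N, so 5.9 times longer works out to 1/5.9, or about 17%.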

For What Gain?

Nothing. And it takes them 6 LONG RFC-style documents to define this mess. That's 6 Long, Boring documents with much repetition and pedantic phrasing with many MUSTs and SHALLs and MAYs.

But it can be parsed by a machine - if you can understand the spec well enough to write the code.

Why take something so simple and make it so incomprehensibly complex?

But - Believe it or Not - I digress.

XMP is written using RDF [why re-invent the wheel when you can use somebody else's debacle and make it worse].

I'm not going to get into XMP - but at first glance it looks like they're using attributes for data, XML elements for data, and RDF nested structures for data - with NO obvious logic as to when and where these choices are made. To further mess it up, everything uses XML namespaces - which my XML parser translates back to URIs [which point to nothing, but are long and look cool], like a 'good parser should' - which obfuscates the already obfuscated and bulks out the fluff-to-content ratio admirably.


Here's how I think XML intended to encapsulate a website's creation date:
<site-info site_name="www.example.com">
  <creation-date>August 16, 1999</creation-date>
</site-info>
It's machine-parseable. It's (almost) human readable. It only wastes about 50% of the bandwidth - as opposed to about 83% using RDF.

How about JSON - where the entire spec fits on one web page:
{
  "site_info": {
    "site": "www.example.org",
    "creation_date": "August 16, 1999"
  }
}