Tuesday, March 3, 2009

How can you screw up XML?

XML is about the simplest thing around. How can you screw it up?

I'm writing some metadata-extraction-from-image code and started looking at Adobe's XMP.

XMP is written using the W3C's RDF spec - [I almost wrote 'RDF framework', but that would be Resource Description Framework framework (RDF2), which might not be the same thing].

RDF defines a 'Framework' for writing machine-parseable statements of the form:

Tim has a bike.

RDF calls:

  • Tim the subject

  • has the predicate

  • bike the object
It makes you write URIs for both the subject and the predicate - and have a translation syntax between real words and the URIs.
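At bottom, a statement like that is just a triple of strings, two of them URIs. A minimal sketch in Python - the example.org URIs are made up for illustration, since RDF only requires that subject and predicate be URIs, not that they resolve to anything:

```python
# An RDF statement is a (subject, predicate, object) triple.
# The example.org URIs below are invented for illustration.
subject = "http://www.example.org/people/Tim"
predicate = "http://www.example.org/terms/has"
obj = "bike"

triple = (subject, predicate, obj)
print(triple)
```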

Here's an example from the RDF Primer. It says: 'http://www.example.org was created on August 16, 1999'
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>August 16, 1999</exterms:creation-date>
  </rdf:Description>
</rdf:RDF>
The RDF is 5.9 times LONGER than the English language sentence. Or, to put it another way, about 83% of the text is overhead - a BANDWIDTH UTILIZATION of about 17%.
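A back-of-the-envelope check of those numbers, with the sentence and a well-formed version of the RDF pasted in as Python strings:

```python
english = "http://www.example.org was created on August 16, 1999"

# The RDF Primer example, with closing tags, as one string.
rdf = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>August 16, 1999</exterms:creation-date>
  </rdf:Description>
</rdf:RDF>"""

ratio = len(rdf) / len(english)        # how many times longer
utilization = len(english) / len(rdf)  # useful-content fraction
print(f"{ratio:.1f}x longer, {utilization:.0%} utilization")
```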

For What Gain?

Nothing. And it takes them 6 LONG RFC-style documents to define this mess. That's 6 Long, Boring documents full of repetition and pedantic phrasing, with many MUSTs and SHALLs and MAYs.

But it can be parsed by a machine - if you can understand the spec well enough to write the code.
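For what it's worth, Python's standard library will chew through it - a sketch using xml.etree.ElementTree, with the example RDF (closing tags included) pasted in as a string:

```python
import xml.etree.ElementTree as ET

# A well-formed version of the RDF Primer example, inline for the demo.
rdf = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <exterms:creation-date>August 16, 1999</exterms:creation-date>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(rdf)

# ElementTree expands every prefix back to its full namespace URI,
# so you search for '{namespace-uri}local-name'.
date = root.find(".//{http://www.example.org/terms/}creation-date")
print(date.text)  # August 16, 1999
```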

Why take something so simple and make it so incomprehensibly complex?

But - believe it or not - I digress.

XMP is written using RDF [why re-invent the wheel when you can use somebody else's debacle and make it worse].

I'm not going to get into XMP - but at first glance it looks like they're using attributes for data, XML elements for data, and RDF nested structures for data - with NO obvious logic as to when and where each choice is made. To further mess it up, everything uses XML namespaces - which my XML parser translates back to URIs [which point to nothing, but are long and look cool], like a 'good parser should' - which obfuscates the already obfuscated and bulks out the fluff-to-content ratio admirably.
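That prefix-to-URI translation is easy to see with any namespace-aware parser. A small sketch - the namespace URI is Adobe's real XMP Basic one, but the two-element snippet itself is made up for illustration:

```python
import xml.etree.ElementTree as ET

# http://ns.adobe.com/xap/1.0/ is the XMP Basic schema namespace.
# The 'outer' wrapper element is invented for this demo.
snippet = """<x:outer xmlns:x="http://ns.adobe.com/xap/1.0/">
  <x:CreateDate>2009-03-03</x:CreateDate>
</x:outer>"""

root = ET.fromstring(snippet)
for el in root.iter():
    print(el.tag)
# Every 'x:' prefix comes back as the full URI in braces:
# {http://ns.adobe.com/xap/1.0/}outer
# {http://ns.adobe.com/xap/1.0/}CreateDate
```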


Here's how I think XML was meant to encapsulate a website's creation date:
<site-info site_name="www.example.org">
  <creation-date>August 16, 1999</creation-date>
</site-info>
It's machine-parseable. It's (almost) human-readable. It only wastes about 50% of the bandwidth - as opposed to 83% using RDF.

How about JSON - where the entire spec fits on one web page:
{
  "site_info": {
    "site": "www.example.org",
    "creation_date": "August 16, 1999"
  }
}
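And parsing it is one call to the standard library - a sketch using the snippet above, braces included:

```python
import json

doc = """{
  "site_info": {
    "site": "www.example.org",
    "creation_date": "August 16, 1999"
  }
}"""

data = json.loads(doc)
print(data["site_info"]["creation_date"])  # August 16, 1999
```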
