Cutting The Crap On XML

Author: Martyn Cutcher

email martyn@cutthecrap.biz

website www.cutthecrap.biz

XML seems to have attracted almost a religious following. And like most religions this means that it is followed with unquestioning loyalty.

So let's Cut The Crap!

What Is XML?

XML is a text based data representation syntax. Period.

It can be used fairly productively in a variety of data representation situations, and has become the de facto standard for many purposes.

How Is XML Used?

It is used in two fundamental ways.

The Document Object Model - DOM

The first XML parsers would convert source XML data into a DOM. An API to the DOM then provided programmers with convenient functions to manipulate the "object model" of the source XML.
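As a rough illustration of this style of processing, here is a minimal sketch using the DOM parser that ships with the JDK. The class name DomSketch and its helper methods are made up for this example, and the sample document borrows the business/order XML used later in this article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of any Cut The Crap library.
class DomSketch {
    // Parses the whole document into an in-memory DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Uses the DOM API to count <order-item> elements anywhere in the tree.
    static int countOrderItems(Document doc) {
        return doc.getElementsByTagName("order-item").getLength();
    }

    public static void main(String[] args) {
        String xml = "<business id=\"northwind\">"
                + "<product id=\"widget\" description=\"Just a widget\"/>"
                + "<order id=\"ORD-123\"><order-item product=\"widget\"/></order>"
                + "</business>";
        Document doc = parse(xml);
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName() + " id=" + root.getAttribute("id"));
        System.out.println("order items: " + countOrderItems(doc));
    }
}
```

Note that nothing can be read until parse has consumed the entire input, which is exactly the property that causes trouble once documents grow.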

This "methodology" was sufficient to exploit XML in many systems. It demonstrated how useful XML could be, particularly as a method of common data exchange between systems.

The deal here is not that XML specifically was a good idea, but that some agreed data format should serve as a common method to externalize/exchange data.

After exposure to HTML the XML syntax was found easily acceptable and moreover was supported by major industry players.

From a technical perspective, XML can easily be bettered by LISP syntax - developed in 1958. For those interested in why LISP is technically superior, you have only to consider the structural restrictions on XML attributes to understand XML's unsuitability for generic data representation.

Any further doubts should be removed by looking at XSLT :-)

Scalability Issues With DOM

Of course, once initial demonstrations and prototypes had shown how XML DOMs could be used, everyone got carried away.

The XML documents got bigger and bigger. DOM-based processing itself became a problem, since the entire XML source document had to be processed and then held in memory while the application retrieved data using the DOM API.

SAX To The Rescue

Once it had been decided that data should be represented using XML, a new approach was needed to process large XML files scalably.

SAX parsers provided an event-based solution that helped applications to parse source XML. This was a solution for those applications that only required transient access to the XML data - for example to make a function call, or update a database.

SAX does not however help those applications that have become dependent on a DOM-based data model.

Even for those applications where SAX is relevant it brings its own problems. Programmers are now responsible for managing the interpretation of the low-level SAX events.

Essentially SAX-based processing requires maintenance of potentially complex intermediate data structures before any resulting processing can be actioned.
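To make the bookkeeping concrete, here is a minimal sketch using the SAX parser bundled with the JDK. The class SaxSketch and the name element it looks for are assumptions made for this example only. Even for this trivial task the handler has to remember whether it is currently inside the element of interest and stitch the text together itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the text of every <name> element.
class SaxSketch {
    static List<String> names(String xml) {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text;   // non-null only while inside <name>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("name")) text = new StringBuilder();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several pieces.
                if (text != null) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("name")) {
                    names.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(names("<products><product><name>widget</name></product></products>"));
    }
}
```

Real documents need far more state than a single StringBuilder, which is where the intermediate data structures start to pile up.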

Huge DOMs

For those applications that require DOM access, SAX is no solution at all. Claims that "memory is cheap" and suggestions of gigabyte Java virtual memory are no scalable solution.

A SAXy DOM - MiniDOM

Yeah, I know that there is some other system out there called MiniDOM - but what the hell.

MiniDOM directly addresses the problem that applications have when processing XML using SAX parsers - building intermediate data structures.

After writing a few SAX-based applications I had developed a number of useful processing patterns - none of them original I am sure. Typically I would create some new context when "starting" a specific tag, and subsequent SAX events were interpreted by that context. Contexts could be nested scalably and were pushed and popped according to the XML structure.

Taking a few steps back, I could see that what I was building were "mini object models" that were processed when they were complete.

MiniDOM takes this generic idea and builds "mini" DOMs from SAX events.

The programmer registers handlers with either specific tags, or - just to be cute - with an XPath expression.

When the handler is triggered, a MiniDOM Element is passed to the handler, which can then directly access all the data within the element.

This hugely simplifies SAX processing.
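The general idea can be sketched with nothing but the JDK. The code below is not MiniDOM and does not use its API: the class MiniTreeSketch, its register method and the use of standard W3C Element objects for the small trees are all assumptions made for illustration. It handles registration by tag name only (no XPath), and a registered tag nested inside another registered tag is simply delivered as part of the outer tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: builds a small tree per registered tag from SAX events.
class MiniTreeSketch extends DefaultHandler {
    private final Map<String, Consumer<Element>> handlers = new HashMap<>();
    private final Document factory;   // used only as a node factory
    private Node current;             // null while outside every registered tag

    MiniTreeSketch() {
        try {
            factory = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    void register(String tag, Consumer<Element> handler) {
        handlers.put(tag, handler);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (current == null && !handlers.containsKey(qName)) return;   // not of interest
        Element e = factory.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        if (current != null) current.appendChild(e);
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) current.appendChild(factory.createTextNode(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null) return;
        Node parent = current.getParentNode();
        if (parent == null) {
            // The registered element is complete: hand it over, then let it go.
            handlers.get(qName).accept((Element) current);
        }
        current = parent;
    }

    void parse(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), this);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Usage example: one summary line per <order>, built from its small tree.
    static List<String> summarizeOrders(String xml) {
        List<String> out = new ArrayList<>();
        MiniTreeSketch sketch = new MiniTreeSketch();
        sketch.register("order", order -> out.add(order.getAttribute("id") + ":"
                + order.getElementsByTagName("order-item").getLength()));
        sketch.parse(xml);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(summarizeOrders(
                "<business><order id=\"A\"><order-item/><order-item/></order><order id=\"B\"/></business>"));
    }
}
```

Only one small tree is alive at any moment, so memory use follows the size of the largest registered element rather than the size of the file.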

MiniDOM is available under LGPL. Check out the whitepaper here.

PDOM - A Persistent W3C DOM Implementation

PDOM solves the other XML processing problem: where some huge XML file imposes resource problems for an application needing access to the DOM.

There are two main problems.

Processing the XML

One problem arises when a large XML file is rarely changed, but every time it is accessed there is an unacceptable delay in processing the file before the DOM is available.

PDOM solves this problem by enabling the file to be processed a single time, and thereafter providing timely access to the DOM representation.

Using the DOM

The second problem is the resource requirement of maintaining the DOM in a Java VM.

A large DOM will require a large Java virtual memory space. While access to data in the DOM may be fast, the large virtual memory space may cause significant problems with garbage collection performance - although the generational models now used will mitigate this.

But, in general, just insisting on the availability of larger VMs is no solution. It is a well known adage that for any performance problem you should look first to software rather than hardware.

PDOM stands for "Persistent" DOM, and provides a W3C compliant DOM API to data that is stored persistently (essentially in a file somewhere). The expectation may be that this will lead to performance problems but experience does not confirm this. PDOM is based on a mature and well tested persistent object technology.

The underlying technology uses object caching and indexing to provide access to the objects requested.

Extensions to DOM API

In addition to the DOM API, some extra methods are provided that address access to a persistent model more scalably. For example, in addition to implementing the getChildNodes method, which returns a NodeList (which in turn provides indexed access to the child nodes), a method getChildNodesIterator returns a structure that can be used to iterate through all the child nodes more efficiently. This is because the underlying representation uses a persistent linked list, which does not provide efficient indexed access.
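The difference between the two access styles can be sketched against the standard W3C DOM in the JDK. This is not PDOM code: the class ChildIterationSketch and the shape of childIterator (a plain java.util.Iterator over Node) are assumptions, since the actual return type of getChildNodesIterator is not described here. The iterator simply follows the sibling links, whereas indexed access on a linked list may have to walk from the head for every item(i) call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch comparing link-following iteration with indexed access.
class ChildIterationSketch {
    // Each step follows one sibling link, however long the child list is.
    static Iterator<Node> childIterator(Node parent) {
        return new Iterator<Node>() {
            private Node next = parent.getFirstChild();

            @Override
            public boolean hasNext() {
                return next != null;
            }

            @Override
            public Node next() {
                if (next == null) throw new NoSuchElementException();
                Node result = next;
                next = next.getNextSibling();
                return result;
            }
        };
    }

    static List<String> namesByIterator(Node parent) {
        List<String> names = new ArrayList<>();
        for (Iterator<Node> it = childIterator(parent); it.hasNext();) {
            names.add(it.next().getNodeName());
        }
        return names;
    }

    // Same result via NodeList; on a linked list each item(i) may cost O(i).
    static List<String> namesByIndex(Node parent) {
        List<String> names = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            names.add(children.item(i).getNodeName());
        }
        return names;
    }

    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Node root = parseRoot("<r><a/><b/><c/></r>");
        System.out.println(namesByIterator(root));
        System.out.println(namesByIndex(root));
    }
}
```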

A General XML Database

As well as providing a scalable Persistent DOM, PDOM provides an interface to store multiple persistent XML documents in the same datastore.

These document objects, which implement the W3C Document interface, are named for later retrieval.

Temporary File Option

If the purpose in using PDOM is simply to solve the memory resource issue when loading and using a huge DOM, then the option to back the PDOM with a temporary file will ensure the system does not leave unnecessary files lying around.

XPath

Support for XPath provides flexible styles of interaction for those that would rather not use the DOM API directly.

The id() function has relaxed semantics in PDOM. It is defined in the XPath standard as taking a "string" value and returning a NodeSet. This "string" is either a single element "id" or a space delimited list of "ids", such as "id1 id2 id3".

What a load of nonsense!

The "Cut The Crap" implementation of id() ignores the "space delimited" interpretation on principle (tho' like most principles this one too can be modified).

An additional possibility tho' is that the parameter may be computed. This relatively minor enhancement qualitatively changes the possible expressions, supporting navigation around the "net" of XML associations. Consider:

<business id="northwind">
  <product id="widget" description="Just a widget"/>
  <order id="ORD-123">
    <order-item product="widget"/>
  </order>
</business>

You can extract the descriptions of the products on order ORD-123 using the following query:

Iterator res = pdocument.queryXPath("id('ORD-123')/order-item/id(@product)/@description");

This minor change now enables more general XML navigation.
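For comparison, here is a sketch of the same navigation using only the stock XPath engine in the JDK, with no PDOM and no DTD-declared ID attributes. It does not use id() at all. Instead it follows the association by comparing attribute values, which costs a document scan that PDOM's id() lookup presumably avoids. The class name XPathJoinSketch and the variable name orderId are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: order -> order-item -> product -> description in plain XPath 1.0.
class XPathJoinSketch {
    // A product matches when its id equals the product attribute of any item on the order.
    private static final String QUERY =
            "//product[@id = //order[@id = $orderId]/order-item/@product]/@description";

    static List<String> descriptions(Document doc, String orderId) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Supply the order id as an XPath variable instead of pasting it into the query.
            xpath.setXPathVariableResolver(
                    name -> "orderId".equals(name.getLocalPart()) ? orderId : null);
            NodeList hits = (NodeList) xpath.evaluate(QUERY, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                out.add(hits.item(i).getNodeValue());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<business id=\"northwind\">"
                + "<product id=\"widget\" description=\"Just a widget\"/>"
                + "<order id=\"ORD-123\"><order-item product=\"widget\"/></order>"
                + "</business>");
        System.out.println(descriptions(doc, "ORD-123"));
    }
}
```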

Always Use PDOM!

PDOM does not have to be used only when the DOM would be huge. It can be used for any DOM.

The underlying object technology can even be used for in-memory models only. In fact the representation techniques used in PDOM compare well efficiency-wise with other in-memory DOM implementations.

As your DOM becomes bigger, you have a simple option to configure the PDOM to utilize a persistent backing store.

How Do I Get PDOM?

PDOM uses the Generic Persistent Object Model (GPO) that can be downloaded from here. Other information on GPO can be found on the main Cut The Crap site, and a short whitepaper on GPO here.

Since PDOM is dependent on GPO, it requires a licensed version of the underlying GPO system to lift the time limit on backing store updates. PDOM is included in the standard "Cut The Crap" download available here.

A separate whitepaper on PDOM is available here.

A Quick Unmodifiable DOM - QUDOM

I believe that there is a large class of users out there that simply want to access XML data as quickly as possible. They do not want to update the DOM, they simply want to read the data.

It is for these users that I have developed QUDOM. The design aim is to parse the raw XML as quickly as possible, and provide a read-only DOM that minimises Java VM resources.

The result to date is a DOM implementation that parses XML files over twice as fast as Xerces and uses less than a quarter of the memory to store the DOM.

For example, on a PIII 1GHz laptop, a 20MB XML file is parsed in less than 2 seconds and requires only 11MB of Java VM storage.

This performance is achieved by implementing a native XML parser - QParser - that provides an alternative to the standard SAX content handler interface. Instead it defines a number of new callback methods based around direct buffer access. Essentially the parser reads the XML data into a circular buffer, and makes callbacks that reference the buffer contents - avoiding the creation of intermediate objects.

The main callbacks use the following method signature:

callback(byte[] buf, int start, int len1, int len2);

This should be interpreted to mean that the data begins at offset start for length len1 and then continues at offset 0 for length len2. For the majority of calls len2 will be zero.
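A small sketch of how a callback might turn those four arguments into a String. The class WrapDecodeSketch is hypothetical and UTF-8 is assumed as the encoding, since the encoding handled by QParser is not stated here. The two pieces are joined before decoding so that a multi-byte character split across the wrap point is not mangled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes a region of a circular buffer that may wrap around.
class WrapDecodeSketch {
    static String decode(byte[] buf, int start, int len1, int len2) {
        if (len2 == 0) {
            // Common case: the data is contiguous, so decode it in place.
            return new String(buf, start, len1, StandardCharsets.UTF_8);
        }
        // Wrapped case: the tail of the buffer followed by the head.
        byte[] joined = new byte[len1 + len2];
        System.arraycopy(buf, start, joined, 0, len1);
        System.arraycopy(buf, 0, joined, len1, len2);
        return new String(joined, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "abcdef" stored so that it wraps: "abc" at the end, "def" at the start.
        byte[] buf = "defXXabc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decode(buf, 5, 3, 3));
    }
}
```

Creating a String for every callback would of course throw away much of the benefit of direct buffer access, so a real handler would decode only the values it actually needs.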

Provided that attributes and tag names are less than 4096 characters long, there should be no problems.

Callbacks for character data, such as text and comment nodes, take an additional boolean parameter to indicate whether the callback is a continuation of the previous callback. In this way large text or comment blocks can be processed efficiently.
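How a handler might use the continuation flag can be sketched as follows. Everything here is an assumption: the class TextChunkSketch, the method name text, the position of the boolean as the last parameter, and UTF-8 as the encoding. The sketch simply glues continuation chunks onto the block in progress and starts a new block otherwise.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical handler: reassembles text blocks delivered in several callbacks.
class TextChunkSketch {
    private final List<String> blocks = new ArrayList<>();
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private boolean open;   // true while a block is being collected

    // Assumed shape: the usual four buffer arguments plus the continuation flag.
    void text(byte[] buf, int start, int len1, int len2, boolean continuation) {
        if (!continuation) finish();       // a fresh block starts here
        open = true;
        pending.write(buf, start, len1);   // data up to the end of the buffer
        pending.write(buf, 0, len2);       // wrapped remainder, if any
    }

    // Closes the block in progress, if there is one.
    void finish() {
        if (open) {
            blocks.add(new String(pending.toByteArray(), StandardCharsets.UTF_8));
            pending.reset();
            open = false;
        }
    }

    List<String> blocks() {
        finish();
        return blocks;
    }

    public static void main(String[] args) {
        TextChunkSketch handler = new TextChunkSketch();
        handler.text("Hello, ".getBytes(StandardCharsets.UTF_8), 0, 7, 0, false);
        handler.text("world".getBytes(StandardCharsets.UTF_8), 0, 5, 0, true);
        System.out.println(handler.blocks());
    }
}
```

A handler that can process text incrementally (writing it straight to a stream, say) would skip the accumulation and so never hold a large block in memory.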

The QDOM object implements the QParser.Handler interface and provides the read-only DOM implementations.

QDOM processes the parsed data and writes to a structured data buffer. Fast custom String internalization minimises memory usage for tags and attributes.

QDOM implements the W3C DOM Document interface.

Transient DOM Objects

The DOM implementation objects are transient wrappers around the parsed data, processing the data lazily whenever requested.

The only drawback to this approach is that objects do not retain their identity:

Node c1 = node.getChildNodes().item(i);
Node c2 = node.getChildNodes().item(i);

c1 != c2 but c1.equals(c2)
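The behaviour can be pictured with a tiny wrapper class. This is only a sketch of the idea, not the QDOM implementation: the class NodeView and the notion that a node is identified by an offset into the data buffer are assumptions. Two wrappers created for the same position are different objects, but compare equal and hash alike.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a transient wrapper around parsed data.
class TransientWrapperSketch {
    static final class NodeView {
        private final byte[] data;   // shared parsed-data buffer
        private final int offset;    // where this node's record starts (assumed)

        NodeView(byte[] data, int offset) {
            this.data = data;
            this.offset = offset;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof NodeView)) return false;
            NodeView other = (NodeView) o;
            return other.data == data && other.offset == offset;
        }

        @Override
        public int hashCode() {
            return 31 * System.identityHashCode(data) + offset;
        }
    }

    // A fresh wrapper is created on every request; nothing is cached.
    static NodeView child(byte[] data, int offset) {
        return new NodeView(data, offset);
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        NodeView c1 = child(data, 4);
        NodeView c2 = child(data, 4);
        System.out.println((c1 != c2) + " " + c1.equals(c2));

        Set<NodeView> seen = new HashSet<>();
        seen.add(c1);
        System.out.println(seen.contains(c2));
    }
}
```

In practice this means comparing nodes with equals() rather than ==, and avoiding identity-based collections such as IdentityHashMap.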

Read Only!

Any call to an updating DOM method will throw a DOMException with code NO_MODIFICATION_ALLOWED_ERR.
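From the caller's side this looks like any other DOMException. The sketch below imitates a read-only node with a java.lang.reflect.Proxy around a standard JDK DOM node, purely so there is something to call. It is not how QUDOM is implemented, the class ReadOnlySketch is hypothetical, and children returned through the proxy are not wrapped. The part worth copying is rejectsUpdate, which checks the code field against the standard W3C constant DOMException.NO_MODIFICATION_ALLOWED_ERR.

```java
import java.io.StringReader;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: a read-only view of a Node and a caller that copes with it.
class ReadOnlySketch {
    // Node methods that modify the tree or its values.
    private static final Set<String> MUTATORS = new HashSet<>(Arrays.asList(
            "appendChild", "insertBefore", "removeChild", "replaceChild",
            "setNodeValue", "setTextContent", "setPrefix", "normalize"));

    static Node readOnly(Node target) {
        return (Node) Proxy.newProxyInstance(
                ReadOnlySketch.class.getClassLoader(),
                new Class<?>[] { Node.class },
                (proxy, method, args) -> {
                    if (MUTATORS.contains(method.getName())) {
                        throw new DOMException(DOMException.NO_MODIFICATION_ALLOWED_ERR, "read-only DOM");
                    }
                    try {
                        return method.invoke(target, args);   // reads pass straight through
                    } catch (InvocationTargetException e) {
                        throw e.getCause();
                    }
                });
    }

    // Returns true if the node refuses the update with the expected error code.
    static boolean rejectsUpdate(Node node) {
        try {
            node.setTextContent("changed");
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.NO_MODIFICATION_ALLOWED_ERR;
        }
    }

    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Node root = parseRoot("<a>text</a>");
        System.out.println(rejectsUpdate(readOnly(root)));
        System.out.println(readOnly(root).getTextContent());
    }
}
```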

How Do I Get QUDOM?

QUDOM is included in the full Cut The Crap software distribution and can be downloaded from here.

QUDOM does not rely on the Generic Persistent Object Model and is functionally unrestricted. It is released as "shareware" and users are encouraged to make a shareware contribution at the main payment page.

A separate whitepaper on QUDOM is available here.