XML seems to have attracted an almost religious following. And like most religions, this means that it is followed with unquestioning loyalty.
So let's Cut The Crap!
XML is a text based data representation syntax. Period.
It can be used fairly productively in a variety of data representation situations, and has
become the de facto standard for many purposes.
It is used in two fundamental ways.
The first XML parsers would convert source XML data into a DOM. An API to the
DOM then provided programmers with convenient functions to manipulate the "object model" of the
source XML.
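As a minimal sketch of this style of processing, the following uses the standard JAXP API that ships with Java. The class name DomSketch and the sample XML are assumptions made for illustration only. Note that the whole document is parsed and held in memory before any data can be read.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any Cut The Crap library.
final class DomSketch {
    // Parses the complete XML text into an in-memory DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Counts the elements with the given tag name anywhere in the document.
    static int count(Document doc, String tag) {
        return doc.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<order id='ORD-123'><order-item product='widget'/></order>");
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName() + " " + root.getAttribute("id"));
        System.out.println(count(doc, "order-item"));
    }
}
```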
This "methodology" was sufficient to exploit XML in many systems. It demonstrated how useful XML could be, particularly as a common method of data exchange between systems.
The point here is not that XML specifically was a good idea, but that some agreed data format should serve as a common way to externalize and exchange data.
After exposure to HTML, people found the XML syntax easy to accept,
and moreover it was supported by major industry players.
From a technical perspective, XML can easily be bettered by LISP
syntax - developed in 1958. For those interested in why LISP is technically
superior, you have only to consider the structural restrictions on XML attributes
to understand its unsuitability in generic data representation.
Any further doubts should be removed by looking at XSLT :-)
Of course, once initial demonstrations and prototypes had shown how XML DOMs could be used, everyone got carried away.
The XML documents got bigger and bigger. DOM-based processing itself became a problem, since the entire XML source document had to be processed and then held in memory while the application retrieved data using the DOM API.
Having decided that data should be represented using XML, a new approach was needed to process large XML files scalably.
SAX parsers provided an event-based solution that helped applications to parse source XML. This was a solution to those applications that only required transient access to the XML data - for example to make a function call, or update a database.
SAX does not however help those applications that have become dependent on a DOM-based data model.
Even for those applications where SAX is relevant it brings its own problems. Programmers are now responsible for managing the interpretation of the low-level SAX events.
Essentially, SAX-based processing requires maintenance of potentially complex intermediate data structures before any resulting processing can be actioned.
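To give a flavour of that bookkeeping, here is a minimal sketch using the standard JAXP SAX API. The class name SaxSketch and the order and order-item element names are assumptions for illustration. Even for this trivial case, the application has to remember by hand which order it is currently inside.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, written only to illustrate manual SAX state handling.
final class SaxSketch {
    // Returns "orderId:product" for every order-item found inside an order.
    static List<String> orderItems(String xml) {
        List<String> result = new ArrayList<>();
        Deque<String> orderIds = new ArrayDeque<>(); // intermediate state kept by the application
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("order")) {
                    String id = atts.getValue("id");
                    orderIds.push(id == null ? "" : id);
                } else if (qName.equals("order-item") && !orderIds.isEmpty()) {
                    result.add(orderIds.peek() + ":" + atts.getValue("product"));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("order")) {
                    orderIds.pop(); // leaving the order: discard its context
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<business><order id='ORD-123'><order-item product='widget'/></order></business>";
        System.out.println(orderItems(xml));
    }
}
```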
For those applications that require DOM access, SAX is no solution at all. Claims that "memory is cheap" and suggestions of gigabyte Java virtual memory are no scalable solution.
Yeah, I know that there is some other system out there called MiniDOM - but what the hell.
MiniDOM directly addresses the problem that applications have when processing XML using SAX parsers - building intermediate data structures.
After writing a few SAX-based applications I had developed a number of useful processing patterns - none of them original, I am sure. Typically I would create some new context when "starting" a specific tag, and subsequent SAX events were interpreted by that context. Contexts could be nested scalably and were pushed and popped according to the XML structure.
Taking a few steps back, I could see that what I was building were "mini object models" that were processed when they were complete.
MiniDOM takes this generic idea and builds "mini" DOMs from SAX events.
The programmer registers handlers with either specific tags, or - just to be cute - with an
XPath expression.
When the handler is triggered, a MiniDOM Element is passed to the handler, which
can then directly access all the data within the element.
This hugely simplifies SAX processing.
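The general idea can be sketched with nothing but the standard SAX and DOM APIs. This is not the MiniDOM API: the class MiniDomSketch, its register method and the use of java.util.function.Consumer are all assumptions made for illustration. A small DOM subtree is built only for registered tags and handed to the handler once its end tag is seen. In this sketch, a registered tag nested inside another registered tag is simply built as part of the outer subtree.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of "mini DOMs from SAX events"; not the real MiniDOM classes.
final class MiniDomSketch extends DefaultHandler {
    private final Map<String, Consumer<Element>> handlers = new HashMap<>();
    private final Document nodeFactory = newDocument(); // used only to create nodes
    private Element current; // null while outside any registered element

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    void register(String tag, Consumer<Element> handler) {
        handlers.put(tag, handler);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (current == null && !handlers.containsKey(qName)) {
            return; // nobody is interested in this element: build nothing
        }
        Element e = nodeFactory.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        if (current != null) {
            current.appendChild(e);
        }
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(nodeFactory.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null) {
            return;
        }
        Node parent = current.getParentNode();
        if (parent == null) {
            handlers.get(qName).accept(current); // the mini DOM is complete
        }
        current = (parent instanceof Element) ? (Element) parent : null;
    }

    static void parse(String xml, MiniDomSketch sketch) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), sketch);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        MiniDomSketch sketch = new MiniDomSketch();
        sketch.register("order", order -> System.out.println(order.getAttribute("id")));
        parse("<business><order id='ORD-123'><order-item product='widget'/></order></business>", sketch);
    }
}
```

Once the handler returns, nothing refers to the subtree any more, so memory use stays proportional to the largest registered element rather than to the whole document.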
MiniDOM is available under LGPL. Check out the whitepaper here.
PDOM solves the other XML processing problem: a huge XML file imposing resource problems on an application that needs access to the DOM.
There are two main problems.
One problem arises when a large XML file rarely changes, but every time it is accessed there is an unacceptable delay while the file is processed before the DOM is available.
PDOM solves this problem by enabling the file to be processed a single time, and thereafter providing timely access to the DOM representation.
The second problem is the resource requirement of maintaining the DOM in a Java VM.
A large DOM will require a large Java virtual memory space. While access to data in the DOM may be fast, the large virtual memory space may cause significant problems with garbage collection performance - although the generational models now used will mitigate this.
But, in general, just insisting on the availability of larger VMs is no solution. It is a well known adage that for any performance problem you should look first to software rather than hardware.
PDOM stands for "Persistent" DOM, and provides a W3C-compliant DOM API to data that is stored persistently (essentially in a file somewhere). The expectation may be that this will lead to performance problems, but experience does not confirm this. PDOM is based on a mature and well-tested persistent object technology.
The underlying technology uses object caching and indexing to provide access to the objects requested.
In addition to the DOM API, some extra methods are provided that address
access to a persistent model more scalably. For example, as well as implementing the getChildNodes
method, which returns a NodeList (that in turn provides indexed access to the child nodes),
PDOM offers a getChildNodesIterator method that returns a structure that can be used
to iterate through all the nodes more efficiently. This is because the underlying representation uses a persistent
linked list, which does not provide efficient indexed access.
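The cost difference is easy to see with a toy linked list. The sketch below is not PDOM code: the class LinkedChildren and its hops counter are assumptions for illustration, and the list lives in memory rather than in a persistent store. Indexed access has to walk from the head on every call, whereas an iterator just follows one link per step.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical in-memory stand-in for a linked list of child nodes.
final class LinkedChildren implements Iterable<String> {
    private static final class Cell {
        final String value;
        Cell next;
        Cell(String value) { this.value = value; }
    }

    private Cell head;
    private Cell tail;
    private int size;
    long hops; // number of cells stepped over, kept only for illustration

    void add(String value) {
        Cell cell = new Cell(value);
        if (head == null) {
            head = cell;
        } else {
            tail.next = cell;
        }
        tail = cell;
        size++;
    }

    int getLength() {
        return size;
    }

    // Indexed access, in the style of NodeList.item: walks from the head each time.
    String item(int index) {
        Cell c = head;
        for (int i = 0; i < index && c != null; i++) {
            c = c.next;
            hops++;
        }
        return c == null ? null : c.value;
    }

    // Iterator access: one step per element.
    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private Cell cursor = head;

            @Override
            public boolean hasNext() {
                return cursor != null;
            }

            @Override
            public String next() {
                if (cursor == null) {
                    throw new NoSuchElementException();
                }
                String value = cursor.value;
                cursor = cursor.next;
                hops++;
                return value;
            }
        };
    }

    public static void main(String[] args) {
        LinkedChildren list = new LinkedChildren();
        for (int i = 0; i < 1000; i++) {
            list.add("child" + i);
        }
        for (int i = 0; i < list.getLength(); i++) {
            list.item(i);
        }
        System.out.println("indexed loop hops: " + list.hops);
        list.hops = 0;
        for (String child : list) {
            // visiting each child once
        }
        System.out.println("iterator hops: " + list.hops);
    }
}
```

With a persistent list each hop may also mean fetching an object from the backing store, so the quadratic cost of the indexed loop hurts even more than it does in memory.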
As well as providing a scalable Persistent DOM, PDOM provides an interface to store multiple persistent XML documents in the same datastore.
These document objects - which implement the W3C Document interface - are named for
later retrieval.
If the purpose in using PDOM is simply to solve the memory resource issue when loading and using a huge DOM, then the option to back the PDOM with a temporary file will ensure the system does not leave unnecessary files lying around.
Support for XPath provides flexible styles of interaction for those that would rather not use the DOM API directly.
The id() function has relaxed semantics in PDOM.
It is defined in the XPath standard as taking a "string" value and
returning a NodeSet. This "string" is either a single element "id" or a space-delimited
list of "ids", such as "id1 id2 id3".
What a load of nonsense!
The "Cut The Crap" implementation of id() ignores the "space delimited"
interpretation on principle (tho' like most principles this one too can be modified).
An additional possibility tho' is that the parameter may be computed. This relatively minor
enhancement qualitatively changes the possible expressions to support navigation around the
"net" of XML associations. Consider:
<business id="northwind">
  <product id="widget" description="Just a widget"/>
  <order id="ORD-123">
    <order-item product="widget"/>
  </order>
</business>
You can extract the product descriptions using the following query:
Iterator res = pdocument.queryXPath("id('ORD-123')/order-item/id(@product)/@description");
This minor change now enables more general XML navigation.
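For comparison, the same navigation can be done by hand with the standard DOM API and an explicit id index. This is only a sketch of what the query expresses, not how PDOM evaluates it: the class IdNavigationSketch and the descriptionsForOrder method are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper that mimics id('ORD-123')/order-item/id(@product)/@description.
final class IdNavigationSketch {
    static List<String> descriptionsForOrder(String xml, String orderId) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }

        // Index every element that carries an "id" attribute.
        Map<String, Element> byId = new HashMap<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("id")) {
                byId.put(e.getAttribute("id"), e);
            }
        }

        List<String> descriptions = new ArrayList<>();
        Element order = byId.get(orderId); // id('ORD-123')
        if (order == null) {
            return descriptions;
        }
        for (Node n = order.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals("order-item")) { // /order-item
                Element product = byId.get(((Element) n).getAttribute("product")); // /id(@product)
                if (product != null) {
                    descriptions.add(product.getAttribute("description")); // /@description
                }
            }
        }
        return descriptions;
    }

    public static void main(String[] args) {
        String xml = "<business id=\"northwind\">"
                + "<product id=\"widget\" description=\"Just a widget\"/>"
                + "<order id=\"ORD-123\"><order-item product=\"widget\"/></order>"
                + "</business>";
        System.out.println(descriptionsForOrder(xml, "ORD-123"));
    }
}
```

The single XPath expression replaces the index, the loop and both lookups, which is the point of allowing a computed argument.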
PDOM does not have to be used only when the DOM would be huge. It can be used for any DOM.
The underlying object technology can even be used for in-memory models only. In fact the representation techniques used in PDOM compare well efficiency-wise with other in-memory DOM implementations.
As your DOM becomes bigger, you have a simple option to configure the PDOM to utilize a persistent backing store.
PDOM uses the Generic Persistent Object Model (GPO), which can be downloaded from
here.
Other information on GPO can be found on the main
Cut The Crap site, and a short whitepaper on GPO is available
here.
Since PDOM is dependent on GPO, it requires a licensed version of the underlying
GPO system to remove time-limited backing store updates. PDOM is included in the
standard "Cut The Crap" download available
here.
A separate whitepaper on PDOM is available here.
I believe that there is a large class of users out there that simply want to access XML data as quickly as possible. They do not want to update the DOM, they simply want to read the data.
It is for these users that I have developed QUDOM. The design aim is to parse the raw XML as quickly as possible, and to provide a read-only DOM that minimises Java VM resource usage.
The result to date is a DOM implementation that parses XML files over twice as fast as Xerces and uses less than a quarter of the memory to store the DOM.
For example, on a PIII 1GHz laptop, a 20MB XML file is parsed in less than 2 seconds and requires only 11MB of Java VM storage.
This performance is achieved by implementing a native XML parser - QParser - that provides an alternative
to the standard SAX content handler interface. Instead it defines a number of new callback
methods based around direct buffer access. Essentially the parser reads the XML data into a
circular buffer, and makes callbacks that reference the buffer contents - avoiding the creation
of intermediate objects.
The main callbacks use the following method signature:
callback(byte[] buf, int start, int len1, int len2);
This should be interpreted to mean that the data begins at offset start for
length len1 and then continues at offset 0 for length len2.
For the majority of calls len2 will be zero.
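A handler that does need the data as a String could join the two segments along the following lines. This is a sketch only: the class WrapSketch and the text method are hypothetical, and UTF-8 is an assumption, since nothing here says which encoding the buffer holds.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for interpreting a (buf, start, len1, len2) callback.
final class WrapSketch {
    // Joins the segment at [start, start + len1) with the wrapped segment at [0, len2).
    static String text(byte[] buf, int start, int len1, int len2) {
        byte[] joined = new byte[len1 + len2];
        System.arraycopy(buf, start, joined, 0, len1);
        System.arraycopy(buf, 0, joined, len1, len2); // copies nothing when len2 is zero
        return new String(joined, StandardCharsets.UTF_8); // encoding is an assumption
    }

    public static void main(String[] args) {
        // "widget" written into a 6-byte circular buffer starting at offset 3.
        byte[] buf = "getwid".getBytes(StandardCharsets.UTF_8);
        System.out.println(text(buf, 3, 3, 3));
    }
}
```

Copying like this gives up the benefit of direct buffer access, so it only makes sense for the values an application actually wants to keep.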
Provided attributes and tag names are shorter than 4096 characters, there should be
no problems.
Callbacks for character data - such as text and comment nodes - take an additional
boolean parameter to indicate whether the callback is a continuation of the previous callback.
In this way large text or comment blocks can be processed efficiently.
The QDOM object implements the QParser.Handler interface and
provides the read-only DOM implementations.
QDOM processes the parsed data and writes to a structured data buffer. Fast custom
String internalization minimises memory usage for tags and attributes.
QDOM implements the W3C DOM Document interface.
The DOM implementation objects are transient wrappers around the parsed data, processing the data lazily whenever requested.
The only drawback to this approach is that objects do not retain their identity:
Node c1 = node.getChildNodes().item(i);
Node c2 = node.getChildNodes().item(i);
c1 != c2 but c1.equals(c2)
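The behaviour follows naturally from the wrapper design, as this small sketch shows. It is not the QUDOM implementation: the class NodeView, its fields and the child method are assumptions for illustration. Each request creates a fresh wrapper over the same position in the shared data, and equals compares positions rather than object identity.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical transient wrapper around a position in a shared data buffer.
final class NodeView {
    private final byte[] data; // shared parsed data, never copied
    private final int offset;  // where this node's record starts

    private NodeView(byte[] data, int offset) {
        this.data = data;
        this.offset = offset;
    }

    // Every call creates a new wrapper object, so identity is not retained.
    static NodeView child(byte[] data, int offset) {
        return new NodeView(data, offset);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof NodeView)) {
            return false;
        }
        NodeView view = (NodeView) other;
        return view.data == data && view.offset == offset;
    }

    @Override
    public int hashCode() {
        return 31 * System.identityHashCode(data) + offset;
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        NodeView c1 = child(data, 3);
        NodeView c2 = child(data, 3);
        System.out.println(c1 != c2);
        System.out.println(c1.equals(c2));
        Set<NodeView> set = new HashSet<>();
        set.add(c1);
        set.add(c2);
        System.out.println(set.size());
    }
}
```

In practice this means comparing nodes with equals rather than ==, and keying maps or sets on nodes only when hashCode is consistent with equals, as it is in this sketch.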
Any call to a DOM method that would update the document will throw a NO_MODIFICATION_ALLOWED_ERR
DOMException.
QUDOM is included in the full Cut The Crap software distribution and can be downloaded from here.
QUDOM does not rely on the Generic Persistent Object model and is functionally unrestricted. It is released as "shareware" and users are encouraged to make a shareware contribution at the main payment page.
A separate whitepaper on QUDOM is available here.