Why And How To Use QUDOM

Author: Martyn Cutcher

email martyn@cutthecrap.biz

website www.cutthecrap.biz

What is QUDOM?

QUDOM stands for Quick Unmodifiable Document Object Model.

QUDOM implements the W3C DOM API, but will throw a QNoModificationException if any methods are used that are intended to modify the data.

The only way that data can be added to QUDOM is via the QDOM constructor that accepts an InputStream of xml data.

Is it straightforward to use?

Here is some example java code:

import cutthecrap.qudom.*;

QDOM qd = new QDOM(new FileInputStream("D:/testxml/opera.xtm"));
QDocument doc = qd.getDocument();

The QDocument class implements the org.w3c.dom.Document interface.

What Is The Point of QUDOM

QUDOM is targetted at users who need quick, low cost access to xml data. It is optimised in two areas. First it includes a simple and efficient parser.

QParser defines a "SAX-like" handler interface. Here it is:

public interface Handler {
  public void startElement(byte[] buf, int start, int len1, int len2);
  public void attribute(byte[] buf, int stName, int nlen1, int nlen2, 
                                    int stValue, int vlen1, int vlen2 );
  public void endElement();
  public void text(byte[] buf, int start, int len1, int len2, boolean continuation);
  public void comment(byte[] buf, int start, int len1, int len2, boolean continuation);
  public void startDocument();
  public void endDocument();
}

The methods provide direct access to the data in the input buffer. The extra arguments allow for the buffer to be refilled cyclically - it is divided in two halves for bulk reading from the input stream. This method avoids creation of intermediate objects and other buffering.

The parsing process is subsequently significantly faster than other java parsers.

The second design aim of QUDOM is to store the data efficiently in-memory.

This is achieved by writing the xml data into a "structured" buffer. This buffer is simply an array of byte arrays. Types such as numbers and strings can be built by reading the data from a given offset. This utilizes simple but effective compression approaches developed originally within the Generic Persistent Object Model.

The result is that QDOM requires around one fifth the java memory resources of Xerces. With associated parse time reduction.

Source XML: 5.2Mb        Java Memory      Parse Time 
-------------------------------------------------------
Xerces                     20.1Mb           2400ms
QUDOM                       4.3Mb            700ms

The trade-off is that QUDOM may be Quick, but it is also Unmodifiable. But for a large number of users who simply want to be able to access external xml data, modification is not of any interest.

Transient, Value-based Data

One of the main features of an object-oriented is that objects have identity.

QUDOM does not support object identity. The objects it returns that implement the W3C DOM api are transient objects that interpret the data at a specific offset in the QUDOM buffer.

The upside of this restriction, is that the memory requirements of QUDOM are significantly reduced. The "transient" objects are created when needed to provide the data interface and can be garbage collected once unreferenced.

That QUDOM is read-only is the reason why we can get away with a "value" based approach.

To compare two QUDOM objects, it is necessary to use the equals() method since == is not likely to be true.

The QNode implementation of equals() equates the identifying address (offset) of the two QNodes. To equate "values" you must retrieve those yourself, eg:

  attr1.getValue().equals(attr2.getValue());

Using QUDOM

Once you have parsed the source XML data, you may then retrieve the QDocument object:

QDocument doc = qdom.getDocument();

Element el = doc.getElementById("someId");

Element r = doc.getDocumentElement();

...and use the standard W3C DOM api to navigate the data.

Runtime Performance

All of the emphasis to date has been on minimising initial parse times and memory usage.

No undue effort has however been put even into this task, nevertheless the results seem pretty good so far!

The use of Transient objects and continual processing of the "state" buffer will clearly have some impact on the "runtime" performance, that is the time taken to access the data within the model.

To test this I used a simple "deep iteration" of the DOM nodes with the following jython:

def timeDeep(e) :
  st = System.currentTimeMillis();
	
  els = deepIterate(e);
	
  print "full recurse took", System.currentTimeMillis() - st, "millisecs for", els, "elements";

	
def deepIterate(e) :
  ct = 0;
  c = e.getFirstChild();
	
  while c != None :
    ct = ct + 1;
    ct = ct + deepIterate(c);
    c = c.getNextSibling();
	
  return ct;

I ran the test twice, once for QUDOM and once for Xerces. For this test I thought QUDOM would do well to be within an order of magnitude of the Xerces performance, these were the results:

For QUDOM

full recurse took 4346 millisecs for 118248 elements

For Xerces

full recurse took 10795 millisecs for 230155 elements

This seemed definitely on the slow side, so I re-implemented the test in java.

With the native java control the figures improved considerably, QUDOM now completed the full recurse in 350 milliseconds, and Xerces a much improved 300 milliseconds.

So QUDOM is a little slower than Xerces in processing the structures once loaded. The QUDOM implementation has not been optimised in the slightest and several techniques are available to dramatically improve its performance, but for now I consider that it performs competitively.

Performance Update

Some analysis and minor modifications has now improved the full recurse time for QUDOM to 190ms with subsequent runs reducing the time to 150ms presumably due to hotspot optimisations.

Xerces initial recurse remains at 300ms but it was noted that subsequent runs were much faster at around 60ms. Unfortunately this performance comes at a price since Xerces memory requirements increase by 50% after the benchmark, whilst QUDOM remains unchanged.

The majority of the CPU time for QUDOM in the full recurse benchmark is in the creation of the temporary objects. There may be some scope in the future to reduce this significantly once usage patterns are better understood.

XPath Support + id() Extension

The XPath support provided in PDOM is also available to QUDOM, a small extension to the standard W3C DOM api is defined that allows a generic XPath processor to navigate each models efficiently.

This includes the id() extension to the standard XPath interpretation that allows the XML nodes to be navigated using attribute values as "foreign" keys:

Iterator res = doc.queryXPath("id('ord-123')/order-item/id(@product)@description");

Summary

QUDOM address the requirement for quick efficient access to read-only XML data.

For really huge files (> 100Mb) a persistent solution such as PDOM is recommended, but for medium sized XML data sources it provides an excellent solution.

How Can I Get QUDOM?

QUDOM is provided as part of the full Cut The Crap distribution and can be downloaded from www.cutthecrap.biz/software/downloads.html along with other Cut The Crap software.

Is It Free To Use?

QUDOM is released as "shareware". It is not restricted in any way. Developers are of course encouraged to visit the payments page and pay for a license.

Purchase of a full license for the Cut The Crap Software will remove any obligation to make an additional shareware contribution.