body { margin:0; padding:0; font-family:times; font-size:9pt; } p { margin-top:5; margin-bottom:5; font-weight:plain; font-family:serif; font-size:9pt; } h1 { margin-top:5; font-size:16pt; font-weight:bold; text-align:center; font-family:sans-serif; } h2 { margin-top:20; font-size:10pt; font-weight:bold; text-align:left; font-family:sans-serif; } pre { color:#993333; font-size:9pt; } code { color:#993333; } .section { text-align:justify; padding:5; }
QUDOM stands for Quick Unmodifiable Document Object Model.
QUDOM implements the W3C DOM API, but will throw a QNoModificationException
if any methods are used that are intended to modify the data.
The only way that data can be added to QUDOM is via the QDOM
constructor that accepts an InputStream of xml data.
Here is some example java code:
import cutthecrap.qudom.*;
QDOM qd = new QDOM(new FileInputStream("D:/testxml/opera.xtm"));
QDocument doc = qd.getDocument();
The QDocument class implements the org.w3c.dom.Document interface.
QUDOM is targetted at users who need quick, low cost access to xml data. It
is optimised in two areas. First it includes a simple and efficient parser.
QParser defines a "SAX-like" handler interface. Here it is:
public interface Handler {
public void startElement(byte[] buf, int start, int len1, int len2);
public void attribute(byte[] buf, int stName, int nlen1, int nlen2,
int stValue, int vlen1, int vlen2 );
public void endElement();
public void text(byte[] buf, int start, int len1, int len2, boolean continuation);
public void comment(byte[] buf, int start, int len1, int len2, boolean continuation);
public void startDocument();
public void endDocument();
}
The methods provide direct access to the data in the input buffer. The extra arguments allow for the buffer to be refilled cyclically - it is divided in two halves for bulk reading from the input stream. This method avoids creation of intermediate objects and other buffering.
The parsing process is subsequently significantly faster than other java parsers.
The second design aim of QUDOM is to store the data efficiently in-memory.
This is achieved by writing the xml data into a "structured" buffer. This buffer is simply
an array of byte arrays. Types such as numbers and strings can be
built by reading the data from a given offset. This utilizes simple but effective compression
approaches developed originally within the Generic Persistent Object Model.
The result is that QDOM requires around one fifth the java memory
resources of Xerces. With associated parse time reduction.
Source XML: 5.2Mb Java Memory Parse Time ------------------------------------------------------- Xerces 20.1Mb 2400ms QUDOM 4.3Mb 700ms
The trade-off is that QUDOM may be Quick, but it is also
Unmodifiable. But for a large number of users who simply want to be
able to access external xml data, modification is not of any interest.
One of the main features of an object-oriented is that objects have
identity.
QUDOM does not support object identity. The objects it returns that
implement the W3C DOM api are transient objects that interpret the
data at a specific offset in the QUDOM buffer.
The upside of this restriction, is that the memory requirements of QUDOM
are significantly reduced. The "transient" objects are created when needed to provide the
data interface and can be garbage collected once unreferenced.
That QUDOM is read-only is the reason why we can get away
with a "value" based approach.
To compare two QUDOM objects, it is necessary to use the equals()
method since == is not likely to be true.
The QNode implementation of equals() equates the identifying
address (offset) of the two QNodes. To equate "values" you must retrieve
those yourself, eg:
attr1.getValue().equals(attr2.getValue());
Once you have parsed the source XML data, you may then retrieve the
QDocument object:
QDocument doc = qdom.getDocument();
Element el = doc.getElementById("someId");
Element r = doc.getDocumentElement();
...and use the standard W3C DOM api to navigate the data.
All of the emphasis to date has been on minimising initial parse times and memory usage.
No undue effort has however been put even into this task, nevertheless the results seem pretty good so far!
The use of Transient objects and continual processing of the "state" buffer
will clearly have some impact on the "runtime" performance, that is the time taken to
access the data within the model.
To test this I used a simple "deep iteration" of the DOM nodes with the
following jython:
def timeDeep(e) :
st = System.currentTimeMillis();
els = deepIterate(e);
print "full recurse took", System.currentTimeMillis() - st, "millisecs for", els, "elements";
def deepIterate(e) :
ct = 0;
c = e.getFirstChild();
while c != None :
ct = ct + 1;
ct = ct + deepIterate(c);
c = c.getNextSibling();
return ct;
I ran the test twice, once for QUDOM and once for Xerces. For
this test I thought QUDOM would do well to be within an order of magnitude of
the Xerces performance, these were the results:
For QUDOM
full recurse took 4346 millisecs for 118248 elements
For Xerces
full recurse took 10795 millisecs for 230155 elements
This seemed definitely on the slow side, so I re-implemented the test in java.
With the native java control the figures improved considerably,
QUDOM now completed the full recurse in 350 milliseconds, and Xerces
a much improved 300 milliseconds.
So QUDOM is a little slower than Xerces in processing the
structures once loaded. The QUDOM implementation has not been optimised in
the slightest and several techniques are available to dramatically improve its performance,
but for now I consider that it performs competitively.
Some analysis and minor modifications has now improved the full recurse time for QUDOM
to 190ms with subsequent runs reducing the time to 150ms presumably due
to hotspot optimisations.
Xerces initial recurse remains at 300ms but it was noted that subsequent
runs were much faster at around 60ms. Unfortunately this performance comes at a price
since Xerces memory requirements increase by 50% after the benchmark, whilst
QUDOM remains unchanged.
The majority of the CPU time for QUDOM in the full recurse benchmark
is in the creation of the temporary objects. There may be some scope in the future to
reduce this significantly once usage patterns are better understood.
id() ExtensionThe XPath support provided in PDOM is also
available to QUDOM, a small extension to the standard W3C DOM api
is defined that allows a generic XPath processor to navigate each models efficiently.
This includes the id() extension to the standard XPath interpretation
that allows the XML nodes to be navigated using attribute values as "foreign"
keys:
Iterator res = doc.queryXPath("id('ord-123')/order-item/id(@product)@description");
QUDOM address the requirement for quick efficient access to read-only XML data.
For really huge files (> 100Mb) a persistent solution such as PDOM is recommended,
but for medium sized XML data sources it provides an excellent solution.
QUDOM is provided as part of the full Cut The Crap distribution and can be downloaded from www.cutthecrap.biz/software/downloads.html along with other Cut The Crap software.
QUDOM is released as "shareware". It is not restricted in any way. Developers are of course encouraged to visit the payments page and pay for a license.
Purchase of a full license for the Cut The Crap Software will remove any obligation to make an additional shareware contribution.