MiniDOM is an XML parsing utility that can dramatically simplify your SAX parsing tasks,
and it does so through a small, straightforward API. The utility is available with source code under the GNU LGPL.
What is wrong with DOM is the reason SAX was needed in the first place.
More helpfully: a DOM parser has to build an object model of the whole document. If the document is very large, this may simply not be practical. DOM is generally accepted to be slower and less scalable than a SAX-based approach.
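To make the cost concrete, here is a minimal sketch using the standard JAXP DOM API. The class name DomWholeDocument and the sample XML are made up for illustration. Note that parse only returns once the complete tree is in memory, even though the caller only wants a count.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of MiniDOM.
class DomWholeDocument {
    /** Builds the full DOM tree for the XML text, then counts its elements. */
    static int countElements(String xml) {
        try {
            // parse() returns only after every node of the document has been built.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches every element in the document.
            return doc.getElementsByTagName("*").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<customers><customer/><customer/></customers>"));
    }
}
```

For a three-element document this is harmless; for a multi-gigabyte file the same call has to hold every node at once.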
SAX, on the other hand, requires considerably more expertise on the part of the programmer.
This is because a SAX parser fires off events, such as startElement and endElement,
to registered handlers as XML elements are parsed. A SAX parser
does the minimum necessary: it parses the raw XML and triggers methods on handlers, passing
in structured content such as the list of all the attributes with startElement.
Furthermore, a validating SAX parser will check the XML against a specification if one is provided.
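The event flow can be seen with a few lines of plain JAXP SAX code. This is only an illustrative sketch: the class name SaxEventTrace and the sample XML are assumptions, and the handler does nothing but record each event it receives.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of MiniDOM.
class SaxEventTrace {
    /** Parses the XML text and returns one entry per start/end event. */
    static List<String> trace(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The attributes arrive together with the start event.
                StringBuilder sb = new StringBuilder("start " + qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    sb.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
                }
                events.add(sb.toString());
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : trace("<customer telno='02079250918'><address/></customer>")) {
            System.out.println(event);
        }
    }
}
```

The handler sees a flat sequence of events; nothing in it says which element contains which.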
The problem with SAX, though, is that it provides no structural information beyond this raw XML element level. To process a document with a SAX parser, the application must maintain its own state and context data in order to interpret each new piece of data as it arrives.
This is not, of course, an impossible task. Several design patterns can be used to make it quite manageable. But it is nevertheless a pain.
Any analysis of SAX-based parsing applications reveals that they spend much of their time building what could be considered specialized portions of a DOM model.
Typically, on some startElement event, a new context is created and then filled
in by subsequent events until the matching endElement event triggers the final
computation. This is extremely common.
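As an illustration of that pattern, the following sketch collects address lines for each customer with a hand-written SAX handler. The class name CustomerSaxHandler, its fields and the sample XML are assumptions made for this example. The context is opened on the customer start event, filled by the nested line elements, and consumed on the customer end event.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of MiniDOM.
class CustomerSaxHandler extends DefaultHandler {
    final List<String> summaries = new ArrayList<>();
    private String name;         // context opened by <customer>
    private List<String> lines;  // filled in by nested <line> elements
    private StringBuilder text;  // non-null only while inside a <line>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("customer")) {
            name = atts.getValue("firstname") + " " + atts.getValue("lastname");
            lines = new ArrayList<>();
        } else if (qName.equals("line") && lines != null) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so it is accumulated.
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("line") && text != null) {
            lines.add(text.toString());
            text = null;
        } else if (qName.equals("customer")) {
            // The matching end event triggers the final computation.
            summaries.add(name + ": " + String.join(", ", lines));
            lines = null;
        }
    }

    /** Parses the XML text and returns one summary string per customer. */
    static List<String> summarize(String xml) {
        CustomerSaxHandler handler = new CustomerSaxHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.summaries;
    }

    public static void main(String[] args) {
        String xml = "<customers><customer lastname='Brown' firstname='John'>"
            + "<address><line>XML Processing</line><line>Someplace</line></address>"
            + "</customer></customers>";
        summarize(xml).forEach(System.out::println);
    }
}
```

Three fields of bookkeeping are needed for one small record type, and every further element of interest adds more.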
MiniDOM specifically addresses this common pattern.
A MiniDOM object uses a SAX parser and lets you register handlers for specific tags.
When such a tag has been processed, its handlers are called with an Element object representing the node of interest:
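import java.io.FileInputStream;

// MiniDOM and Element are supplied by the MiniDOM distribution.
public class HandlerApp {
    // Declared broadly: opening the file and parsing it can both fail.
    public void processFile(String fname) throws Exception {
        MiniDOM md = new MiniDOM();
        // Each handler simply forwards the element to a method of this class.
        md.register("TAG1", new MiniDOM.IHandler() {
            public void handleIt(Element elem) {
                processTAG1(elem);
            }
        });
        md.register("TAG2", new MiniDOM.IHandler() {
            public void handleIt(Element elem) {
                processTAG2(elem);
            }
        });
        md.process(new FileInputStream(fname));
    }

    public void processTAG1(Element elem) {
        //...
    }

    public void processTAG2(Element elem) {
        //...
    }
}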
In the example, anonymous inner classes are used to redirect to the named methods. In theory, different objects might be registered for more complex tasks. However, I have not found any XML processing task for which this simple approach was insufficient.
Whether a document contains 10 or 10 million such elements, the DOM disadvantage of building the whole document model does not arise: each element is delivered for processing at the granularity required.
Moreover, while no handler is active, no unnecessary data objects are built or retained.
There are some situations when a larger context is required. For example, a containing XML element may specify attributes needed to process contained elements correctly.
For this reason, MiniDOM specifies the IPeek interface:
public interface IPeek {
    public void peek(Element elem);
}
Any registered IPeek handlers are called on the startElement
event and registered IHandler handlers on the endElement event.
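One way such a dispatcher could be built on top of SAX is sketched below. This is not MiniDOM's source code: Subtree, TagDispatcher, onStart and onEnd are hypothetical names invented for the illustration, and the sample XML is made up. Start callbacks see an element with only its attributes, end callbacks receive the completed subtree, and nothing is allocated while no callback is interested.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for an element subtree; not MiniDOM's Element. */
class Subtree {
    final String tag;
    final Map<String, String> attributes = new HashMap<>();
    final List<Subtree> children = new ArrayList<>();
    final StringBuilder text = new StringBuilder();
    Subtree(String tag) { this.tag = tag; }
}

/** Hypothetical dispatcher: start callbacks peek, end callbacks handle. */
class TagDispatcher extends DefaultHandler {
    private final Map<String, Consumer<Subtree>> peekers = new HashMap<>();
    private final Map<String, Consumer<Subtree>> handlers = new HashMap<>();
    // Elements currently being built; empty while no end handler is active.
    private final Deque<Subtree> open = new ArrayDeque<>();

    void onStart(String tag, Consumer<Subtree> peeker) { peekers.put(tag, peeker); }
    void onEnd(String tag, Consumer<Subtree> handler) { handlers.put(tag, handler); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        boolean building = !open.isEmpty() || handlers.containsKey(qName);
        Consumer<Subtree> peeker = peekers.get(qName);
        if (!building && peeker == null) return; // nobody is interested: allocate nothing
        Subtree node = new Subtree(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            node.attributes.put(atts.getQName(i), atts.getValue(i));
        }
        if (peeker != null) peeker.accept(node); // children have not been parsed yet
        if (building) {
            if (!open.isEmpty()) open.peek().children.add(node);
            open.push(node);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) open.peek().text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) return; // this element was never being built
        Subtree done = open.pop();
        Consumer<Subtree> handler = handlers.get(qName);
        if (handler != null) handler.accept(done);
    }

    /** Runs a SAX parse of the XML text through this dispatcher. */
    void parse(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), this);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        final String[] currency = new String[1];
        TagDispatcher d = new TagDispatcher();
        // The containing element supplies context needed by the contained ones.
        d.onStart("batch", n -> currency[0] = n.attributes.get("currency"));
        d.onEnd("item", n -> System.out.println(n.attributes.get("price") + " " + currency[0]));
        d.parse("<batch currency='GBP'><skip/><item price='5'/><item price='7'/></batch>");
    }
}
```

In this sketch the stack is empty whenever parsing is outside a handled element, so a finished subtree becomes garbage as soon as its handler returns, unless the handler keeps a reference to it.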
MiniDOM is particularly well suited to XML source files that contain a number
of composite nodes, that is, nodes that do not simply contain
attribute data.
Consider the following:
<customers>
    <customer lastname='Brown' firstname='John' telno='02079250918'>
        <address postcode='POS TCODE'>
            <line>XML Processing</line>
            <line>Not So Easy Street</line>
            <line>Someplace</line>
        </address>
    </customer>
    ...
</customers>
Suppose that the source file is called customers.xml and that
the task is to generate an HTML file without bringing in any XSLT
experts.
The objective is to make each record look something like this:
Address: XML Processing, Not So Easy Street, Someplace, POS TCODE
Telephone: 02079250918
Here is a Java method that will do something sensible:
// Declared broadly: opening the files and parsing the XML can both fail.
public void transform(String inxml, String outhtml) throws Exception {
    FileOutputStream outstr = new FileOutputStream(outhtml);
    final PrintWriter f_outp = new PrintWriter(outstr);
    f_outp.println("<html><head><title>Customer List</title></head><body>");
    f_outp.println("\t<h1>Customer List</h1>");
    MiniDOM md = new MiniDOM();
    md.register("customer", new MiniDOM.IHandler() {
        // example output
        // <h2>John Brown</h2>
        // <p><b>Address: </b>XML Processing, Not So Easy Street, Someplace, POS TCODE</p>
        // <p><b>Telephone: </b>02079250918</p>
        public void handleIt(Element elem) {
            f_outp.println("\t<h2>"
                + elem.getAttributeValue("firstname")
                + " "
                + elem.getAttributeValue("lastname")
                + "</h2>");
            f_outp.print("\t<p><b>Address: </b>");
            Element addr = elem.getElementByTag("address");
            Iterator lines = addr.getElementsByTag("line");
            // Each address line is followed by a comma; the postcode ends the list.
            while (lines.hasNext()) {
                Element line = (Element) lines.next();
                f_outp.print(line.getValue() + ", ");
            }
            f_outp.println(addr.getAttributeValue("postcode") + "</p>");
            f_outp.println("\t<p><b>Telephone: </b>" + elem.getAttributeValue("telno") + "</p>");
        }
    });
    md.process(new FileInputStream(inxml));
    f_outp.println("