Why And How To Use MiniDOM

Author: Martyn Cutcher

email martyn@cutthecrap.biz

website www.cutthecrap.biz

What's MiniDOM?

MiniDOM is an XML parsing utility that can dramatically simplify your SAX parsing tasks.

It does this very simply and straightforwardly. The utility is available under GNU LGPL with source code if you are interested.

What's Wrong With DOM or SAX?

Well, the reason why SAX was needed is what is wrong with DOM.

Or, more helpfully, a DOM parser has to build an object model of the whole document. If the document is very large, this may simply not be practical. DOM is generally accepted to be slower and less scalable than a SAX-based approach, because its memory use grows with the size of the document.
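To make the cost concrete, here is a minimal sketch using the standard JAXP DOM API. The class name DomSketch and the customer element name are assumptions chosen for illustration. The point is that parse() returns only once the complete tree is in memory, even if the application only wants a count.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomSketch {

  // Counts <customer> elements; the element name is a hypothetical example.
  public static int countCustomers(String xml) throws Exception {
    // parse() builds the object model of the whole document before returning.
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    NodeList customers = doc.getElementsByTagName("customer");
    return customers.getLength();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countCustomers("<customers><customer/><customer/></customers>"));
  }
}
```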

SAX on the other hand requires considerably more expertise on the part of the programmer.

This is because a SAX parser fires off events, such as startElement and endElement, to registered handlers as XML elements are parsed. A SAX parser does the minimum necessary: it parses the raw XML and calls methods on the handlers, passing in structured content such as the list of attributes that accompanies startElement.

Furthermore, a validating SAX parser will check the XML against a specification if one is provided.

The problem with SAX, though, is that it provides no structural information beyond this raw element level. To process a document with a SAX parser, the application must maintain its own state and context data in order to interpret new data as it arrives.

This is not, of course, an impossible task. Several design patterns can be used to make it quite manageable. But it is nevertheless a pain.

Any analysis of SAX-based parsing applications reveals that they spend much of their time building what amount to specialized portions of a DOM model.

Typically, on some startElement event, a new context is created and filled in by subsequent events until the matching endElement event triggers the final computation. This is extremely common.
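The pattern can be sketched with the standard SAX API alone. This is an illustration, not MiniDOM code: the class names and the customer and line element names are assumptions. Note how much of the handler is bookkeeping, with a context opened on startElement, filled by later events, and consumed on endElement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RawSaxSketch {

  // Builds one summary string per <customer> using hand-kept state.
  static class CustomerHandler extends DefaultHandler {
    final List<String> results = new ArrayList<>();
    private StringBuilder current;  // context opened by <customer>
    private StringBuilder text;     // character data of the open <line>
    private boolean firstLine;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
      if (qName.equals("customer")) {
        current = new StringBuilder(
            atts.getValue("firstname") + " " + atts.getValue("lastname") + ":");
        firstLine = true;
      } else if (qName.equals("line") && current != null) {
        text = new StringBuilder();
      }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
      if (qName.equals("line") && text != null) {
        current.append(firstLine ? " " : ", ").append(text);
        firstLine = false;
        text = null;
      } else if (qName.equals("customer")) {
        // The matching end event triggers the final computation.
        results.add(current.toString());
        current = null;
      }
    }
  }

  public static List<String> parse(String xml) throws Exception {
    CustomerHandler h = new CustomerHandler();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), h);
    return h.results;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<customers><customer firstname='John' lastname='Brown'>"
        + "<address><line>A</line><line>B</line></address></customer></customers>";
    System.out.println(parse(xml));  // prints [John Brown: A, B]
  }
}
```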

MiniDOM specifically addresses this common pattern.

So What Is MiniDOM Precisely?

A MiniDOM object uses a SAX parser and lets you register handlers for specific tags.

When a registered tag has been parsed, its handlers are called with an Element object representing the node of interest:

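import java.io.FileInputStream;
import java.io.IOException;

public class HandlerApp {

  // FileInputStream can fail to open the file, so the exception is declared.
  public void processFile(String fname) throws IOException {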
    MiniDOM md = new MiniDOM();

    md.register("TAG1", new MiniDOM.IHandler() {
      public void handleIt(Element elem) {
        processTAG1(elem);
      }
    } );    

    md.register("TAG2", new MiniDOM.IHandler() {
      public void handleIt(Element elem) {
        processTAG2(elem);
      }
    } );

    md.process(new FileInputStream(fname));
  }

  public void processTAG1(Element elem) {
    //...
  }

  public void processTAG2(Element elem) {
    //...
  }
}

In the example, anonymous inner classes redirect to the named methods. In theory, different objects might be registered for more complex tasks. I have not, however, found any XML processing task for which this simple approach was insufficient.

Whether a document contains 10 or 10 million such elements, there is none of the DOM disadvantage of building the whole document model. Instead, each element is delivered for processing at the granularity required.

Moreover, if at any point no handlers would be active, then no unnecessary data objects are built and retained.
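The following is a rough sketch of how such a mechanism could work on top of SAX. It is not MiniDOM's implementation: SubtreeSketch, Node, and the nodesBuilt counter are all names invented here, and the sketch ignores nested registered tags. It shows the two properties described above: a node tree is built only while a registered element is open, and each completed subtree is handed over and then dropped.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SubtreeSketch extends DefaultHandler {

  // Stand-in for an element node; NOT MiniDOM's Element class.
  public static class Node {
    public final String tag;
    public final Map<String, String> attrs = new HashMap<>();
    public final List<Node> children = new ArrayList<>();
    public final StringBuilder text = new StringBuilder();
    Node(String tag) { this.tag = tag; }
  }

  private final Map<String, Consumer<Node>> handlers = new HashMap<>();
  private final Deque<Node> open = new ArrayDeque<>();  // empty when no handler is active
  public int nodesBuilt = 0;                            // for demonstration only

  public void register(String tag, Consumer<Node> handler) {
    handlers.put(tag, handler);
  }

  @Override
  public void startElement(String uri, String local, String qName, Attributes atts) {
    // Outside any registered element, nothing is built or retained.
    if (open.isEmpty() && !handlers.containsKey(qName)) return;
    Node n = new Node(qName);
    for (int i = 0; i < atts.getLength(); i++) {
      n.attrs.put(atts.getQName(i), atts.getValue(i));
    }
    if (!open.isEmpty()) open.peek().children.add(n);
    open.push(n);
    nodesBuilt++;
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (!open.isEmpty()) open.peek().text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String local, String qName) {
    if (open.isEmpty()) return;
    Node n = open.pop();
    // Deliver the outermost captured element, then let it be garbage collected.
    if (open.isEmpty()) handlers.get(n.tag).accept(n);
  }

  public void process(String xml) throws Exception {
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), this);
  }

  public static void main(String[] args) throws Exception {
    SubtreeSketch s = new SubtreeSketch();
    s.register("item", n -> System.out.println(n.attrs.get("id") + " has "
        + n.children.size() + " children"));
    s.process("<root><skip><x/></skip><item id='1'><a/><b/></item></root>");
    System.out.println("nodes built: " + s.nodesBuilt);
  }
}
```

In the sample input, root, skip and x never become Node objects, because no handler is active when they are parsed.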

Minor Complication

There are some situations when a larger context is required. For example, a containing XML element may specify attributes needed to process contained elements correctly.

For this reason, MiniDOM specifies the IPeek interface:

public interface IPeek {
  public void peek(Element elem);
}

Any registered IPeek handlers are called on the startElement event and registered IHandler handlers on the endElement event.
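The ordering matters because the start event of a container arrives before any of its children, while the end event of a child arrives once that child is complete. The sketch below shows that ordering with plain SAX rather than with MiniDOM's own registration calls, which are not reproduced here. The class name PeekOrderSketch and the region attribute are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PeekOrderSketch {

  public static List<String> run(String xml) throws Exception {
    final List<String> out = new ArrayList<>();
    DefaultHandler h = new DefaultHandler() {
      private String region;  // context captured from the containing element

      @Override
      public void startElement(String uri, String local, String qName, Attributes atts) {
        // "Peek" moment: only the start tag and its attributes are known.
        if (qName.equals("customers")) region = atts.getValue("region");
      }

      @Override
      public void endElement(String uri, String local, String qName) {
        // "Handle" moment: the element is complete; the saved context is used.
        if (qName.equals("customer")) out.add("customer in " + region);
      }
    };
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), h);
    return out;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(run("<customers region='UK'><customer/><customer/></customers>"));
  }
}
```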

A Simple Example

MiniDOM is particularly suitable where an XML source file has a number of composite nodes, that is, nodes that do not simply contain attribute data.

Consider the following:

<customers>
  <customer lastname='Brown' firstname='John' telno='02079250918'>
    <address postcode='POS TCODE'>
      <line>XML Processing</line>
      <line>Not So Easy Street</line>
      <line>Someplace</line>
    </address>
  </customer>
  ...
</customers>

Suppose that the source file is called customers.xml and that the task is to generate an HTML file without bringing in any XSLT experts.

The objective is to make each record look something like this:

John Brown

Address: XML Processing, Not So Easy Street, Someplace, POS TCODE

Telephone: 02079250918

Here is a Java method that will do something sensible:

public void transform(String inxml, String outhtml) throws IOException {
  FileOutputStream outstr = new FileOutputStream(outhtml);
  final PrintWriter f_outp = new PrintWriter(outstr);
  
  f_outp.println("<html><title>Customer List</title><body>");
  f_outp.println("\t<h1>Customer List</h1>");

  MiniDOM md = new MiniDOM();
  
  md.register("customer", new MiniDOM.IHandler() {
  
    // example output
    // <h2>John Brown</h2>
    // <p><b>Address: </b>XML Processing, Not So Easy Street, Someplace, POS TCODE</p>
    // <p><b>Telephone: </b>02079250918</p>
    
    public void handleIt(Element elem) {
      f_outp.println("\t<h2>"
                     + elem.getAttributeValue("firstname")
                     + " "
                     + elem.getAttributeValue("lastname")
                     +"</h2>");

      f_outp.print("\t<p><b>Address: </b>");
      
      Element addr = elem.getElementByTag("address");
      Iterator lines = addr.getElementsByTag("line");
      while (lines.hasNext()) {
        Element line = (Element) lines.next();
        f_outp.print(line.getValue() + ", ");
      }     
      f_outp.println(addr.getAttributeValue("postcode") + "</p>");

      f_outp.println("\t<p><b>Telephone: </b>" + elem.getAttributeValue("telno") + "</p>");
    }
  } );
  
  md.processFile(new FileInputStream(inxml));
  
  // Close the tags opened at the top, then flush and close the output.
  f_outp.println("</body></html>");
  f_outp.close();
  outstr.close();
}

One customer node is processed at a time. No state information needs to be retained between processing requests, save for the single final PrintWriter.

XPath Extension

A recent addition to MiniDOM is some support for XPath syntax.

There is no intention at present to implement XPath fully, but it is a reasonable syntax for navigating Element objects and for registering handlers more precisely.

Here is the same example but utilising XPath constructs, in the register call and in the getElementValues and getAttributeValue lookups:

public void transform(String inxml, String outhtml) throws IOException {
  FileOutputStream outstr = new FileOutputStream(outhtml);
  final PrintWriter f_outp = new PrintWriter(outstr);
  
  f_outp.println("<html><title>Customer List</title><body>");
  f_outp.println("\t<h1>Customer List</h1>");

  MiniDOM md = new MiniDOM();
  
  md.register("/customers/customer", new MiniDOM.IHandler() {
  
    // example output
    // <h2>John Brown</h2>
    // <p><b>Address: </b>XML Processing, Not So Easy Street, Someplace, POS TCODE</p>
    // <p><b>Telephone: </b>02079250918</p>
    
    public void handleIt(Element elem) {
      f_outp.println("\t<h2>"
                     + elem.getAttributeValue("firstname")
                     + " "
                     + elem.getAttributeValue("lastname")
                     +"</h2>");

      f_outp.print("\t<p><b>Address: </b>");
      
      Iterator lines = elem.getElementValues("/address/line");
      while (lines.hasNext()) {
        f_outp.print(lines.next() + ", ");
      }     
      f_outp.println(elem.getAttributeValue("/address@postcode") + "</p>");

      f_outp.println("\t<p><b>Telephone: </b>" + elem.getAttributeValue("telno") + "</p>");
    }
  } );
  
  md.processFile(new FileInputStream(inxml));
  
  // Close the tags opened at the top, then flush and close the output.
  f_outp.println("</body></html>");
  f_outp.close();
  outstr.close();
}

XPath navigation may prove very powerful when processing complex Elements, though I have yet to find a practical example that really needs it.

How Can I Get MiniDOM?

MiniDOM can be downloaded from www.cutthecrap.biz/software/downloads.html along with other Cut The Crap software.