Fast XML parsing and filtering with Kawa

Oracle's Java group uses the JIRA bug tracker. I've been experimenting with processing and filtering the XML backup file that JIRA can generate. The obvious choice seemed to be XSLT, and it worked pretty well on an evaluation instance of JIRA. Then I got a copy of our 1.5GB production instance. Ouch. An XSLT processor works by building an in-memory DOM of the entire document and then matching templates against it; there is no chance that is going to work on a file that size.
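For reference, an XSLT run driven through the standard JAXP API looks roughly like the sketch below; the stylesheet and file names are placeholders for illustration, not the actual ones:

import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import java.io.File;

public class XsltRun {
    public static void main(String[] args) throws Exception {
        // The processor reads the whole input into an in-memory tree before
        // template matching starts, which is what breaks down at 1.5GB.
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new File("filter.xsl")));
        t.transform(new StreamSource(new File("input.xml")),
                    new StreamResult(new File("output.xml")));
    }
}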

Next I tried Joost, an implementation of Streaming Transformations for XML (STX). This has the look of XSLT, but is designed to support one-pass streaming. This worked, but it took about 45 minutes on my laptop, which was discouraging, though better than nothing. (It's also worth noting that STX and Joost don't seem to be actively developed, perhaps because people are focusing on the streaming mode that XSLT 3.0 will have.)

Next I tried the StAX cursor classes that ship with Java 6. StAX is claimed to support fast XML processing. It took about 20 minutes. Better, but hardly fast.
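To give a sense of what cursor-style StAX code looks like, here is a rough sketch of a plain copy loop using the javax.xml.stream classes. It glosses over namespaces, comments, CDATA, and the XML declaration, so treat it as an illustration rather than the exact program that was timed:

import javax.xml.stream.*;
import java.io.*;

public class StaxCopy {
    public static void main(String[] args) throws Exception {
        // Pull events from the reader and echo them to the writer one at a time.
        XMLStreamReader in = XMLInputFactory.newInstance()
            .createXMLStreamReader(new FileInputStream(args[0]));
        XMLStreamWriter out = XMLOutputFactory.newInstance()
            .createXMLStreamWriter(new FileOutputStream(args[1]), "UTF-8");
        while (in.hasNext()) {
            switch (in.next()) {
            case XMLStreamConstants.START_ELEMENT:
                out.writeStartElement(in.getLocalName());
                for (int i = 0; i < in.getAttributeCount(); i++)
                    out.writeAttribute(in.getAttributeLocalName(i),
                                       in.getAttributeValue(i));
                break;
            case XMLStreamConstants.END_ELEMENT:
                out.writeEndElement();
                break;
            case XMLStreamConstants.CHARACTERS:
                out.writeCharacters(in.getText());
                break;
            }
        }
        out.close();
        in.close();
    }
}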

Later (after the Kawa experiments discussed below), I tried an alternative StAX processor: the Aalto XML Processor, which took just under 1 minute for a full copy. That is pretty good. While not as fast as Kawa, Aalto is constrained by the StAX API, which isn't optimized for performance as well as Kawa's private APIs are. Aalto also doesn't seem to be actively maintained.
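Since Aalto implements the standard StAX interfaces, trying it only means selecting its factories. In the sketch above, the XMLInputFactory.newInstance() and XMLOutputFactory.newInstance() calls would become something like the following (the class names assume the FasterXML packaging of Aalto, so check them against the version you have):

// Instantiate Aalto's StAX factories directly instead of the JDK defaults.
XMLStreamReader in = new com.fasterxml.aalto.stax.InputFactoryImpl()
    .createXMLStreamReader(new FileInputStream(args[0]));
XMLStreamWriter out = new com.fasterxml.aalto.stax.OutputFactoryImpl()
    .createXMLStreamWriter(new FileOutputStream(args[1]), "UTF-8");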

The Kawa source code includes a smattering of XML tools (including an implementation of XQuery 1.0). I wrote the XML parser with careful attention to performance; for example, it avoids copying and needless object allocation. So I was curious how it would handle the 1.5GB JIRA dump. Needless to say, it didn't, at first. But I tracked down and fixed the bugs.

The result was quite gratifying: About 20 seconds for a full copy. I then modified the application to skip Issue, Action, Label, ChangeItem, and CustomFieldValue elements, including their children. That run took about 15 seconds, writing a 296MB output file.

This is the entire program:

import gnu.text.*;
import gnu.xml.*;
import gnu.lists.*;
import java.io.*;

public class KawaFilter extends FilterConsumer {
    int nesting = 0;       // current element depth
    int skipNesting = -1;  // depth of the element being skipped, or -1 if none

    /** Return true if an element with this tag (and all its children)
     * should be dropped from the output. */
    public boolean skipElement(XName name) {
        String local = name.getLocalName();
        // Note: the == comparison assumes the parser hands us interned local
        // names; String.equals would be the safer general-purpose choice.
        if (local=="Issue" || local=="Action" || local=="Label"
            || local=="ChangeItem" || local=="CustomFieldValue")
            return true;
        return false;
    }

    public KawaFilter(Consumer base) {
        super(base);
    }

    @Override
    public void startElement(Object type) {
        nesting++;
        // Start skipping when we hit an element we want to drop; skipNesting
        // records the depth so the matching end tag can turn skipping off.
        if (! skipping && type instanceof XName
                && skipElement((XName) type)) {
            skipNesting = nesting;
            skipping = true;
        }
        super.startElement(type);
    }

    @Override
    public void endElement() {
        super.endElement();
        if (skipNesting == nesting) {
            // End of the skipped element: resume copying.
            skipping = false;
            skipNesting = -1;
        }
        nesting--;
    }

    public static void main(String[] args) throws Throwable {
        String inname = args[0];
        OutputStream outs = new FileOutputStream(args[1]);
        XMLPrinter xout = new XMLPrinter(outs);
        xout.setPrintXMLdecl(true);
        KawaFilter filter = new KawaFilter(xout);
        SourceMessages messages = new SourceMessages();
        xout.beginEntity(inname);
        // Parse events flow through the filter and into the XMLPrinter.
        XMLParser.parse(inname, messages, filter);
        xout.endEntity();
        messages.printAll(System.err, 20);  // print any parse diagnostics
        xout.close();
    }
}

Most of this is boiler-plate (and should be abstracted out). The only context-specific code is the skipElement method, which is called when the start of an element is seen. The method takes an element tag (a QName, represented here as an XName) and returns true if the element and its children should be skipped. The startElement and endElement methods are called by the parser. They work by setting the protected skipping field in the FilterConsumer class.

One could easily modify KawaFilter to rename elements or attributes, or to remove attributes. Changing skipElement so it could base its decision on the names of ancestor elements would also be easy; a sketch follows below. More challenging would be to allow skipElement to base its decision on the existence or values of attributes. The reason is that startElement is called before the attributes are seen: this is for performance, so we don't have to build a data structure for the attributes. However, it would not be difficult to add an option to remember the attributes, and defer the skipElement call to the end of the element start-tag, rather than the beginning.
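Here is a rough sketch of the ancestor-based variant as a hypothetical subclass (AncestorAwareFilter is not part of Kawa, and the skip rule at the bottom is just an example):

import gnu.xml.*;
import gnu.lists.*;
import java.util.ArrayDeque;

public class AncestorAwareFilter extends KawaFilter {
    // Local names of the currently open ancestor elements, innermost first.
    final ArrayDeque<String> ancestors = new ArrayDeque<String>();

    public AncestorAwareFilter(Consumer base) {
        super(base);
    }

    @Override
    public void startElement(Object type) {
        super.startElement(type);   // runs the skipElement check as before
        ancestors.push(type instanceof XName
                       ? ((XName) type).getLocalName()
                       : String.valueOf(type));
    }

    @Override
    public void endElement() {
        ancestors.pop();
        super.endElement();
    }

    @Override
    public boolean skipElement(XName name) {
        // Example rule: drop Label elements only when they occur inside an Issue.
        return "Label".equals(name.getLocalName()) && ancestors.contains("Issue");
    }
}

Because the stack only holds local names, this costs almost nothing per element, and path-style conditions can be expressed without building any tree.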

You run KawaFilter thus:

$ java -cp .:kawa.jar KawaFilter input.xml output.xml

For kawa.jar you need to build Kawa from the Subversion repository (until I make a binary release containing the fixed XML classes).
