Fast XML parsing and filtering with Kawa
Oracle's Java group uses the JIRA bug tracker. I've been experimenting with processing and filtering the XML backup file that JIRA can generate. The obvious choice seemed to be XSLT, and it worked pretty well on an evaluation instance of JIRA. Then I got a copy of our 1.5GB production instance. Ouch. XSLT works by building an in-memory DOM of the XML document and then matching templates. There is no chance that is going to work.
Next I tried Joost, an implementation of Streaming Transformations for XML (STX). This has the look of XSLT, but is designed to support one-pass streaming. This worked, but it took about 45 minutes on my laptop, which was discouraging, though better than nothing. (It's also worth noting that STX and Joost don't seem to be actively developed, perhaps because people are focusing on the streaming mode that XSLT 3.0 will have.)
Next I tried the StAX cursor classes that ship with Java 6. StAX is claimed to support fast XML processing. It took about 20 minutes. Better, but hardly fast.
Later (after the Kawa experiments discussed below), I tried an alternative StAX processor: the Aalto XML Processor took just under 1 minute for a full copy. This is pretty good. Aalto isn't as fast as Kawa, but it is constrained by the StAX API, which isn't as performance-oriented as Kawa's private APIs. Aalto also doesn't seem to be actively maintained.
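For reference, both the JDK parser and Aalto sit behind the same StAX cursor API, so the processing loop looks roughly like the following. This is only a minimal sketch (it merely counts start tags, and the class and file names are placeholders), not the actual test program, which did a full copy:

import javax.xml.stream.*;
import java.io.*;

public class StaxScan {
  public static void main(String[] args) throws Exception {
    // Picks up whatever StAX implementation is on the classpath
    // (the JDK's built-in one, or Aalto's).
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader reader =
      factory.createXMLStreamReader(new FileInputStream(args[0]));
    long elements = 0;
    // Pull events one at a time; nothing is buffered beyond the current event.
    while (reader.hasNext()) {
      if (reader.next() == XMLStreamConstants.START_ELEMENT)
        elements++;
    }
    reader.close();
    System.out.println("start tags: " + elements);
  }
}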
The Kawa source code includes a smattering of XML tools (including an implementation of XQuery 1.0). I wrote the XML parser with careful attention to performance. For example, it avoids copying and needless object allocation. So I was curious how it would handle the 1.5GB JIRA dump. Needless to say, it didn't at first. But I tracked down and fixed the bugs.
The result was quite gratifying: About 20 seconds for a full copy.
I then modified the application to skip Issue, Action, Label, ChangeItem, and CustomFieldValue elements, including their children. That run took about 15 seconds, writing a 296MB output file.
This is the entire program:
import gnu.text.*;
import gnu.xml.*;
import gnu.lists.*;
import java.io.*;

public class KawaFilter extends FilterConsumer {
  int nesting = 0;
  int skipNesting = -1;

  /** Return true if an element with this tag (and its children) should be dropped. */
  public boolean skipElement(XName name) {
    String local = name.getLocalName();
    return "Issue".equals(local) || "Action".equals(local)
      || "Label".equals(local) || "ChangeItem".equals(local)
      || "CustomFieldValue".equals(local);
  }

  public KawaFilter(Consumer base) {
    super(base);
  }

  @Override
  public void startElement(Object type) {
    nesting++;
    if (! skipping && type instanceof XName && skipElement((XName) type)) {
      // Remember the depth at which skipping started, and turn on the
      // inherited 'skipping' flag so nested events are discarded.
      skipNesting = nesting;
      skipping = true;
    }
    super.startElement(type);
  }

  @Override
  public void endElement() {
    super.endElement();
    if (skipNesting == nesting) {
      // Leaving the element that triggered skipping: resume copying.
      skipping = false;
      skipNesting = -1;
    }
    nesting--;
  }

  public static void main(String[] args) throws Throwable {
    String inname = args[0];
    OutputStream outs = new FileOutputStream(args[1]);
    XMLPrinter xout = new XMLPrinter(outs);
    xout.setPrintXMLdecl(true);
    KawaFilter filter = new KawaFilter(xout);
    SourceMessages messages = new SourceMessages();
    xout.beginEntity(inname);
    XMLParser.parse(inname, messages, filter);
    xout.endEntity();
    messages.printAll(System.err, 20);
    xout.close();
  }
}
Most of this is boiler-plate (and should be abstracted out; a possible factoring is sketched below). The only context-specific code is the skipElement method, which is called when the start of an element is seen. The method takes an element tag (QName), and returns true if the element and its children should be skipped. The startElement and endElement methods are called by the parser. They work by setting the protected skipping field inherited from the FilterConsumer class.
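As a rough idea of what that abstraction could look like, the parse-and-print plumbing from main could move into a small runner, so only the filter itself varies. This is just a sketch that reuses the classes already shown above; the FilterRunner and FilterFactory names are invented for the example:

import gnu.text.*;
import gnu.xml.*;
import gnu.lists.*;
import java.io.*;

public class FilterRunner {
  /** Something that wraps the output Consumer in a filter. */
  public interface FilterFactory {
    FilterConsumer make(Consumer out);
  }

  /** Parse inname, run the events through the filter, write the result to outname. */
  public static void run(String inname, String outname, FilterFactory factory)
      throws Throwable {
    OutputStream outs = new FileOutputStream(outname);
    XMLPrinter xout = new XMLPrinter(outs);
    xout.setPrintXMLdecl(true);
    FilterConsumer filter = factory.make(xout);
    SourceMessages messages = new SourceMessages();
    xout.beginEntity(inname);
    XMLParser.parse(inname, messages, filter);
    xout.endEntity();
    messages.printAll(System.err, 20);
    xout.close();
  }

  public static void main(String[] args) throws Throwable {
    // Reuses KawaFilter from above; any other FilterConsumer would do.
    run(args[0], args[1], new FilterFactory() {
        public FilterConsumer make(Consumer out) { return new KawaFilter(out); }
      });
  }
}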
One could easily modify KawaFilter to rename elements or attributes, or to remove attributes. Changing skipElement so it could base its decision on the names of ancestor elements would also be easy.
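For instance, an ancestor-aware version could keep a stack of the open element names and consult it in skipElement. The following is only a sketch: the AncestorFilter name and its particular skip rule are hypothetical, and it relies on the same FilterConsumer callbacks used by KawaFilter above:

import gnu.xml.*;
import gnu.lists.*;
import java.util.ArrayDeque;

public class AncestorFilter extends KawaFilter {
  // Local names of the currently open elements, outermost first.
  private final ArrayDeque<String> openElements = new ArrayDeque<String>();

  public AncestorFilter(Consumer base) {
    super(base);
  }

  @Override
  public boolean skipElement(XName name) {
    // Hypothetical rule for illustration: drop ChangeItem elements,
    // but only when they appear somewhere inside an Action element.
    return "ChangeItem".equals(name.getLocalName())
      && openElements.contains("Action");
  }

  @Override
  public void startElement(Object type) {
    // super.startElement may call skipElement, which at that point
    // sees only the ancestors of the new element.
    super.startElement(type);
    openElements.addLast(type instanceof XName
                         ? ((XName) type).getLocalName()
                         : String.valueOf(type));
  }

  @Override
  public void endElement() {
    if (!openElements.isEmpty())
      openElements.removeLast();
    super.endElement();
  }
}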
More challenging would be to allow skipElement to base its decision on the existence or values of attributes. The reason is that startElement is called before the attributes are reported: this is for performance, so we don't have to build a data structure for the attributes. However, it would not be difficult to add an option to remember the attributes, and defer the skipElement call to the end of the element start-tag, rather than the beginning.
You run KawaFilter thus:
$ java -cp .:kawa.jar KawaFilter input.xml output.xml
For kawa.jar you need to build Kawa from the Subversion repository (until I make a binary release containing the fixed XML classes).