Updating-Nodes

Extending Qexo/Kawa for updates

A number of people are interesing in extending XQuery for updates. Here are some useful notes.

Updating means at least two different things: Modifying an in-core node object, and modifying a node in a persistent xml data database. They're very different. Let's start with the former.

Qexo's node model

You might want to read the gnu.lists package descriptor for an overview of the concepts of Kawa's sequence and node objects. A node, in the XML sense, is represented as a pair of an AbstractSequence and an index. The index (a position value) is just a unique number managed by the AbstractSequence. There are a number of implementation classes that extend AbstractSequence, and use different ways of managing position indexes. The one used for XML nodes is a NodeTree, which is an extension of TreeList. The nodes of a document or document fragment are all in a a single NodeTree; each node is identified by a position index, which basically an index in the TreeList's data array, but with the lower-order bit used as a special flag. (See the above-mentioned descriptor in gnu.list.) When we need to create an object for a node, we use a KNode object. The idea is that most nodes aren't actively referenced, so we don't need an actual KNode object, which saves a lot of space.

Updating nodes in-place

To implement updating a node object in-memory we need to finish the update/insert/delete abilities in gnu.lists.TreeList. The latter class is basically a gap-vector (as used in emacs and Swing), but the data structures are more complicated because it stores a hierarchy, rather than just characters. Once we can update the TreeList, we will need an extra level of indirection. The reason is that node identity is tied to the position indexes, but editing a NodeTree causes the nodes in it move around. The solution is to use either StableVector or something similar. Unfortunately, StableVector doesn't currently support TreeList. Perhaps TreeList should be changed to extend GapVector.

A more abstract way to think of it: A Node needs to be a pair of a NodeManager and an index that is managed by the NodeManager. The actual underlying storage is in a TreeList, but since indexes in a TreeList change on updates, the actual Node indexes are indexes in the NodeManager. Each time you read a property of a node, you use the node's index, which is an index in the NodeManager. You use that index in the NodeManager position array, which gives us an index in the TreeList, and get the value from the latter. To update a node, we have to similarly dereference the index in the NodeManager to get an index the the TreeLists's data array, and update the latter. That may require things to move around in the TreeList, so the indexes in the NodeManager have to be updated.

Moving nodes from one document or fragment to another is tricky. The reason is that node indexes are relative to a TreeList. One solution is to use forwarding pointers. Another is a NodeManger that can handle multiple TreeLists.

Updating XML databases

Updating a XML files or a database is more complicated. One approach is reading an XML document, updating nodes in-memory, and writing out the modified document. That is practical for modest-sized XML documents, but expensive for small changes to large documents. Another issue is that it is difficult (but not impossible) to maintain node identity between the original document and the updated version, even for nodes that are unmodified.

Ideally, one would like to modify individial nodes in-place in the database. Thsi is doable in the Kawa node model. The basic idea is to create a AbstractSequence sub-class, which we might call DatabaseDocument. The DatabaseDocument would be a proxy for either the entire database, or an individual xml document. Each node has a database key. The DatabaseDocument object manages the mapping between position indexes and database keys.

Note there are positions of the Qexo run-time that assume nodes are implemented using NodeTree. They would have to be fixed to support general AbstractSequences.

Of course once one is updating a database we also have to deal with transactions and related ACID issues.

Tags: