Title

XML reader syntax

Author

Per Bothner <per@bothner.com>

Status

This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. To provide input on this SRFI, please mail to <srfi minus 107 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.

Received: 2012/11/03
Draft: 2012/11/10-2013/01/10
Revision: 2013/02/04
Revision: 2013/11/03

Abstract

We specify a reader extension that reads data in a superset of XML/HTML format, and produces conventional S-expressions. We also suggest a possible semantic interpretation of how these forms may be evaluated to produce XML-node values, but this is non-normative.

Rationale

While XML may be a poor re-invention of S-expressions, many people are familiar with it. Furthermore, when working with XML or HTML data, using XML syntax may be preferable to S-expressions. This specification defines a Scheme reader extension matching XML syntax with expression escapes (unquote), a translation into standard S-expressions, and a semantics for the latter.

Some other programming languages also define a syntax for XML literals. Examples include ECMAScript for XML (E4X), Visual Basic, XQuery, and Scala.

Here is a simple example:

#<p>The result is <b>final</b>!</p>

This is reader sugar equivalent to the S-expression:

($xml-element$ () ($resolve-qname$ p) "The result is "
 ($xml-element$ () ($resolve-qname$ b) "final") "!")

One use case for this syntax is as a standard data representation (interchange format) for XML values; one can either use the (relatively) human-readable syntax or the equivalent de-sugared S-expressions.

When used inside a program, the assumption is that such expressions will be evaluated in the context of a definition for $xml-element$ and other forms in this specification. The definition of $xml-element$ is not formally part of this specification, and there may be different libraries that provide multiple possible implementations. For example:

The context may be an HTTP server, and the effect of evaluating the $xml-element$ form is to assemble the XML or HTML response to an HTTP request, perhaps by making SAX-style calls to an implicit output port.
Alternatively, an $xml-element$ form is a constructor for an element node object similar to W3C Document Object Model (DOM). This use case subsumes the former, since it is always possible to print out (serialize) an element node after it has been constructed. However, the former is probably more efficient.

The syntax provides the functionality of quasi-literals since they can contain enclosed expressions, which are unquoted:

#<em>The total is &[result].</em>

Notice the use of &. XML uses it for character and entity references; we use it as a multi-purpose prefix character to avoid adding extra special characters that might need escaping. The Scheme reader turns the above into:

($xml-element$ () ($resolve-qname$ em) "The total is " $<<$ result $>>$ ".")

The value of result is substituted into the output, in a similar way to quasi-quotation. The special symbols $<<$ and $>>$ allow the implementation of $xml-element$ to differentiate between literal text and a string literal in an enclosed expression. For example, some XML processors distinguish between text nodes and atomic string values. (The same convention is used in SRFI-108 and SRFI-109.)
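As a rough illustration of how an implementation of $xml-element$ might use the two delimiters, the following sketch walks a content list and tags each item as literal text or as the value of an enclosed expression. It assumes an R7RS system and binds $<<$ and $>>$ to unique objects; the helper classify-content is hypothetical and not part of this specification.

```scheme
(import (scheme base))

;; Assumption: the markers are bound to unique objects.
(define $<<$ (list '$<<$))
(define $>>$ (list '$>>$))

;; classify-content (hypothetical helper): tags each content item as
;; (text . item) or (value . item), and drops the markers themselves.
(define (classify-content items)
  (let loop ((items items) (enclosed? #f) (result '()))
    (cond ((null? items) (reverse result))
          ((eq? (car items) $<<$) (loop (cdr items) #t result))
          ((eq? (car items) $>>$) (loop (cdr items) #f result))
          (else
           (loop (cdr items)
                 enclosed?
                 (cons (cons (if enclosed? 'value 'text) (car items))
                       result))))))

(classify-content (list "The total is " $<<$ "42" $>>$ "."))
;; ⟹ ((text . "The total is ") (value . "42") (text . "."))
```

An implementation that does not distinguish text nodes from atomic string values could simply discard the markers.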

Discussion: (Not part of this specification, but perhaps a future specification.) The XML data model distinguishes between a document node and a document element. A document element is just an XML element node that is the top-level element in a document. A document node is a special kind of node whose primary child is the document element, but may have other children (comments and processing instructions) and DTD properties such as a public identifier. This specification provides a syntax for creating XML elements, but does not provide a mechanism for representing whole XML documents (i.e. document nodes and their properties). A possible way to create document values is to use a SRFI-108 named literal. For example:

&xml{<!DOCTYPE html>
<html>
<body>Hello &[name]!</>
</html>}

One could also support more structured prefix arguments:

&xml[version: 1.1 encoding: "UTF-8" standalone: #t
  doctype: "HTML"
  public: "-//W3C//DTD HTML 4.01 Transitional//EN"]
{
<html>...</>
}

Or just extend xml-literal:

#<!DOCTYPE html><html>
<body>Hello &[name]!</>
</html>

and/or:

#<?xml?><html>
<body>Hello &[name]!</>
</html>

Specification

Syntax

An xml-literal is usually an element constructor. We'll cover later the less common processing instruction, comment, and CDATA-section forms.

xml-literal ::= # xml-constructor

xml-constructor ::= xml-element-constructor
  | xml-PI-constructor
  | xml-comment-constructor
  | xml-CDATA-constructor

Qualified names

The names of elements and attributes are qualified names (QNames). The lexical syntax for a QName is either a simple name, or a (prefix,local-name) pair. Specifically:

QName ::= xml-local-part
   | xml-prefix : xml-local-part
xml-local-part ::= NCName
xml-prefix ::= NCName

An NCName is similar to a Scheme identifier, but with restrictions as defined in the XML namespaces specification. An implementation without full Unicode support may restrict NCName to the following syntax:

NCName ::= letter (letter | digit | hyphen | underscore | period)^*
hyphen ::= -
underscore ::= _
period ::= .

As a matter of style, programs are recommended to limit NCName to the above, though an implementation should allow the full XML NCName syntax.
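The restricted NCName syntax above can be checked with a small predicate. This is a minimal sketch assuming R7RS and taking "letter" and "digit" to mean the ASCII ranges; the name restricted-ncname? is hypothetical.

```scheme
(import (scheme base))

;; ASCII-only tests, so the result does not depend on Unicode support.
(define (ascii-letter? c)
  (or (and (char<=? #\a c) (char<=? c #\z))
      (and (char<=? #\A c) (char<=? c #\Z))))

(define (ascii-digit? c)
  (and (char<=? #\0 c) (char<=? c #\9)))

;; restricted-ncname? (hypothetical name): #t if str matches
;; letter (letter | digit | hyphen | underscore | period)*
(define (restricted-ncname? str)
  (let ((len (string-length str)))
    (and (> len 0)
         (ascii-letter? (string-ref str 0))
         (let loop ((i 1))
           (or (= i len)
               (let ((c (string-ref str i)))
                 (and (or (ascii-letter? c)
                          (ascii-digit? c)
                          (memv c '(#\- #\_ #\.)))
                      (loop (+ i 1)))))))))

(restricted-ncname? "xml-local.part_1")  ; ⟹ #t
(restricted-ncname? "1abc")              ; ⟹ #f
(restricted-ncname? "a:b")               ; ⟹ #f
```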

Sometimes one needs to calculate the QName at runtime, evaluating an expression instead of using a literal QName:

xml-name-form ::= QName
  | xml-enclosed-expression
xml-enclosed-expression ::=
    [ expression^* ]
  | ( expression⁺ )

The first variant is the general case; the second variant (expression⁺) is just syntactic sugar for [(expression⁺)]. For example, the following forms are equivalent:

#<[(if be-bold 'strong 'em)]>important</>
#<(if be-bold 'strong 'em)>important</>

Evaluating the expression (in the first variant) must yield a QName value. While this specification does not define an API or representation for QName values, a QName is conceptually an object with three string components: the local name part, the prefix part, and the namespace URI part. The local name and the prefix parts match the parts in a literal QName, while the namespace URI part is an arbitrary globally unique string. Two QNames are considered equivalent if they have the same local name part and namespace URI part, even if the prefix parts are different. The prefix is used for input and output; it can be considered a local nickname for a namespace URI. The binding from a prefix to a namespace URI can be defined using an xml-namespace-declaration-attribute. An implementation may also define such bindings using Scheme code; for example, Kawa has a define-namespace form.

A symbol is considered equivalent to a QName whose local name part is the string name of the symbol, and whose prefix and namespace URI are both empty, as long as the name of the symbol matches the syntax of identifier and does not contain a colon. The result is implementation-defined if a symbol's name contains a colon.
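The equivalence rules for QNames and symbols can be sketched as follows. The record type and the procedures ->qname and qname=? are hypothetical, chosen only for illustration; this specification does not define a QName representation. The argument order of make-qname follows the 3-argument form used in the Implementation section.

```scheme
(import (scheme base))

;; Hypothetical representation: a record with three string fields.
(define-record-type qname
  (make-qname local-name prefix namespace-uri)
  qname?
  (local-name qname-local-name)
  (prefix qname-prefix)
  (namespace-uri qname-namespace-uri))

;; Coerce a symbol to the QName it is considered equivalent to:
;; empty prefix and empty namespace URI.
(define (->qname x)
  (if (symbol? x)
      (make-qname (symbol->string x) "" "")
      x))

;; Equivalence compares local name and namespace URI, ignoring the prefix.
(define (qname=? a b)
  (let ((a (->qname a)) (b (->qname b)))
    (and (string=? (qname-local-name a) (qname-local-name b))
         (string=? (qname-namespace-uri a) (qname-namespace-uri b)))))

(qname=? (make-qname "a" "h" "http://www.w3.org/1999/xhtml")
         (make-qname "a" "xhtml" "http://www.w3.org/1999/xhtml"))  ; ⟹ #t
(qname=? 'em (make-qname "em" "" ""))                              ; ⟹ #t
(qname=? 'em (make-qname "em" "h" "http://www.w3.org/1999/xhtml")) ; ⟹ #f
```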

Element constructors

xml-element-constructor ::=
    < QName xml-attribute^* > initial-ignored^? xml-element-datum^* </ QName >
  | < xml-name-form xml-attribute^* > initial-ignored^? xml-element-datum^* </>
  | < xml-name-form xml-attribute^* />

The first xml-element-constructor variant uses a literal QName, and looks like a standard non-empty XML element, where the starting QName and the ending QName must match exactly:

#<a href="next.html">Next</a>

As a convenience, you can leave out the name in the end-tag:

#<para>This is a paragraph in <emphasis>DocBook</> syntax.</>

You can use an expression to compute the name in the start-tag at runtime; in that case you must leave out the name in the end-tag:

#<p>This is <[(if be-bold 'strong 'em)]>important</>!</p>

The third xml-element-constructor variant above is an XML “empty element”; it is equivalent to the second variant when there are no xml-element-datum items.

(Note that every well-formed XML element, as defined in the XML specifications, is a valid xml-element-constructor, but not vice versa.)

Element contents (children)

The “contents” (children) of an element are a sequence of character (text) data, nested nodes, and enclosed (unquoted) expressions. The latter are discussed later.

xml-element-datum ::=
    any character except &, or <.
  | xml-constructor
  | xml-escaped

The characters & and < are special and need to be escaped.

The character > does not have to be escaped, but it is good style to always do so, as it makes it easier to visually distinguish it from markup. (The MicroXML proposal does not even allow unquoted >.) The XML and HTML 4.x standards do not allow the literal text ]]> in element content, for historical reasons of SGML-compatibility. For this reason an implementation of this specification may warn if literal ]]> is seen.

A nested xml-constructor is functionally equivalent to an xml-literal (i.e. the xml-constructor prefixed by a #) inside an enclosed expression. For example:

#<p>This is <em>important</em>!</p>

is equivalent to:

#<p>This is &[#<em>important</em>]!</p>

xml-escaped ::=
    & xml-enclosed-expression
  | & xml-entity-name ;
  | xml-character-reference
  | special-escape

Character and entity references

xml-character-reference ::=
    &# digit⁺ ;
  | &#x hex-digit⁺ ;
xml-entity-name ::= NCName

Here is an example with both hex and decimal character references:

#<p>A&#66;C&#x44;E</p>  ⟹  <p>ABCDE</p>
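The numeric forms can be decoded with standard procedures. The following is a minimal sketch, assuming R7RS; char-reference->string is a hypothetical helper that receives the text between "&#" and ";".

```scheme
(import (scheme base))

;; char-reference->string (hypothetical helper): takes the text between
;; "&#" and ";" and returns the referenced character as a string.
(define (char-reference->string body)
  (let ((code (if (and (> (string-length body) 0)
                       (char=? (string-ref body 0) #\x))
                  ;; hexadecimal form: &#x hex-digit+ ;
                  (string->number (substring body 1 (string-length body)) 16)
                  ;; decimal form: &# digit+ ;
                  (string->number body 10))))
    (if (and code (exact-integer? code) (>= code 0))
        (string (integer->char code))
        (error "malformed character reference" body))))

(string-append "A" (char-reference->string "66")
               "C" (char-reference->string "x44") "E")
;; ⟹ "ABCDE"
```

A real reader would also reject code points that are not valid XML characters; that check is omitted here.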

An implementation must support the built-in XML names for xml-entity-name: lt, gt, amp, quot, and apos, which stand for the characters <, >, &, ", and ', respectively. An implementation should also support the standard XML entity names (though resource-limited or non-Unicode-based implementations are not required to), and should also support the standard R7RS character names tab, newline, return, and space. An implementation may support the R7RS character names null, alarm, backspace, escape, and delete, though these are not valid XML 1.0 characters. The following two expressions are equivalent:

#<p>&lt; &gt; &amp; &quot; &apos;</p>
#<p>&["< > & \" '"]</p>

Indentation and line-endings

SRFI-109 (extended string quasi-literals) uses the same style of escape sequences, prefixed by &. The following SRFI-109 features are optional for SRFI-107 implementations; however, an implementation that provides both SRFI-109 and SRFI-107 should provide these convenience features in attribute and element content:

line-continuation (using &-);
indentation handling (using &|);
comments (using &#|comment|#); and
(optionally) format specifiers (as in &~,2f[balance-due]).

special-escape ::=
    intraline-whitespace &|
  | & nested-comment
  | &- intraline-whitespace line-ending
initial-ignored ::=
    intraline-whitespace line-ending intraline-whitespace &|

For discussion of these features see the Indentation and line-endings and Embedded comments sections of SRFI-109.

Attributes

An attribute associates an attribute name with an attribute value. This is done using an xml-true-attribute form, which is an xml-attribute that does not have the form of xml-namespace-declaration-attribute. That is, in an xml-true-attribute the attribute name may not be the special reserved name xmlns, nor may it be a QName whose prefix is the special reserved name xmlns.

xml-attribute ::=
    xml-true-attribute
  | xml-namespace-declaration-attribute

A true attribute has the form name=value. It can also be an enclosed expression that evaluates to an attribute node value.

xml-true-attribute ::=
    xml-name-form = xml-attribute-value
  | xml-enclosed-expression
xml-attribute-value ::=
    " quot-attribute-datum^* "
  | ' apos-attribute-datum^* '
  | [ expression^* ]
  | ( expression⁺ )

quot-attribute-datum ::=
    any character except ", &, or <.
  | xml-escaped
apos-attribute-datum ::=
    any character except ', &, or <.
  | xml-escaped

Namespace declarations

An xml-prefix is an alias for a namespace-uri, and the mapping between them is defined by a namespace declaration attribute, which has the form of an xml-attribute where either the QName or the prefix is the special identifier xmlns:

xml-namespace-declaration-attribute ::=
    xmlns: xml-prefix = xml-attribute-value
  | xmlns= xml-attribute-value

The former declares xml-prefix as a namespace alias for the namespace-uri specified by xml-attribute-value (which should be a compile-time constant, though an implementation may allow a general string-valued expression). The second declares that xml-attribute-value is the default namespace for unprefixed element names. (A default namespace declaration is ignored for attribute names.)

Processing instructions

An xml-PI-constructor can be used to create an XML processing instruction, which can be used to pass instructions or annotations to an XML processor or tool.

xml-PI-constructor ::= <? xml-PI-target xml-PI-content ?>
xml-PI-target ::= NCName
xml-PI-content ::= any characters, not containing ?>.

For example, the DocBook XSLT stylesheets can use the dbhtml instructions to specify that a specific chapter should be written to a named HTML file:

#<chapter><?dbhtml filename="intro.html" ?>
<title>Introduction</title>
...
</chapter>

XML comments

You can cause XML comments to be emitted in the XML output document. Such comments can be useful for humans reading the XML document, but are usually ignored by programs.

xml-comment-constructor ::= <!-- xml-comment-content -->
xml-comment-content ::= any characters, not containing --.

CDATA sections

A CDATA section can be used to avoid excessive quoting in element content.

xml-CDATA-constructor ::= <![CDATA[ xml-CDATA-content ]]>
xml-CDATA-content ::= any characters, not containing ]]>.

A CDATA section is semantically equivalent to text consisting of the xml-CDATA-content, though some implementations may record that the text came from a CDATA section so it can be written out the same way.

The following are equivalent:

#<p>Special characters <![CDATA[< > & ' "]]> here.</p>
#<p>Special characters &lt; &gt; &amp; &apos; &quot; here.</p>

Translation into core S-expressions

The following specifies how the reader syntax is translated by the reader into standard S-expressions. These basically create macro invocations; the implementation is responsible for implementing those macros as described in the Semantics section. As an example:

#<a class="title">Result: &[sum].</a>

is read as if it were:

($xml-element$ () ($resolve-qname$ a)
  ($xml-attribute$ 'class "title")
  "Result: " $<<$ sum $>>$ ".")

The () in the result is the translation of any namespace declaration attributes - in this case none. Here is an example with namespace declarations:

#<prefix2:a
   xmlns:prefix1="URI1"
   xmlns:prefix2="URI&foo;2"
   xmlns="DURI">...</prefix2:a>

This is read as:

($xml-element$ ((prefix1 "URI1")
                (prefix2 "URI" $entity$:foo "2")
                (|| "DURI"))
               ($resolve-qname$ a prefix2) ...)

The translation is defined in terms of a recursive read-time translation function Tr which maps an xml-constructor to an S-expression.

Tr[< QName xml-attribute^* > xml-element-datum^* </ QName >]
   ⟾ Tr[< QName xml-attribute^* > xml-element-datum^* </>]
Tr[< QName xml-attribute^* />]
   ⟾ Tr[< QName xml-attribute^* ></>]
Tr[< xml-name-form xml-attribute^* > xml-element-datum^* </>]
   ⟾ ($xml-element$ (TrNamespaceDecl[xml-attribute]^* ) TrElementName[xml-name-form] TrAttr[xml-attribute]^* TrContent[xml-element-datum]^* )

TrContent is as in SRFI-109, except we add this rule:

TrContent[xml-constructor]
   ⟾ Tr[xml-constructor]

TrAttr[xml-namespace-declaration-attribute]
   ⟾ #|nothing|#
TrAttr[xml-enclosed-expression]
  ⟾ xml-enclosed-expression
TrAttr[xml-name-form = xml-attribute-value]
   ⟾ ($xml-attribute$ TrAttrName[xml-name-form] TrAttrValue[xml-attribute-value] )
TrAttrValue[" quot-attribute-datum^* "]
   ⟾ TrContent[quot-attribute-datum]^*
TrAttrValue[' apos-attribute-datum^* ']
   ⟾ TrContent[apos-attribute-datum]^*
TrAttrValue[[ expression^* ]]
   ⟾ expression^*
TrAttrValue[( expression⁺ )]
   ⟾ ( expression⁺ )

The namespace declarations are translated to a list of namespace-bindings, by default the empty list. There is a sub-list for each namespace-binding, where the first element is the prefix being bound, and the remaining elements (usually just a single string literal) are expressions whose values are concatenated to form the namespace URI. The prefix is a symbol; in the case of a default element namespace, the prefix is either the empty symbol (||) or equivalently the reserved prefix name $default-element-namespace$.

TrNamespaceDecl[xml-true-attribute]
   ⟾ #|nothing|#
TrNamespaceDecl[xmlns: xml-prefix = xml-attribute-value]
   ⟾ ( xml-prefix TrAttrValue[xml-attribute-value] )
TrNamespaceDecl[xmlns= xml-attribute-value]
   ⟾ (||  TrAttrValue[xml-attribute-value] )

Element (tag) names are translated by TrElementName, while attribute names are translated by TrAttrName. Both delegate to TrElementOrAttrName except when the name is a plain identifier without a namespace prefix: in that case an attribute name defaults to the empty namespace, while an element name defaults to the current default element namespace (indicated by $default-element-namespace$).

TrElementName[identifier]
   ⟾ ($resolve-qname$ identifier )
TrAttrName[identifier]
   ⟾ (quote identifier )
TrAttrName[other-form]
   ⟾ TrElementOrAttrName[other-form]
TrElementName[other-form]
   ⟾ TrElementOrAttrName[other-form]
TrElementOrAttrName[prefix:local-name]
   ⟾ ($resolve-qname$ local-name prefix )
TrElementOrAttrName[( expression⁺ )]
   ⟾ ( expression⁺ )
TrElementOrAttrName[[ expression ]]
   ⟾ expression
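For literal QNames, the name rules above can be sketched as an ordinary procedure that a reader might call. This is an illustration assuming R7RS; translate-name and colon-index are hypothetical helpers, and the computed-name forms (enclosed expressions) are not covered.

```scheme
(import (scheme base))

;; Index of the first colon in str, or #f if there is none.
(define (colon-index str)
  (let loop ((i 0))
    (cond ((= i (string-length str)) #f)
          ((char=? (string-ref str i) #\:) i)
          (else (loop (+ i 1))))))

;; translate-name (hypothetical): literal QName text -> S-expression.
;; element? selects TrElementName (#t) or TrAttrName (#f).
(define (translate-name text element?)
  (let ((i (colon-index text)))
    (cond (i
           ;; prefix:local-name, same result for elements and attributes
           (list '$resolve-qname$
                 (string->symbol (substring text (+ i 1) (string-length text)))
                 (string->symbol (substring text 0 i))))
          (element?
           (list '$resolve-qname$ (string->symbol text)))
          (else
           (list 'quote (string->symbol text))))))

(translate-name "em" #t)         ; ⟹ ($resolve-qname$ em)
(translate-name "class" #f)      ; ⟹ (quote class)
(translate-name "prefix2:a" #t)  ; ⟹ ($resolve-qname$ a prefix2)
```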

The special node constructors are translated similarly: (Note: This is simplified, since these forms should not handle escape characters the way element and attribute content does.)

Tr[<![CDATA[xml-CDATA-content]]>]
   ⟾ ($xml-CDATA$ "xml-CDATA-content")
Tr[<!--xml-comment-content-->]
   ⟾ ($xml-comment$ "xml-comment-content")
Tr[<? xml-PI-target xml-PI-content ?>]
   ⟾ ($xml-processing-instruction$ "xml-PI-target" TrContent[xml-PI-content])

Semantics

The above translation maps the new reader syntax to S-expressions using macros specified in this section. Of course it is possible to write these macro forms directly, though they are less human-readable. However, code generators and macros may target these macros. This format can also be used as an interchange format. When we say below that an expression "creates an element node" we mean that it creates a representation of an element value. The default implementation should create a unique object, of a sub-type of XML-node. However, a keyword such as $xml-element$ may be bound to a user-defined macro, in which case the element value may be something very different and perhaps ephemeral, such as a network encoding.

The specification does not define a Scheme API for working with XML data. It assumes there is some data type which we here call an XML-node. This specification does not require the XML-node type to be distinct from other types. Many Scheme XML libraries just use lists to encode XML-nodes. However, newer Schemes that have an extensible type system are encouraged to make XML-node a distinct type. This follows the W3C Document Object Model (DOM).

Constructors and other bindings

($xml-element$ ( namespace-binding^* ) name attribute^* content^* )

Creates an element node.

Each namespace-binding is a list of the form (prefix namespace-uri-part⁺). The prefix is a literal symbol that represents a namespace prefix; a zero-length symbol is equivalent to the symbol $default-element-namespace$ . Each namespace-uri-part is an expression that evaluates to a string; the namespace-URI is the concatenation of the parts. Normally there will be a single namespace-uri-part that is a literal string; there may also be entity references. An implementation may allow non-literal expressions, but is not required to.

The name is an expression that evaluates to a symbol or a QName, most commonly a quoted symbol or a $resolve-qname$ form.

The binding for $xml-element$ must be a macro, not a function, because each namespace-binding adds a (prefix,URI)-binding in the lexical context. That binding is used to evaluate QNames in the remaining parameters, which are all expressions, including the name.

Each attribute is usually an $xml-attribute$ form, but an implementation may support other expressions that evaluate to attribute nodes. Each content is an expression that evaluates to element content, handled as described in the Handling of enclosed expressions section.

($xml-attribute$ name content^* )

Creates an attribute node from the parameters. The name is an expression that evaluates to a symbol or QName value. The content arguments are concatenated to produce the attribute value.

($resolve-qname$ local-name [prefix])

Resolves the given prefix/local-name pair to a QName value, depending on the currently active namespace bindings. Both arguments are literal unquoted symbols. If prefix is missing it defaults to $default-element-namespace$.

($xml-comment$ content^* )

($xml-CDATA$ content^* )

($xml-processing-instruction$ xml-PI-target content^* )

Creates a comment, CDATA, or processing instruction (PI) node. The xml-PI-target should be a string that matches NCName. The content arguments should be strings, which are concatenated.

$entity$:xml-entity-name

The xml-entity-name is an unquoted symbol. Returns a string value matching the entity name. For example:

$entity$:lt ⟹ "<"
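One simple way to provide the five built-in entities is to bind each name to a string, as in the sketch below. This assumes an R7RS reader that treats a colon as an ordinary identifier character; an implementation that gives colons special meaning would need a different mechanism.

```scheme
(import (scheme base))

;; Assumption: each entity is an ordinary variable holding a string.
(define $entity$:lt "<")
(define $entity$:gt ">")
(define $entity$:amp "&")
(define $entity$:quot "\"")
(define $entity$:apos "'")

;; Entity values concatenate with surrounding literal text.
(string-append "URI" $entity$:amp "2")  ; ⟹ "URI&2"
```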

$<<$
$>>$

These serve to delimit enclosed expressions, but are otherwise ignored in content. A simple implementation is to bind them to unique objects.

Handling of enclosed expressions

Both element content and attribute values may contain xml-enclosed-expressions. These are expressions evaluated at runtime, where the evaluated result becomes part of the element content or the attribute value.

If the expression evaluates to an element, comment, or processing-instruction node, and the context is element content, then the node is added as a child of the element. It is unspecified if the node is copied or shared. It is also unspecified if the expression result is some other kind of XML-node, or the context is an attribute value.

If the expression evaluates to a string, the result is pasted as a text (child) content of an element or a substring of an attribute value, respectively.

If the expression evaluates to a CDATA segment, the result is equivalent to the string value of the segment.

If the expression evaluates to some other scalar value (including numbers, booleans, and characters) the value is converted to a string according to implementation-specified rules. An implementation may convert a value as if using display. Alternatively, an implementation may convert a value to yield a canonical representation according to the XML Schema specification. (In the latter case, booleans #f and #t should yield false and true, respectively.)

If the expression evaluates to a list or vector, then each element is inserted into the element or attribute content. Spaces are inserted between two elements if neither element is an XML-node.

Note that some XML specifications (including XML Schema and the XQuery and XPath data model) have the concept of typed value of a node. The typed value may be a number, a string, or another atomic type. The typed value may also be a sequence of strings, numbers, or other atomic values. Some implementations may optionally store the typed value instead of or in addition to the text value. For example:

#<prices>&(vector 230 599 98 763)</prices>

It is undefined whether the XML-node stores the content as a sequence of four integers or as the string "230 599 98 763", as long as the result prints the same way.
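The conversion rules for strings, scalars, lists, and vectors can be sketched as a procedure that flattens a value to text. This is only an illustration, assuming R7RS, proper lists, and conversion of scalars as if by display; it does not model XML-node values, and content->string and scalar->string are hypothetical names.

```scheme
(import (scheme base) (scheme write))

;; Convert a scalar as if using display (one of the permitted choices).
(define (scalar->string x)
  (let ((port (open-output-string)))
    (display x port)
    (get-output-string port)))

;; content->string (hypothetical): flattens a value to text, separating
;; the elements of a list or vector with single spaces.
(define (content->string x)
  (cond ((string? x) x)
        ((vector? x) (content->string (vector->list x)))
        ((null? x) "")
        ((pair? x)
         (let loop ((rest (cdr x)) (acc (content->string (car x))))
           (if (null? rest)
               acc
               (loop (cdr rest)
                     (string-append acc " " (content->string (car rest)))))))
        (else (scalar->string x))))

(content->string (vector 230 599 98 763))  ; ⟹ "230 599 98 763"
```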

Output of XML nodes

If XML-node is a separate data-type, implementations are encouraged to use this XML-literal format when writing to an output port, since this provides input-output round-tripping. Specifically, calling write on an XML-node should write an xml-literal (with an initial #). Calling display on an XML-node should write an xml-constructor (without an initial #). The xml-constructor should be in standard XML syntax without using any of the extensions in this specification, such as an unnamed end-tag, or an unescaped ]]>. In fact, it is strongly recommended that if > appears in element or attribute content it should be written as the escaped form &gt;. In addition, control characters (in attribute content also including newline or tab) should be escaped using character references.

Alternatively, for display (not write), if the output port is an extended port that can handle rich text then an implementation may instead display a styled representation. For example if the XML-node is compatible with HTML, and the output port is inserting text into a browser document, then the implementation may copy the DOM into the browser, perhaps resulting in styled text.
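The escaping recommended above for plain XML output of element content can be sketched as follows. The procedure escape-text is a hypothetical helper, assuming R7RS; it handles only &, <, and >, and leaves out the character-reference escaping of control characters.

```scheme
(import (scheme base))

;; escape-text (hypothetical): escapes &, < and > in element content.
;; Escaping every > also guarantees that a literal ]]> is never written.
(define (escape-text str)
  (let ((port (open-output-string)))
    (string-for-each
     (lambda (c)
       (case c
         ((#\&) (write-string "&amp;" port))
         ((#\<) (write-string "&lt;" port))
         ((#\>) (write-string "&gt;" port))
         (else (write-char c port))))
     str)
    (get-output-string port)))

(escape-text "a < b && b > c")  ; ⟹ "a &lt; b &amp;&amp; b &gt; c"
```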

Implementation

The implementation is necessarily non-portable, though the Translation section provides a template for the reader part.

The Kawa Scheme implementation has working support for this reader extension.

Handling namespaces

Implementing the translated forms should mostly be obvious: just call an appropriate function to create the XML-node. Handling namespace definitions is non-obvious, however. The form:

($xml-element$ ((prefix namespace-uri-part⁺)^*) name attribute^* content^*)

can be translated into something like:

(let ()
  (define-namespace prefix (string-append namespace-uri-part⁺))^*
  (make-element name attribute^* content^*))

The translation is more complicated if in-scope namespace bindings are part of the run-time properties of the constructed element. In that case the set of bindings needs to be passed to the make-element function.

The initial environment has predefined:

(define-namespace $default-element-namespace$ "")

A simple way to implement define-namespace is to expand:

(define-namespace prefix namespace-uri)

to:

(define $namespace$:prefix namespace-uri)

In that case:

($resolve-qname$ local-name prefix)

could be implemented as:

(make-qname local-name prefix $namespace$:prefix)

assuming a 3-argument make-qname function that creates a QName with the given local-name, prefix, and namespace-uri. Implementations should provide a custom error message in the case $namespace$:prefix is undefined, rather than depend on a generic error message.
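As an alternative illustration, prefix resolution can also be modelled at run time with an association list rather than lexical definitions. This is a sketch under stated assumptions, not the approach described above: the procedures bind-namespace and resolve-qname are hypothetical, bindings map prefix symbols to namespace URI strings with the innermost binding first, and a QName is shown as a plain list of three strings.

```scheme
(import (scheme base))

;; Assumption: the initial environment binds the default element
;; namespace to the empty string.
(define initial-bindings '(($default-element-namespace$ . "")))

;; Extend the bindings as a namespace declaration attribute would.
(define (bind-namespace bindings prefix uri)
  (cons (cons prefix uri) bindings))

;; resolve-qname (hypothetical run-time analogue of $resolve-qname$):
;; returns (local-name prefix namespace-uri) as a list of strings,
;; and signals a specific error for an unbound prefix.
(define (resolve-qname bindings local-name prefix)
  (let ((entry (assq prefix bindings)))
    (if entry
        (list (symbol->string local-name)
              (if (eq? prefix '$default-element-namespace$)
                  ""
                  (symbol->string prefix))
              (cdr entry))
        (error "unbound XML namespace prefix" prefix))))

(define bindings (bind-namespace initial-bindings 'prefix2 "URI2"))

(resolve-qname bindings 'a 'prefix2)
;; ⟹ ("a" "prefix2" "URI2")
(resolve-qname bindings 'p '$default-element-namespace$)
;; ⟹ ("p" "" "")
```

Because assq returns the first match, a binding added for an inner element shadows an outer binding of the same prefix.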

Test suite

There is a test suite in the Kawa source tree.

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Author: Per Bothner

Editor: Mike Sperber