Per Bothner <per@bothner.com>
This SRFI is currently in "draft" status. To see an explanation of each status that a SRFI can hold, see here.
To provide input on this SRFI, please mail to <srfi minus 107 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.
We specify a reader extension that reads data in a superset of XML/HTML format, and produces conventional S-expressions. We also suggest a possible semantic interpretation of how these forms may be evaluated to produce XML-node values, but this is non-normative.
While XML may be a poor re-invention of S-expressions, many people are familiar with it. Furthermore, when working with XML or HTML data, using XML syntax may be preferable to S-expressions. This specification defines a Scheme reader extension matching XML syntax with expression escapes (unquote), a translation into standard S-expressions, and a semantics for the latter.
Some other programming languages also define a syntax for XML literals. Examples include EcmaScript for XML (E4X), Visual Basic, XQuery, and Scala.
Here is a simple example:
#<p>The result is <b>final</b>!</p>
This is reader sugar equivalent to the S-expression:
($xml-element$ () ($resolve-qname$ p) "The result is " ($xml-element$ () ($resolve-qname$ b) "final") "!")
One use case for this syntax is as a standard data representation (interchange format) for XML values; one can either use the (relatively) human-readable syntax or the equivalent de-sugared S-expressions.
When used inside a program, the assumption is that such expressions will be evaluated in the context of a definition for $xml-element$ and the other forms in this specification. The definition of $xml-element$ is not formally part of this specification, and different libraries may provide different implementations. For example:
One possible purpose of the $xml-element$ form is to assemble the XML or HTML response to an HTTP request, perhaps by making SAX-style calls to an implicit output port.
Alternatively, the $xml-element$ form is a constructor for an element node object similar to the W3C Document Object Model (DOM). This use case subsumes the former, since it is always possible to print out (serialize) an element node after it has been constructed. However, the former is probably more efficient.
The syntax provides the functionality of quasi-literals, since literals can contain enclosed expressions, which are unquoted:
#<em>The total is &[result].</em>
Notice the use of &, which is used in XML for character and entity references, but we use it as a multi-purpose prefix character to avoid adding extra special characters that might need escaping. The Scheme reader turns the above into:
($xml-element$ () ($resolve-qname$ em) "The total is " $<<$ result $>>$ ".")
The value of result is substituted into the output, in a similar way to quasi-quotation. The special symbols $<<$ and $>>$ allow the implementation of $xml-element$ to differentiate between literal text and a string literal in an enclosed expression. For example, some XML processors distinguish between text nodes and atomic string values. (The same convention is used in SRFI-108 and SRFI-109.)
Discussion: (Not part of this specification, but perhaps a future specification.) The XML data model distinguishes between a document node and a document element. A document element is just an XML element node that is the top-level element in a document. A document node is a special kind of node whose primary child is the document element, but may have other children (comments and processing instructions) and DTD properties such as a public identifier. This specification provides a syntax for creating XML elements, but does not provide a mechanism for representing whole XML documents (i.e. document nodes and their properties). A possible way to create document values is to use a SRFI-108 named literal. For example:
&xml{<!DOCTYPE html> <html> <body>Hello &[name]!</> </html>}
One could also support more structured prefix arguments:
&xml[version: 1.1 encoding: "UTF-8" standalone: #t doctype: "HTML" public: "-//W3C//DTD HTML 4.01 Transitional//EN"] { <html>...</> }
Or just extend xml-literal:
#<!DOCTYPE html><html> <body>Hello &[name]!</> </html>
and/or:
#<?xml?><html> <body>Hello &[name]!</> </html>
An xml-literal is usually an element constructor. We'll cover later the less common processing instruction, comment, and CDATA-section forms.
xml-literal ::= # xml-constructor
xml-constructor ::= xml-element-constructor
  | xml-PI-constructor
  | xml-comment-constructor
  | xml-CDATA-constructor
The names of elements and attributes are qualified names (QNames). The lexical syntax for a QName is either a simple name, or a (prefix,local-name) pair. Specifically:
QName ::= xml-local-part
  | xml-prefix : xml-local-part
xml-local-part ::= NCName
xml-prefix ::= NCName
An NCName is similar to a Scheme identifier, but with restrictions as defined in the XML namespaces specification. An implementation without full Unicode support may restrict NCName to the following syntax:
NCName ::= letter (letter | digit | hyphen | underscore | period)*
hyphen ::= -
underscore ::= _
period ::= .
As a matter of style, programs are recommended to limit NCName to the above, though an implementation should allow the full XML NCName syntax.
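The restricted syntax is easy to check mechanically. The following is a minimal sketch in R7RS Scheme and is not part of this specification; the procedure name restricted-ncname? is hypothetical, and letter and digit are approximated by char-alphabetic? and char-numeric?.

```scheme
(import (scheme base) (scheme char))

;; Hypothetical helper: #t if str matches the restricted NCName syntax,
;; i.e. a letter followed by letters, digits, hyphens, underscores, or periods.
(define (restricted-ncname? str)
  (and (> (string-length str) 0)
       (char-alphabetic? (string-ref str 0))
       (let loop ((i 1))
         (or (= i (string-length str))
             (let ((c (string-ref str i)))
               (and (or (char-alphabetic? c)
                        (char-numeric? c)
                        (memv c '(#\- #\_ #\.)))
                    (loop (+ i 1))))))))

(restricted-ncname? "xml-prefix") ; => #t
(restricted-ncname? "2fast")      ; => #f (starts with a digit)
(restricted-ncname? "a:b")        ; => #f (a colon is never part of an NCName)
```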
Sometimes one needs to calculate the QName at runtime, evaluating an expression instead of using a literal QName:
xml-name-form ::= QName | xml-enclosed-expression
xml-enclosed-expression ::= [ expression* ]
  | ( expression+ )
The first variant is the general case; the second variant (expression+) is just syntactic sugar for [(expression+)]. For example, the following forms are equivalent:
#<[(if be-bold 'strong 'em)]>important</>
#<(if be-bold 'strong 'em)>important</>
When evaluating the expression (in the first variant), the result is a QName value. While this specification does not define an API or representation for QName values, a QName is an object with three string components: the local name part, the prefix part, and the namespace URI part. The local name and the prefix parts match the parts in a literal QName, while the namespace URI part is an arbitrary globally unique string. Two QNames are considered equivalent if they have the same local name part and namespace URI part, even if the prefix parts are different. The prefix is used for input and output; it can be considered a local nickname for a namespace URI. The binding from a prefix to a namespace URI can be defined using an xml-namespace-declaration-attribute. An implementation may also define such bindings using Scheme code; for example, Kawa has a define-namespace form.
A symbol is considered equivalent to a QName whose local name part is the string name of the symbol, and whose prefix and namespace URI are both empty, as long as the name of the symbol matches the syntax of an identifier and does not contain a colon. The result is implementation-defined if a symbol's name contains a colon.
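To make the equivalence rule concrete, here is a sketch of one possible QName representation. The record type <qname> and the procedures ->qname and qname-equivalent? are assumptions made for illustration only; this specification does not define a QName API.

```scheme
(import (scheme base))

;; Hypothetical representation: three string components.
(define-record-type <qname>
  (make-qname local-name prefix namespace-uri)
  qname?
  (local-name qname-local-name)
  (prefix qname-prefix)
  (namespace-uri qname-namespace-uri))

;; A symbol stands for a QName with empty prefix and empty namespace URI.
(define (->qname x)
  (if (symbol? x)
      (make-qname (symbol->string x) "" "")
      x))

;; Equivalence compares local name and namespace URI; the prefix is ignored.
(define (qname-equivalent? a b)
  (let ((qa (->qname a))
        (qb (->qname b)))
    (and (string=? (qname-local-name qa) (qname-local-name qb))
         (string=? (qname-namespace-uri qa) (qname-namespace-uri qb)))))

;; Same namespace URI, different prefixes: equivalent.
(qname-equivalent? (make-qname "p" "h" "http://www.w3.org/1999/xhtml")
                   (make-qname "p" "x" "http://www.w3.org/1999/xhtml")) ; => #t
;; A plain symbol is equivalent to the QName with empty prefix and URI.
(qname-equivalent? 'em (make-qname "em" "" ""))                         ; => #t
```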
xml-element-constructor ::=
    < QName xml-attribute* > initial-ignored? xml-element-datum* </ QName >
  | < xml-name-form xml-attribute* > initial-ignored? xml-element-datum* </>
  | < xml-name-form xml-attribute* />
The first xml-element-constructor variant uses a literal QName, and looks like a standard non-empty XML element, where the starting QName and the ending QName must match exactly:
#<a href="next.html">Next</a>
As a convenience, you can leave out the name in the end-tag:
#<para>This is a paragraph in <emphasis>DocBook</> syntax.</>
You can use an expression to compute the name in the start-tag at runtime - in that case you must leave out the name in the end-tag:
#<p>This is <[(if be-bold 'strong 'em)]>important</>!</p>
The third xml-element-constructor variant above is an XML “empty element”; it is equivalent to the second variant when there are no xml-element-datum items. (Note that every well-formed XML element, as defined in the XML specifications, is a valid xml-element-constructor, but not vice versa.)
The “contents” (children) of an element are a sequence of character (text) data, nested nodes, and enclosed (unquoted) expressions. The latter are discussed later.
xml-element-datum ::= any character except & or <
  | xml-constructor
  | xml-escaped
The characters & and < are special and need to be escaped. The character > does not have to be escaped, but it is good style to always do so, as it makes it easier to visually distinguish it from markup. (The MicroXML proposal does not even allow unquoted >.)
The XML and HTML 4.x standards do not allow the literal text ]]> in element content, for historical reasons of SGML-compatibility. For this reason an implementation of this specification may warn if literal ]]> is seen.
A nested xml-constructor is functionally equivalent to an xml-literal (i.e. the xml-constructor prefixed by a #) inside an enclosed expression. For example:
#<p>This is <em>important</em>!</p>
is equivalent to:
#<p>This is &[#<em>important</em>]!</p>
xml-escaped ::= & xml-enclosed-expression
  | & xml-entity-name ;
  | xml-character-reference
  | special-escape
xml-character-reference ::= &# digit+ ;
  | &#x hex-digit+ ;
xml-entity-name ::= NCName
Here is an example with both hex and decimal character references:
#<p>A&#66;C&#x44;E</p> ⟹ <p>ABCDE</p>
An implementation must support the built-in XML names for xml-entity-name: lt, gt, amp, quot, and apos, which stand for the characters <, >, &, ", and ', respectively. An implementation should also support the standard XML entity names (though resource-limited or non-Unicode-based implementations are not required to), and should also support the standard R7RS character names tab, newline, return, and space. An implementation may support the R7RS character names null, alarm, backspace, escape, and delete, though these are not valid XML 1.0 characters.
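The required entity names and the two kinds of character reference can be sketched as follows. This is an illustrative sketch only: the names builtin-entities, entity->string, and char-reference->string are hypothetical, and a real reader would perform this lookup while scanning the literal.

```scheme
(import (scheme base))

;; The five entity names that every implementation must support.
(define builtin-entities
  '((lt . "<") (gt . ">") (amp . "&") (quot . "\"") (apos . "'")))

;; Hypothetical helper: the string for an entity name, or #f if unknown.
(define (entity->string name)
  (let ((entry (assq name builtin-entities)))
    (and entry (cdr entry))))

;; Hypothetical helper: decode the digits of a character reference.
;; Use radix 10 for the &#digits; form and radix 16 for the &#xhex-digits; form.
(define (char-reference->string digits radix)
  (string (integer->char (string->number digits radix))))

(entity->string 'lt)              ; => "<"
(char-reference->string "66" 10)  ; => "B"
(char-reference->string "44" 16)  ; => "D"
```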
The following two expressions are equivalent:
#<p>&lt; &gt; &amp; &quot; &apos;</p>
#<p>&["< > & \" '"]</p>
SRFI-109 (extended string quasi-literals) uses the same style of escape sequences, prefixed by &. The following SRFI-109 features are optional for SRFI-107 implementations; however, an implementation that provides both SRFI-109 and SRFI-107 should provide these convenience features in attribute and element content: line continuation (&-); indentation markers (&|); embedded comments (&#|comment|#); and formatting (&~,2f[balance-due]).
special-escape ::= intraline-whitespace &|
  | & nested-comment
  | &- intraline-whitespace line-ending
initial-ignored ::= intraline-whitespace line-ending intraline-whitespace &|
For discussion of these features see the Indentation and line-endings and Embedded comments sections of SRFI-109.
An attribute associates an attribute name with an attribute value. This is done using an xml-true-attribute form, which is an xml-attribute that does not have the form of an xml-namespace-declaration-attribute. That is, in an xml-true-attribute the attribute name may not be the special reserved name xmlns, nor may it be a QName whose prefix is the special reserved name xmlns.
xml-attribute ::= xml-true-attribute | xml-namespace-declaration-attribute
A true attribute has the form name=value. It can also be an enclosed expression that evaluates to an attribute node value.
xml-true-attribute ::= xml-name-form = xml-attribute-value
  | xml-enclosed-expression
xml-attribute-value ::= " quot-attribute-datum* "
  | ' apos-attribute-datum* '
  | [ expression* ]
  | ( expression+ )
quot-attribute-datum ::= any character except ", &, or <
  | xml-escaped
apos-attribute-datum ::= any character except ', &, or <
  | xml-escaped
An xml-prefix is an alias for a namespace-uri, and the mapping between them is defined by a namespace declaration attribute, which has the form of an xml-attribute where either the QName or the prefix is the special identifier xmlns:
xml-namespace-declaration-attribute ::= xmlns: xml-prefix = xml-attribute-value
  | xmlns = xml-attribute-value
The former declares xml-prefix as a namespace alias for the namespace-uri specified by xml-attribute-value (which should be a compile-time constant, though an implementation may allow a general string-valued expression). The latter declares that xml-attribute-value is the default namespace for unprefixed element names. (A default namespace declaration is ignored for attribute names.)
An xml-PI-constructor can be used to create an XML processing instruction, which can be used to pass instructions or annotations to an XML processor or tool.
xml-PI-constructor ::= <? xml-PI-target xml-PI-content ?>
xml-PI-target ::= NCName
xml-PI-content ::= any characters, not containing ?>
For example, the DocBook XSLT stylesheets can use the dbhtml
instructions to specify that a specific chapter should be
written to a named HTML file:
#<chapter><?dbhtml filename="intro.html" ?> <title>Introduction</title> ... </chapter>
You can cause XML comments to be emitted in the XML output document. Such comments can be useful for humans reading the XML document, but are usually ignored by programs.
xml-comment-constructor ::= <!-- xml-comment-content -->
xml-comment-content ::= any characters, not containing --
A CDATA section can be used to avoid excessive quoting in element content.
xml-CDATA-constructor ::= <![CDATA[ xml-CDATA-content ]]>
xml-CDATA-content ::= any characters, not containing ]]>
A CDATA section is semantically equivalent to text consisting of the xml-CDATA-content, though some implementations may record that the text came from a CDATA so it can be written out the same way.
The following are equivalent:
#<p>Special characters <![CDATA[< > & ' "]]> here.</p>
#<p>Special characters &lt; &gt; &amp; &apos; &quot; here.</p>
The following specifies how the reader syntax is translated by the reader into standard S-expressions. These basically create macro invocations; the implementation is responsible for implementing those macros as described in the Semantics section. As an example:
#<a class="title">Result: &[sum].</a>
is read as if it were:
($xml-element$ () ($resolve-qname$ a) ($xml-attribute$ 'class "title") "Result: " $<<$ sum $>>$ ".")
The () in the result is the translation of any namespace declaration attributes; in this case there are none.
Here is an example with namespace declarations:
#<prefix2:a xmlns:prefix1="URI1" xmlns:prefix2="URI&foo;2" xmlns="DURI">...</prefix2:a>
This is read as:
($xml-element$ ((prefix1 "URI1") (prefix2 "URI" $entity$:foo "2") (|| "DURI")) ($resolve-qname$ a prefix2) ...)
The translation is defined in terms of a recursive read-time translation function Tr which maps an xml-constructor to an S-expression.
Tr[<QName xml-attribute*>xml-element-datum*</QName>]
  ⟾ Tr[<QName xml-attribute*>xml-element-datum*</>]
Tr[<QName xml-attribute*/>]
  ⟾ Tr[<QName xml-attribute*></>]
Tr[<xml-name-form xml-attribute*>xml-element-datum*</>]
  ⟾ ($xml-element$ (TrNamespaceDecl[xml-attribute]*)
       TrElementName[xml-name-form]
       TrAttr[xml-attribute]*
       TrContent[xml-element-datum]*)
TrContent is as in SRFI-109, except we add this rule:
TrContent[xml-constructor] ⟾ Tr[xml-constructor]
TrAttr[xml-namespace-declaration-attribute] ⟾ #|nothing|#
TrAttr[xml-enclosed-expression] ⟾ xml-enclosed-expression
TrAttr[xml-name-form=xml-attribute-value]
  ⟾ ($xml-attribute$ TrAttrName[xml-name-form] TrAttrValue[xml-attribute-value])
TrAttrValue["quot-attribute-datum*"] ⟾ TrContent[quot-attribute-datum]*
TrAttrValue['apos-attribute-datum*'] ⟾ TrContent[apos-attribute-datum]*
TrAttrValue[[expression*]] ⟾ expression*
TrAttrValue[(expression+)] ⟾ (expression+)
The namespace declarations are translated to a list of namespace-bindings, by default the empty list. There is a sub-list for each namespace-binding, where the first element is the prefix being bound, and the remaining elements (usually just a single string literal) are expressions that together evaluate to a namespace URI. The prefix is a symbol; in the case of a default element namespace, the prefix is either the empty symbol (||) or equivalently the reserved prefix name $default-element-namespace$.
TrNamespaceDecl[xml-true-attribute] ⟾ #|nothing|#
TrNamespaceDecl[xmlns:xml-prefix=xml-attribute-value]
  ⟾ (xml-prefix TrAttrValue[xml-attribute-value])
TrNamespaceDecl[xmlns=xml-attribute-value]
  ⟾ (|| TrAttrValue[xml-attribute-value])
Element (tag) names are translated by TrElementName, while attribute names are translated by TrAttrName. Both use TrElementOrAttrName when the name has a namespace prefix or is an enclosed expression. However, if there is no namespace prefix, then attribute names default to the empty namespace, but element names default to the current default element namespace (indicated by the prefix $default-element-namespace$).
TrElementName[identifier] ⟾ ($resolve-qname$ identifier)
TrAttrName[identifier] ⟾ (quote identifier)
TrAttrName[other-form] ⟾ TrElementOrAttrName[other-form]
TrElementName[other-form] ⟾ TrElementOrAttrName[other-form]
TrElementOrAttrName[prefix:local-name] ⟾ ($resolve-qname$ local-name prefix)
TrElementOrAttrName[(expression+)] ⟾ (expression+)
TrElementOrAttrName[[expression]] ⟾ expression
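The rules for literal names can be sketched as a small reader helper. This is only an illustration under stated assumptions: the procedures colon-index and translate-literal-name are hypothetical, the name token is assumed to have been scanned already as a string, and the enclosed-expression cases are left out.

```scheme
(import (scheme base))

;; Index of the first colon in str, or #f if there is none.
(define (colon-index str)
  (let loop ((i 0))
    (cond ((= i (string-length str)) #f)
          ((char=? (string-ref str i) #\:) i)
          (else (loop (+ i 1))))))

;; Hypothetical helper: translate a literal QName token to an S-expression.
;; element? is #t for TrElementName and #f for TrAttrName.
(define (translate-literal-name token element?)
  (let ((i (colon-index token)))
    (cond (i
           ;; prefix:local-name => ($resolve-qname$ local-name prefix)
           (list '$resolve-qname$
                 (string->symbol (substring token (+ i 1) (string-length token)))
                 (string->symbol (substring token 0 i))))
          (element?
           ;; Unprefixed element name: resolved against the default namespace.
           (list '$resolve-qname$ (string->symbol token)))
          (else
           ;; Unprefixed attribute name: a quoted symbol in the empty namespace.
           (list 'quote (string->symbol token))))))

(translate-literal-name "prefix2:a" #t) ; => ($resolve-qname$ a prefix2)
(translate-literal-name "p" #t)         ; => ($resolve-qname$ p)
(translate-literal-name "class" #f)     ; => (quote class)
```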
The special node constructors are translated similarly: (Note: This is simplified, since these forms should not handle escape characters the way element and attribute content does.)
Tr[<![CDATA[xml-CDATA-content]]>] ⟾ ($xml-CDATA$ "xml-CDATA-content")
Tr[<!--xml-comment-content-->] ⟾ ($xml-comment$ "xml-comment-content")
Tr[<?xml-PI-target xml-PI-content?>]
  ⟾ ($xml-processing-instruction$ "xml-PI-target" TrContent[xml-PI-content])
The above translation maps the new reader syntax to S-expressions using macros specified in this section. Of course it is possible to write these macro forms directly, though they are less human-readable. However, code generators and macros may target these macros. This format can also be used as an interchange format.
When below we say that an expression "creates an element node" we mean that it creates a representation of an element value. The default implementation should create a unique object, of a sub-type of XML-node. However, a keyword such as $xml-element$ may be bound to a user-defined macro, in which case the element value may be something very different and perhaps ephemeral, such as a network encoding.
The specification does not define a Scheme API for working with XML data. It assumes there is some data type which we here call an XML-node. This specification does not require the XML-node type to be distinct from other types. Many Scheme XML libraries just use lists to encode XML-nodes. However, newer Schemes that have an extensible type system are encouraged to make XML-node a distinct type. This follows the W3C Document Object Model (DOM).
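As an illustration of the list-based approach, the following sketch binds the translated forms to macros that build plain tagged lists. It is deliberately simplistic: namespace bindings are ignored, prefixes are dropped, enclosed-expression markers are not handled, and the (element ...) and (attribute ...) list shapes are assumptions rather than anything this specification requires.

```scheme
(import (scheme base))

;; Assumption: a QName is represented by its local name as a symbol.
(define-syntax $resolve-qname$
  (syntax-rules ()
    ((_ local-name) 'local-name)
    ((_ local-name prefix) 'local-name)))

;; Assumption: an attribute node is the list (attribute name content ...).
(define-syntax $xml-attribute$
  (syntax-rules ()
    ((_ name content ...) (list 'attribute name content ...))))

;; Assumption: an element node is the list (element name item ...).
;; The namespace bindings are accepted but ignored in this sketch.
(define-syntax $xml-element$
  (syntax-rules ()
    ((_ (binding ...) name item ...)
     (list 'element name item ...))))

($xml-element$ () ($resolve-qname$ p)
  "The result is "
  ($xml-element$ () ($resolve-qname$ b) "final")
  "!")
;; => (element p "The result is " (element b "final") "!")
```

A fuller implementation would create a distinct XML-node type and honor the namespace bindings.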
($xml-element$ (namespace-binding*) name attribute* content*)
Creates an element node. Each namespace-binding is a list of the form (prefix namespace-uri-part+). The prefix is a literal symbol that represents a namespace prefix; a zero-length symbol is equivalent to the symbol $default-element-namespace$. Each namespace-uri-part is an expression that evaluates to a string; the namespace-URI is the concatenation of the parts. Normally there will be a single namespace-uri-part that is a literal string; there may also be entity references. An implementation may allow non-literal expressions, but is not required to.
The name is an expression that evaluates to a symbol or a QName, most commonly a quoted symbol or a $resolve-qname$ form.
The binding for $xml-element$ must be a macro, not a function, because each namespace-binding adds a (prefix,URI)-binding in the lexical context. That binding is used to evaluate QNames in the remaining parameters, which are all expressions, including the name.
Each attribute is usually an $xml-attribute$ form, but an implementation may support other expressions that evaluate to attribute nodes.
Each content is an expression that evaluates to element content, handled as described in the Handling of enclosed expressions section.
($xml-attribute$ name content*)
Creates an attribute node from the parameters. The name is an expression that evaluates to a symbol or QName value. The content arguments are concatenated to produce the attribute value.
($resolve-qname$ local-name [prefix])
Resolves the given prefix/local-name pair to a QName value, depending on the currently active namespace bindings. Both arguments are literal unquoted symbols. If prefix is missing it defaults to $default-element-namespace$.
($xml-comment$ content*)
($xml-CDATA$ content*)
($xml-processing-instruction$ xml-PI-target content*)
Creates a comment, CDATA, or processing instruction (PI) node. The xml-PI-target should be a string that matches NCName. The content arguments should be strings, which are concatenated.
$entity$:xml-entity-name
The xml-entity-name is an unquoted symbol. Returns a string value matching the entity name. For example:
$entity$:lt ⟹ "<"
$<<$
$>>$
These serve to delimit enclosed expressions, but are otherwise ignored in content. A simple implementation is to bind them to unique objects.
Handling of enclosed expressions
Both element content and attribute values may contain xml-enclosed-expressions. These are expressions evaluated at runtime, where the evaluated result becomes part of the element content or the attribute value.
If the expression evaluates to an element, comment, or processing-instruction node, and the context is element content, then the node is added as a child of the element. It is unspecified if the node is copied or shared. It is also unspecified what happens if the expression result is some other kind of XML-node, or the context is an attribute value.
If the expression evaluates to a string, the result is pasted as text (child) content of an element or as a substring of an attribute value, respectively.
If the expression evaluates to a CDATA segment, the result is equivalent to the string value of the segment.
If the expression evaluates to some other scalar value (including numbers, booleans, and characters) the value is converted to a string according to implementation-specified rules. An implementation may convert a value as if using display. Alternatively, an implementation may convert a value to yield a canonical representation according to the XML Schema specification. (In the latter case, booleans #f and #t should yield false and true, respectively.)
If the expression evaluates to a list or vector, then each element is inserted into the element or attribute content. Spaces are inserted between two elements if neither element is an XML-node.
Note that some XML specifications (including XML Schema and the XQuery and XPath data model) have the concept of the typed value of a node. The typed value may be a number, a string, or another atomic type. The typed value may also be a sequence of strings, numbers, or other atomic values. Some implementations may optionally store the typed value instead of or in addition to the text value. For example:
#<prices>&(vector 230 599 98 763)</prices>
It is undefined whether in the XML-node the content is stored as a sequence of four integers, or as the string "230 599 98 763", as long as the result prints the same way.
Output of XML nodes
If XML-node is a separate data-type, implementations are encouraged to use this XML-literal format when writing to an output port, since this provides input-output round-tripping. Specifically, calling write on an XML-node should write an xml-literal (with an initial #). Calling display on an XML-node should write an xml-constructor (without an initial #). The xml-constructor should be in standard XML syntax without using any of the extensions in this specification, such as an unnamed end-tag, or an unescaped ]]>. In fact, it is strongly recommended that if > appears in element or attribute content it should be written as the escaped form &gt;. In addition, control characters (in attribute content also including newline or tab) should be escaped using character references.
Alternatively, for display (not write), if the output port is an extended port that can handle rich text then an implementation may instead display a styled representation. For example if the XML-node is compatible with HTML, and the output port is inserting text into a browser document, then the implementation may copy the DOM into the browser, perhaps resulting in styled text.
Implementation
The implementation is necessarily non-portable, though the Translation section provides a template for the reader part. The Kawa Scheme implementation has working support for this reader extension.
Implementing the translated forms should mostly be obvious: just call an appropriate function to create the XML-node. Handling namespace definitions is non-obvious, however.
Handling namespaces
The form:
($xml-element$ ((prefix namespace-uri-part+)*) name attribute* content*)
can be translated into something like:
(let ()
  (define-namespace prefix (string-append namespace-uri-part+))*
  (make-element name
    attribute* content*))
The translation is more complicated if in-scope namespace bindings are part of the run-time properties of the constructed element. In that case the set of bindings needs to be passed to the make-element function.
The initial environment has predefined:
(define-namespace $default-element-namespace$ "")
A simple way to implement define-namespace is to expand:
(define-namespace prefix namespace-uri)
to:
(define $namespace$:prefix namespace-uri)
In that case:
($resolve-qname$ local-name prefix)
could be implemented as:
(make-qname local-name prefix $namespace$:prefix)
assuming a 3-argument make-qname function that creates a QName with the given local-name, prefix, and namespace-uri.
Implementations should provide a custom error message in the case $namespace$:prefix is undefined, rather than depend on a generic error message.
Test suite
There is a test suite in the Kawa source tree.
Copyright
Copyright (C) Per Bothner 2013
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Author: Per Bothner
Editor: Mike Sperber