Per Bothner
<per@bothner.com>
This SRFI is currently in ``draft'' status. To see an explanation of
each status that a SRFI can hold, see here.
To provide input on this SRFI, please mail to <srfi minus 109 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.
This specifies a reader extension for extended string quasi-literals, including nicer multi-line strings, and enclosed unquoted expressions.
This proposal is related to SRFI-108 (named quasi-literal constructors) and SRFI-107 (XML reader syntax), as they share quite a bit of syntax.
This proposal aims to help with a number of related problems concerning string literals.
Standard Scheme string literals are awkward for multi-line strings.
One problem is that the same delimiter (double-quote) is used for both
the start and end of the string. This is error-prone and not robust:
adding or removing a single character changes the meaning of the entire
rest of the program.
A related problem is that if the delimiter appears in the string it needs to be quoted using an escape character, which can become hard to read. If we have distinct start and end delimiters, then we only need to escape unbalanced uses of the delimiters.
A common solution is a here document, where distinct multi-character start and end delimiters are used. For example the Unix shell uses << followed by an arbitrary token as the start delimiter, and then the same token as the end delimiter:
tr a-z A-Z <<END_TEXT
one two three
uno dos tres
END_TEXT
This proposal uses just &{ and } as the default start and end delimiters, respectively:
(string-upcase &{
one two three
uno dos tres
})
Commonly one wants to construct a string as a concatenation of literal text and evaluated expressions. Using explicit string concatenation (Scheme string-append or Java's + operator) is verbose and can be error-prone. Using format is an alternative, but it is also a bit verbose. Worse, the format specifier and the expression it controls are non-adjacent, which is awkward and error-prone. Nicer is to be able to use variable interpolation, as in Unix shells:
echo "Hello ${name}!"
This proposal uses the syntax:
&{Hello &[name]!}
Note that & is used both as part of the prefix &{ to mark the entire string, and as an escape character within the string. See the discussion of delimiter options in SRFI-108.
Going one step further, a template processor has many uses. Examples include BRL and JSP, which are both used to generate web pages.
The simple solution is to allow general Scheme expressions in substitutions:
&{Hello &[(string-capitalize name)]!}
You can also leave out the square brackets when the expression is a parenthesized expression:
&{Hello &(string-capitalize name)!}
Note that this syntax for unquoted expressions matches that used in SRFI-107 (XML reader syntax).
By default there is a one-to-one mapping between whitespace in the literal and the resulting string (except that line-ending is normalized to the newline character), but it is often convenient (or at least prettier) for them to be different.
You can of course easily add extra newline characters beyond those in the literal:
&{a&newline;b} ⟹ "a\nb"
Conversely, the line-continuation marker &- is used to suppress a newline:
&{abc&-
  def} ⟹ "abc  def"
The marker also suppresses any intraline whitespace between the &- and the newline, but it does not suppress intraline whitespace following the newline. In the latter respect it differs from the \ at the end of a line in an R6RS string literal.
Suppressing initial whitespace is more generally useful than just for continuation lines. For example it is important for properly indenting source code to match the program structure. The indentation marker &| is used to mark the end of insignificant initial whitespace, typically to indent strings inside a function. The &| characters and all the preceding whitespace are removed:
(display
  (string-upcase &{
     &|one two three
     &|uno dos tres
  }) out)
As a matter of style, all of the indentation markers should line up; an implementation may warn if indentation is inconsistent. It is an error if there are any non-whitespace characters between the previous newline and the indentation marker. It is also an error to write an indentation marker before the first newline in the literal.
One does not normally want an initial newline in a multi-line string. However, as in the above example, the natural way to write this is with the left brace on the previous line; otherwise either the source is wrongly indented, or the matching columns in the result don't line up in the source. For that reason &| also suppresses an initial newline. Specifically, this applies when the initial left brace is followed by optional (invisible) intraline-whitespace, then a newline, then optional intraline-whitespace (the indentation), and finally the indentation marker &|; all of this is removed from the output. Otherwise the &| only removes the initial intraline-whitespace on the same line (and itself).
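The indentation rules above can be sketched as ordinary string processing. The following is only an illustration in R7RS Scheme, not how a reader has to be written: strip-indentation and its helpers are hypothetical names, the input is the raw text between the braces, only space and tab count as intraline whitespace, and neither the error cases nor any other escapes are checked.

```scheme
(import (scheme base))

;; Split TEXT at each newline character; always returns at least one line.
(define (split-lines text)
  (let loop ((chars (string->list text)) (line '()) (lines '()))
    (cond ((null? chars)
           (reverse (cons (list->string (reverse line)) lines)))
          ((char=? (car chars) #\newline)
           (loop (cdr chars) '() (cons (list->string (reverse line)) lines)))
          (else
           (loop (cdr chars) (cons (car chars) line) lines)))))

;; Index just past a leading  intraline-whitespace* "&|"  in LINE, or #f.
(define (marker-end line)
  (let ((len (string-length line)))
    (let loop ((i 0))
      (cond ((>= (+ i 1) len) #f)
            ((memv (string-ref line i) '(#\space #\tab)) (loop (+ i 1)))
            ((and (char=? (string-ref line i) #\&)
                  (char=? (string-ref line (+ i 1)) #\|))
             (+ i 2))
            (else #f)))))

;; True if LINE contains only spaces and tabs (or is empty).
(define (blank? line)
  (let loop ((i 0))
    (or (= i (string-length line))
        (and (memv (string-ref line i) '(#\space #\tab))
             (loop (+ i 1))))))

;; Remove the marker and the whitespace before it, if the line has one.
(define (strip-line line)
  (let ((end (marker-end line)))
    (if end (substring line end (string-length line)) line)))

(define (join-lines lines)
  (let loop ((rest (cdr lines)) (acc (car lines)))
    (if (null? rest)
        acc
        (loop (cdr rest) (string-append acc "\n" (car rest))))))

(define (strip-indentation text)
  (let* ((lines (split-lines text))
         ;; A blank first line followed by a marked line is the
         ;; initial-ignored case: the initial newline is dropped too.
         (lines (if (and (pair? (cdr lines))
                         (blank? (car lines))
                         (marker-end (cadr lines)))
                    (cdr lines)
                    lines)))
    (join-lines (map strip-line lines))))

(strip-indentation "\n     &|one two three\n     &|uno dos tres\n  ")
;; => "one two three\nuno dos tres\n  "
```

Note that the two spaces before the closing brace survive in this example, since no marker precedes them.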
However, traditionally there should be a final newline in a multi-line string. So the following styles are suggested. If the text is at top-level, or more generally, the closing brace is in the first column, then write it like this:
(define help-message &{
    &|This is the first of 2 lines.
    &|This last line is followed by a final newline.
})
When the text is nested so that the closing brace should not be in the left column, you can use an extra indentation marker, like this:
(display
  (string-upcase &{
     &|This is the first of 2 lines.
     &|This last line is followed by a final newline.
     &|}) out)
Note that in the above there are 3 indentation markers, but the resulting string has 2 lines and a total of 2 newline characters, because the first indentation marker suppresses the initial newline.
If you do not want to end the final line with a newline, you can either use a line-continuation marker, or end the line with the closing brace:
(display
  (string-upcase &{
     &|This is the first of 2 lines.
     &|This last line is not followed by a final newline.}) out)
For long strings it may be useful to embed comments, even though this is redundant since it could be done using enclosed expressions:
&{preamble &[#|ignore this part|#] postamble}
However, this seems clumsy, so this specification has a comment syntax:
&{preamble &#|ignore this part|# postamble}
For example for line numbers:
(display
  (string-upcase &{
     &|&#|line 1|#one two
     &|&#|line 2|# three
     &|&#|line 3|#uno dos tres
  }) out)
(It is tempting to allow comments before a &| indentation marker, but it entails more complexity than seems justified.)
We support the standard XML syntax for character references, using either decimal or hexadecimal values. The following string has two instances of the ASCII escape character, as either decimal 27 or hex 1B:
&{&#27;&#x1B;}
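As a small illustration of how a character reference maps to a character, here is a sketch in R7RS Scheme. The helper char-ref->char is hypothetical (it is not part of this SRFI), it takes the text between the leading & and the trailing ; and it does no error checking.

```scheme
(import (scheme base))

;; Hypothetical helper: REF is the text between & and ; of a char-ref,
;; for example "#27" (decimal) or "#x1B" (hexadecimal).
(define (char-ref->char ref)
  (let* ((hex? (and (> (string-length ref) 1)
                    (char=? (string-ref ref 1) #\x)))
         (digits (substring ref (if hex? 2 1) (string-length ref)))
         (code (string->number digits (if hex? 16 10))))
    (integer->char code)))

;; Both spellings denote the escape character, code point 27.
(char->integer (char-ref->char "#27"))   ; => 27
(char->integer (char-ref->char "#x1B"))  ; => 27
```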
You can also use the pre-defined XML entity names:
&{&amp; &lt; &gt; &quot; &apos;} ⟹ "& < > \" '"
In addition, &lbrace; and &rbrace; can be used for left and right curly brace:
&{&rbrace;_&lbrace;} ⟹ "}_{"
Note that these are only needed for unbalanced braces:
&{A left brace '{' followed by a right brace '}' is ok.} ⟹ "A left brace '{' followed by a right brace '}' is ok."
An implementation must support the character names amp, lt, gt, quot, apos, lbrace, and rbrace. An implementation should support the standard XML entity names (though resource-limited or non-Unicode-based implementations are not required to). For example:
&{L&aelig;rdals&oslash;yri} ⟹ "Lærdalsøyri"
An implementation should also support the standard R7RS character names null, alarm, backspace, tab, newline, return, escape, space, and delete. For example:
&{&escape;&space;}
The reader translates the entity reference &name; to the variable reference $entity$:name. Therefore user-defined entity names are possible:
(define $entity$:crnl "\r\n")
&{&crnl;} ⟹ "\r\n"
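Because entity references become plain variable references, one way an implementation might provide the required names is as ordinary definitions. The sketch below assumes that approach. Its $string$ is a simplified stand-in that just concatenates its arguments, and the last form is what a reader might produce for &{&rbrace;_&lbrace;&crnl;}, written out by hand.

```scheme
(import (scheme base) (scheme write))

;; The required entity names, defined as ordinary variables.
(define $entity$:amp "&")
(define $entity$:lt "<")
(define $entity$:gt ">")
(define $entity$:quot "\"")
(define $entity$:apos "'")
(define $entity$:lbrace "{")
(define $entity$:rbrace "}")

;; A user-defined entity, as in the example above.
(define $entity$:crnl "\r\n")

;; Simplified stand-in for $string$ (no enclosed expressions here).
(define ($string$ . args)
  (let ((port (open-output-string)))
    (for-each (lambda (arg) (display arg port)) args)
    (get-output-string port)))

;; Hand-written reader output for &{&rbrace;_&lbrace;&crnl;}
($string$ $entity$:rbrace "_" $entity$:lbrace $entity$:crnl)
;; => "}_{\r\n"
```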
This section discusses some ideas that seem worthwhile, but need more thought, so are deferred for now.
Only the characters '{', '}', and '&' are reserved and thus need special escaping. Braces only need escaping when unbalanced, which is likely to be rare in both text and quoted programs, thus the only real problem is &. A common solution in other languages is doubling. That is, one could read && as a single &. However, doubling is not otherwise used in Scheme, so it may not be worth adding as a special case.
It might be convenient to support standard string single-character slash escapes in some form. For example:
&{Hello!&\r&\n} ⟹ "Hello!\r\n"
Maybe not really needed, since one could just write:
&{Hello!&["\r\n"]}
Many Scheme implementations use format for finer-grained control of the output. A problem with format is that the association between format specifiers and data expressions is positional, which is hard to read and error-prone. A better solution places the specifier adjacent to the data expression:
&{The response was &~,2f(* 100.0 (/ responses total))%.}
The reader would map this to:
($string$ "The response was " ($format$ "~,2f" (* 100.0 (/ responses total))) "%.")
A simple definition of $format$ (using #f as the destination, so that format returns a string rather than printing it):
(define ($format$ fmt . args) (apply format #f fmt args))
Implementations that support printf-style formatting can optionally support printf-style specifiers as well:
&{The response was &%.2f(* 100.0 (/ responses total))%.}
This would be read as:
($string$ "The response was " ($sprintf$ "%.2f" (* 100.0 (/ responses total))) "%.")
(The JavaFX Script language provided similar functionality.)
Internationalization refers to a framework so that text messages can be emitted in multiple (human) languages, depending on the user's preferred locale. See SRFI-29. Strings that may need to be translated are marked specially. For the sake of discussion we can use the prefix ^ followed by a key:
&^hello{Hello!}
Here the key is the string hello. At runtime this key is combined with the current language to produce a translated string. If no translation is found, then the string in the literal, Hello!, is used. If there is no explicit key, the string is used as the key. In the following, "Hello!" is used as the key.
&^{Hello!}
A simple implementation of $format$ as a call to the format function does not handle format specifiers that change the argument order. These are primarily useful for localizing messages, since one might want to change the argument order when translating from one language to another. Consider this warning message:
&^{'&[partition]' has only &[avail] bytes free.}
A translation might want to re-order the arguments, as if it were:
&^{Only &[avail] bytes free on '&[partition]'.}
That could be done if the translation database provides for a format that re-orders the arguments, perhaps using the tilde-asterisk format specifier forms. For example (to pick some hypothetical translation database syntax):
"'&[]' has only &[] bytes free." => "Only &~1@*~d[] bytes free on '&~0@*~s[]'."
It follows that we can't use a one-to-one translation from a format-specifier ($format$) to a call to the format function. Instead we need to work with a single format string constructed from the entire text to be localized. This complicates the implementation. The basic algorithm should be something like:
1. Construct the text-part of the literal by replacing each enclosed-part by &[] if there is no format-specifier, and by &[specifier] if there is one.
2. Use the explicit key if there is one, and otherwise the text-part, as the lookup key (gettext-style). Look for a translation in the translation database. If one is found, use that as the translated text-part; otherwise use text-part as-is.
3. Convert the translated text-part to a format string: escape any ~ characters, replace each &[] by ~a, and each &[specifier] by the specifier.
4. Call format with the resulting format string and the enclosed expressions as the arguments.
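The lookup and conversion steps of this algorithm can be sketched as follows. This is only an illustration under simplifying assumptions: the procedure names are hypothetical, the translation database is a plain association list keyed by the text-part, only the &[] form is handled (not &[specifier]), and the French translation is made up for the example. The final call to format is left out because format is not part of R7RS small.

```scheme
(import (scheme base) (scheme cxr))

;; Convert a text-part to a format string: each &[] becomes ~a and
;; each literal ~ is escaped as ~~.  Specifier forms are not handled.
(define (text-part->format-string text)
  (let loop ((chars (string->list text)) (out '()))
    (cond ((null? chars)
           (list->string (reverse out)))
          ;; &[] marks an enclosed-part without a format-specifier.
          ((and (char=? (car chars) #\&)
                (pair? (cdr chars)) (char=? (cadr chars) #\[)
                (pair? (cddr chars)) (char=? (caddr chars) #\]))
           (loop (cdddr chars) (append (list #\a #\~) out)))
          ;; Escape a literal tilde.
          ((char=? (car chars) #\~)
           (loop (cdr chars) (append (list #\~ #\~) out)))
          (else
           (loop (cdr chars) (cons (car chars) out))))))

;; Look up TEXT-PART in TRANSLATIONS (an alist of strings); fall back
;; to the text-part itself when there is no entry.
(define (localize text-part translations)
  (let ((entry (assoc text-part translations)))
    (text-part->format-string (if entry (cdr entry) text-part))))

(define translations
  '(("Hello &[]!" . "Bonjour &[] !")))   ; hypothetical database

(localize "Hello &[]!" translations)                      ; => "Bonjour ~a !"
(localize "'&[]' has only &[] bytes free." translations)
;; => "'~a' has only ~a bytes free."
```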
Many languages, including the Bourne shell, allow for a user-defined end token. We could allow this as an option following a marker character, for example !:
(string-upcase &!END-TEXT{
one two three
uno dos tres
}!END-TEXT)
Sometimes you want to insert all the values of a vector or list in an enclosed-part. I.e. you want to splice the elements of the list/vector into the result string. This is similar to the splicing of a list in quasi-quotation. It seems reasonable to use the same prefix character @. Thus:
(define exp (list e1 e2 ... en))
&{_&[@exp]_}
should be equivalent to:
&{_&[e1 e2 ... en]_}
This can be implemented using the ~{ ~} iteration format specifiers from Common Lisp, if the implementation supports those:
&{_&~{~a~}[exp]_}
expression ::= ... | extended-string-literal
extended-string-literal ::= &{ initial-ignored? string-literal-part* }
string-literal-part ::= any character except &, { or }
    | { string-literal-part* }
    | char-ref
    | entity-ref
    | special-escape
    | enclosed-part
char-ref ::= &# digit+ ;
    | &#x hex-digit+ ;
entity-ref ::= & char-or-entity-name ;
char-or-entity-name ::= tagname
initial-ignored ::= intraline-whitespace line-ending intraline-whitespace &|
special-escape ::= intraline-whitespace &|
    | & nested-comment
    | &- intraline-whitespace line-ending
enclosed-part ::= & enclosed-modifier [ expression* ]
    | & enclosed-modifier ( expression+ )
tagname ::= tagname-initial tagname-subsequent*
tagname-initial ::= letter
tagname-subsequent ::= tagname-initial | digit | - (hyphen) | _ (underscore) | . (period)
If we allowed tagname to be an arbitrary Scheme identifier there would be parsing difficulties. One problem is that we use &| to skip indentation, but R7RS identifier syntax uses | as a delimiter for symbols with special characters. Another conflict is if an implementation uses &~ or &% to indicate format specifiers, since these are allowed as R7RS identifier initial characters.
An implementation may extend tagname to match Name as defined by the XML 1.1 specification.
The following are defined by R6RS: nested-comment, intraline-whitespace, line-ending, letter, digit, and hex-digit.
An enclosed-modifier is normally empty:
enclosed-modifier ::= empty
However, implementations or future extensions may support non-empty modifiers. For example, Kawa supports both format-style and printf-style specifiers, so the syntax is:
enclosed-modifier ::= empty
    | ~ format-specifier-after-tilde (optional feature)
    | % format-specifier-after-percent (optional feature)
When the Scheme reader reads an extended-string-literal it returns a list whose first element is the symbol $string$, and whose remaining elements are the translations of the string-literal parts. The literal content (including each char-ref but excluding each entity-ref) is translated to literal strings. An entity-ref &ename; is translated to a symbol $entity$:ename. Enclosed expressions are prefixed by a $<<$ symbol, and followed by a $>>$.
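To make the translation concrete, the sketch below writes out by hand the forms a reader might return for two of the earlier examples, and evaluates them with the basic definitions of $string$, $<<$ and $>>$ that this SRFI gives. The variable name and the use of string-upcase (instead of string-capitalize, which is not in R7RS small) are choices made only for this example.

```scheme
(import (scheme base) (scheme char) (scheme write))

;; Marker objects and $string$, following the basic definitions
;; given in this SRFI.
(define $<<$ (make-string 0))
(define $>>$ (make-string 0))

(define ($string$ . args)
  (let ((port (open-output-string)))
    (for-each (lambda (arg)
                (if (and (not (eq? arg $<<$))
                         (not (eq? arg $>>$)))
                    (display arg port)))
              args)
    (get-output-string port)))

(define name "world")

;; &{Hello &[name]!} would be read as:
($string$ "Hello " $<<$ name $>>$ "!")
;; => "Hello world!"

;; &{Hello &(string-upcase name)!} would be read as:
($string$ "Hello " $<<$ (string-upcase name) $>>$ "!")
;; => "Hello WORLD!"
```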
The translation is defined by a conceptual read-time rewrite function Tr which maps an extended-string-literal in the input stream to an equivalent $string$ list, which is then (conceptually) re-read. (A real reader would generate S-expression forms directly, but this way we can express the translation more concisely.)
Tr[&{ initial-ignored? content-segment* }] ⟾ ($string$ TrContent[content-segment]*)
Each segment corresponds to a string-literal-part in the syntax, except that a run of multiple plain characters and char-refs is combined into a single string literal. In addition the special-escape forms are dropped without appearing in the result.
TrContent[simple-text+] ⟾ " TrText[simple-text]+ "
TrText[any character except &, \, line-ending, or final (unbalanced) }] ⟾ that character as-is
TrText[line-ending] ⟾ \n
TrText[\] ⟾ \\
TrText[&#x hex-digit+ ;] ⟾ \x hex-digit+ ;
TrText[&# digit+ ;] ⟾ \x corresponding hex-digits ;
TrText[& nested-comment] ⟾ (empty)
TrText[intraline-whitespace &|] ⟾ (empty)
TrText[&- intraline-whitespace line-ending] ⟾ (empty)
Translations for the other segment kinds are straightforward:
TrContent[& ename ;] ⟾ $entity$:ename
TrContent[&( expression+ )] ⟾ $<<$ (expression+) $>>$
TrContent[&[ expression* ]] ⟾ $<<$ expression* $>>$
The following are optional and/or for a future specification:
TrContent[&~ format ( expression+ )] ⟾ ($format$ "format" (expression+))
TrContent[&~ format [ expression+ ]] ⟾ ($format$ "format" expression+)
TrContent[&% format ( expression+ )] ⟾ ($sprintf$ "format" (expression+))
TrContent[&% format [ expression+ ]] ⟾ ($sprintf$ "format" expression+)
($string$ form ...) evaluates approximately to an immutable string created by concatenating each form. A basic implementation could be:
(define ($string$ . args)
  (let ((port (open-output-string)))
    (for-each (lambda (arg)
                (if (and (not (eq? arg $<<$))
                         (not (eq? arg $>>$)))
                    (display arg port)))
              args)
    (get-output-string port)))
The string created by a $string$ form is immutable, and need not have a unique identity. E.g. if the operands are constant then an implementation is allowed to constant-fold the expression to a string literal. In addition $<<$ and $>>$ are bound to unique objects, distinct from each other and from other objects (as determined by eq?). These bindings should preferably be non-assignable if an implementation has a mechanism for that (for example using identifier macros).
(define $<<$ (make-string 0))
(define $>>$ (make-string 0))
Note that R6RS and R7RS-draft allow eq? to return #t for distinct calls to (make-string 0). An implementation that does so needs to initialize $<<$ and $>>$ some other way.
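One possible alternative, offered here only as a sketch and not as something this SRFI prescribes, is to bind the markers to freshly allocated pairs. Each call to list with at least one argument allocates a new pair, and distinct pairs are never eq?. The basic $string$ procedure still works with such markers, since it skips them before calling display.

```scheme
(import (scheme base))

;; Fresh pairs as marker objects; the symbol inside only aids debugging.
(define $<<$ (list '$<<$))
(define $>>$ (list '$>>$))

(eq? $<<$ $>>$)           ; => #f  (two distinct allocations)
(eq? $<<$ (list '$<<$))   ; => #f  (equal? but not eq?)
(eq? $<<$ $<<$)           ; => #t
```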
If $format$ is supported, a minimal implementation is:
(define-syntax $format$
  (syntax-rules ()
    (($format$ fmt arg ...)
     (format #f fmt arg ...))))
Since this specification changes the reader format, and there is no standard Scheme way to do that, there is no portable implementation. However, this specification is being implemented in Kawa. (Check out the development version using Subversion.)
A more sophisticated implementation of the $string$ macro, which maps to a single format call, is at the time of writing in syntax.scm.
Copyright (C) Per Bothner 2013
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.