Q2 - extensible syntax with indentation
Q2
is the (temporary?) name of a project
where I'm exploring some ideas in programming languages.
At this point it is not a complete or coherent language,
but you might find some ideas of interest.
This article looks at the overall syntactic structure of Q2 programs.
Future articles will look at other issues (declarations and patterns;
logic variables and unification; sequences and loop comprehensions) as they
get implemented.
To try the examples get the SVN development sources
and build them. You can then run Q2 code by passing the --q2
language switch on the kawa
command line, or using the .q2
file extension.
There are some examples in the gnu/q2/testsuite
directory.
Extensible syntax with no reserved identifiers
Traditional Lisp/Scheme syntax has many virtues, primarily a minimal fixed core syntax combined with a very flexible and extensible structural syntax. There are no reserved identifiers, and the pre-defined syntax for definitional forms and control flow is not hard-coded: they can in principle be defined in the language itself, and you can replace and add to them.
Q2 follows these goals: there is a fixed core (lexical) syntax of tokens
and delimiters, which is read by a (Lisp-like) reader.
Name-resolution and high-level syntactic forms are processed
later, at macro-expansion time.
Support infix operators with precedence
Infix operators with multiple precedence levels may not be as elegant as Lisp-style prefix notation, but most people are familiar with the former, and it seems to make for somewhat more readable programs. Thus Q2 supports them. This expression:

b := x + 1 > 3 + y * 2

means the same as:

b := ((x + 1) > (3 + (y * 2)))

Assignment (set!) uses the operator :=, while numerical equality uses == (for now). The plan is to use = for definitional equality - and ultimately unification.
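As a small (hypothetical) sketch combining the operators described so far:

define x 5
x := x + 1    # assignment using := (Scheme set!)
x == 6        # numeric equality test; yields true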
Infix operators appear to conflict with the goal of extensible syntax
with no reserved identifiers, if we allow user-defined operator tokens:
operator tokens are just regular identifiers, which could be re-defined (as, say, a function),
and we allow forward references (at least of functions).
How can we tell whether * is the built-in binary
operator, or whether it gets redefined as a function later in the script?
The solution is for the lexical parser (reader) to just return
a list of tokens. Splitting the token list according to
operator precedence is done later - just like macro-expansion is.
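As a sketch of the problem (the names here are hypothetical): the reader tokenizes both of the following lines without deciding what * means; the later precedence-splitting and name-resolution phases settle that.

define (double y) (y * 2)   # at read time, * is just another token
define (* a b) (a + b)      # a later (and unwise) redefinition of * as a function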
Use juxtaposition for function application
Haskell and ML use juxtaposition for function application, rather than parentheses and commas. For example, to call the atan
function with two arguments x and y:
atan x y
This is compact, easy to type, and easy to read.
It is similar to the syntax of many command languages
including sh.
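When an argument is itself a compound expression, parentheses are still used for grouping. A sketch, where f is some hypothetical one-argument function:

atan (f x) y   # the parenthesized sub-expression is a single argument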
Naming a zero-argument function applies it
The complication is: how do you distinguish a variable
reference from a call to a zero-argument function?
This can get tricky in a language with first-class functions,
since the value of a variable can be a function value.
(This is not an issue in Haskell or other pure lazy functional languages,
since there is no difference between applying a zero-argument function
and accessing a lazily-initialized variable.)
Some languages use a special syntax or operator to force application
of zero-argument functions, but I think that is backwards, and
inconsistent with command languages.
For example, if foo1 is a function that can take
zero or one arguments, then:

foo1      # Called with zero arguments
foo1 10   # Called with one argument
The exact rule for a distinguishing between a variable reference and a zero-argument function application isn't decided yet. For now it just depends on the run-time type of the identifier: If the value of the variable is a function that can take zero arguments, apply the function, otherwise return the value. An alternative rule is to apply the function only if the identifier is lexically bound to a function definition.
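Under the current run-time rule, the difference might look like this (greet is a hypothetical zero-argument function; the behavior may change as the rule is refined):

define x 10
define (greet) (format #t "hello!~%")
x       # the value is not a function, so this is a plain variable reference
greet   # the value is a zero-argument function, so it gets applied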
Flexible token format
Expressions are constructed from lexical tokens.
A token can be (among other things) a name (an identifier),
a literal, or an operator.
Supporting an extensible syntax with infix operators suggests that an operator
should just be an identifier using non-alphanumeric characters.
Extensibility is also helped if programmers can add new kinds of literals.
So it seems like a good idea to be rather flexible and open-ended
in terms of what characters can comprise a token or an identifier.
Many languages only allow alphanumeric characters (with perhaps one or
two extra characters allowed), but
Lisp-family languages traditionally allow a wider class of characters,
including arithmetic symbols like +.
Q2 follows this tradition.
The downside is that more spaces are needed to separate adjacent tokens. That seems a good trade-off - the spaces are probably a good idea for readability, regardless. Furthermore, if we're using spaces to separate function arguments, it seems reasonable to also require spaces to separate infix operators from operands.
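For instance (a sketch of the consequence of treating + as an identifier character):

x + 1   # three tokens: x, +, and 1
x+1     # a single identifier token named x+1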
Use indentation for grouping
Some languages (including Haskell and Python) indicate block structure based on the indentation of the lines of a program. I think this makes sense - it allows for a cleaner layout with fewer redundant delimiters (redundant because good programmers use indentation anyway). It also avoids the common annoyance of noise lines that only contain a closing delimiter.
However, it is more of a challenge to make use of indentation in a
Lisp-like language without a fixed syntax. There are some experiments
where essentially a pre-processor adds parentheses based on indentation,
but using indentation just as syntactic sugar
for parentheses doesn't make for nice-looking programs.
The basic use of indentation in Q2 is simple. For example:

f1 a b c d
  f2 e f
    f3 g
    h
  f4 i j

is equivalent to:

f1 a b c d (f2 e f (f3 g) (h)) (f4 i j)

I.e. there are implicit grouping parentheses around each line and each following line that is indented more.
We'd like to add local variable and function definitions inside an indented block, but first a digression.
Block expressions yield multiple values
Many languages provide blocks, consisting of zero or more
declarations, expressions, and other statements.
In an expression language, the value of the block
is usually the value of the final expression.
In Q2 instead the value of the block is the value of all
the sub-expressions, as a tuple or multiple values. For example:

(10-7; 2+2; 4+1)

The above is a block with 3 sub-expressions, which evaluates to 3 values: 3, 4, and 5. If this is a top-level expression then all 3 values are printed. Note that the parentheses are just for grouping, so are optional if this is the outermost expression.
The semi-colon is a special delimiter that cannot appear in identifiers,
but semantically it is a binary infix operator that concatenates
the values of the operands.
Note that multiple values are not the same as sequences or lists: for one thing multiple values do not nest:
((1;2); (3; 4))

This evaluates to 4 values, not 2.
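In other words, given the flattening just described, the nested form yields the same values as:

(1; 2; 3; 4)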
If we think of declarations, assignments, and other statements
as expressions that return zero results, then it all works out.
This example (which uses the old-style Scheme define for now)
evaluates to a vector with a single element valued 10:

vector (define v 5; v + v)
Newlines are roughly equivalent to semi-colons in that they concatenate multiple values. Here is an example of the top-level of a script, which is a block of 4 sub-expressions:

define x 10
x + 2
define y (3 * x)
x + y

This evaluates to 4 sub-results, where the two declarations yield zero results, and the additions yield 12 and 40, for a total of 2 values.
Multiple values become multiple arguments
If an argument expression evaluates to multiple values, these become multiple arguments in the function application:

fun2 (3;4) (5;6)

evaluates to the same result as:
fun2 3 4 5 6
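Continuing the sketch with the same hypothetical fun2, an argument block can mix computed values with ordinary arguments:

fun2 (1 + 2; 3 + 4) 10   # fun2 receives three arguments: 3, 7, and 10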
Indentation as blocks
Now let's slightly change a previous example to include an inner local variable definition:

f1 a b
  f2 e f
    define x (g + 10)
    2 * x
  f4 i j

This works nicely if we slightly change the rewrite rules to make use of blocks. Roughly, each newline becomes a semicolon. The result is:

f1 a b (f2 e f (define x (g + 10); 2 * x); f4 i j)

Each block can evaluate to zero, one, or multiple values. The inner block has two sub-expressions: the define
provides 0 values, and the other yields a single number.
Thus f2 is called with 3 arguments: e, f, and the value of 2 * x.
The other block consists of the calls to f2 and f4.
Assuming these each return a single value, as do a and b,
then f1 is called with 4 parameters.
Works well with a REPL
While the read-eval-print-loop will need more tuning, the current behavior is quite reasonable. Typing in commands consisting of one or more space-separated tokens on a line is convenient and familiar from shells. (In fact building a command shell on top of Q2 is a tempting project.)

In an interactive REPL there is an issue of whether to read ahead to check the indentation of the following line, or in general to see if the current command is terminated. Reading the following line before evaluating and printing the current command of course makes for a confusing experience, so the REPL uses the single-line mode of the parser: reading the following lines is avoided unless we're nested inside parentheses or the current command ends with a binary operator.
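As a sketch of what that means in practice at the prompt:

define x 10   # a complete command - evaluated as soon as the line is entered
x +
  2           # the first line ends with a binary operator, so the REPL reads the next line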
Scheme macro and special form compatibility
For now Q2 supports most of the syntax of Scheme tokens, and most Scheme builtins, including special forms. This is not necessarily a long-term goal, but it provides something we can use until Q2-specific replacements have been designed and implemented.
For example, here is a possible definition of the foo1 used above:

define counter 0
define (foo1 #!optional (x 10)) (
  counter := counter + 1
  format #t "foo called x=~s counter=~s!~%~!" x counter)
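A possible usage, matching the zero-or-one-argument calls sketched earlier (the printed text is my reading of the format string above):

foo1      # prints: foo called x=10 counter=1!
foo1 10   # prints: foo called x=10 counter=2!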