Q2 - extensible syntax with indentation
Q2
is the (temporary?) name of a project
where I'm exploring some ideas in programming languages.
At this point it is not a complete or coherent language,
but you might find some ideas of interest.
This article looks at the overall syntactic structure of Q2 programs.
Future articles will look at other issues (declarations and patterns;
logic variables and unification; sequences and loop comprehensions) as they
get implemented.
To try the examples get the SVN development sources
and build them. You can then run Q2 code by passing the --q2
language switch on the kawa
command line, or using the .q2
file extension.
There are some examples in the gnu/q2/testsuite
directory.
Extensible syntax with no reserved identifiers
Traditional Lisp/Scheme syntax has many virtues, primarily a minimal fixed core syntax combined with a very flexible and extensible structural syntax. There are no reserved identifiers, and the pre-defined syntax for definitional forms and control flow is not hard-coded: they can in principle be defined in the language itself, and you can replace and add to them.
Q2 follows these goals: there is a fixed core (lexical) syntax of tokens
and delimiters, which is read by a (Lisp-like) reader.
Name-resolution and high-level syntactic forms are processed
later, at macro-expansion time.
Support infix operators with precedence
Infix operators with multiple precedence levels may not be as elegant as Lisp-style prefix notation, but most people are familiar with the former, and it seems to make for somewhat more readable programs. Thus Q2 supports them. This expression:

b := x + 1 > 3 + y * 2

means the same as:

b := ((x + 1) > (3 + (y * 2)))

Assignment (set!) uses the operator :=, while numerical equality uses == (for now). The plan is to use = for definitional equality - and ultimately unification.
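As a small (hypothetical) sketch combining the operators described so far:

define x 5
x := x + 1    # assignment using := (Scheme set!)
x == 6        # numeric equality test; yields true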
Infix operators appear to conflict with the goal of extensible syntax
with no reserved identifiers, if we allow user-defined operator tokens:
operator tokens are just regular identifiers, which could be re-defined (as, say, a function),
and we allow forward references (at least of functions).
How can we tell whether * is the built-in binary
operator, or whether it gets redefined as a function later in the script?
The solution is for the lexical parser (reader) to just return
a list of tokens. Splitting the token list according to
operator precedence is done later - just like macro-expansion is.
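As a sketch of the problem (the names here are hypothetical): the reader tokenizes both of the following lines without deciding what * means; the later precedence-splitting and name-resolution phases settle that.

define (double y) (y * 2)   # at read time, * is just another token
define (* a b) (a + b)      # a later (and unwise) redefinition of * as a function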
Use juxtaposition for function application
Haskell and ML use juxtaposition for function application, rather than parentheses and commas. For example, to call the atan
function with two arguments x and y:
atan x y
This is compact, easy to type, and easy to read.
It is similar to the syntax of many command languages
including sh.
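When an argument is itself a compound expression, parentheses are still used for grouping. A sketch, where f is some hypothetical one-argument function:

atan (f x) y   # the parenthesized sub-expression is a single argument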
Naming a zero-argument function applies it
The complication is: how do you distinguish a variable
reference from a call to a zero-argument function?
This can get tricky in a language with first-class functions,
since the value of a variable can be a function value.
(This is not an issue in Haskell or other pure lazy functional languages,
since there is no difference between applying a zero-argument function
and accessing a lazily-initialized variable.)
Some languages use a special syntax or operator to force application
of zero-argument functions, but I think that is backwards, and
inconsistent with command languages.
For example, if foo1 is a function that can take
zero or one arguments, then:

foo1      # Called with zero arguments
foo1 10   # Called with one argument
The exact rule for a distinguishing between a variable reference and a zero-argument function application isn't decided yet. For now it just depends on the run-time type of the identifier: If the value of the variable is a function that can take zero arguments, apply the function, otherwise return the value. An alternative rule is to apply the function only if the identifier is lexically bound to a function definition.
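Under the current run-time rule, the difference might look like this (greet is a hypothetical zero-argument function; the behavior may change as the rule is refined):

define x 10
define (greet) (format #t "hello!~%")
x       # the value is not a function, so this is a plain variable reference
greet   # the value is a zero-argument function, so it gets applied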
Flexible token format
Expressions are constructed from lexical tokens.
A token can be (among other things) a name (an identifier),
a literal, or an operator.
Supporting an extensible syntax with infix operators suggests that an operator
should just be an identifier using non-alphanumeric characters.
Extensibility is also helped if programmers can add new kinds of literals.
So it seems like a good idea to be rather flexible and open-ended
in terms of what characters can comprise a token or an identifier.
Many languages only allow alphanumeric characters (with perhaps one or
two extra characters allowed), but
Lisp-family languages traditionally allow a wider class of characters,
including arithmetic symbols like +.
Q2 follows this tradition.
The downside is that more spaces are needed to separate adjacent tokens. That seems a good trade-off - the spaces are probably a good idea for readability, regardless. Furthermore, if we're using spaces to separate function arguments, it seems reasonable to also require spaces to separate infix operators from operands.
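For instance (a sketch of the consequence of treating + as an identifier character):

x + 1   # three tokens: x, +, and 1
x+1     # a single identifier token named x+1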
Use indentation for grouping
Some languages (including Haskell and Python) indicate block structure based on the indentation of the lines of a program. I think this makes sense - it allows for a cleaner layout with fewer redundant delimiters (redundant because good programmers use indentation anyway). It also avoids the common annoyance of noise lines that only contain a closing delimiter.
However, it is more of a challenge to make use of indentation in a
Lisp-like language without a fixed syntax. There are some experiments
where essentially a pre-processor adds parentheses based on indentation,
but using indentation just as syntactic sugar
for parentheses doesn't make for nice-looking programs.
The basic use of indentation in Q2 is simple. For example:

f1 a b c d
  f2 e f
    f3 g
    h
  f4 i j

is equivalent to:

f1 a b c d (f2 e f (f3 g) (h)) (f4 i j)

I.e. there are implicit grouping parentheses around each line and each following line that is indented more.
We'd like to add local variable and function definitions inside an indented block, but first a digression.
Block expressions yield multiple values
Many languages provide blocks, consisting of zero or more
declarations, expressions, and other statements.
In an expression language, the value of the block
is usually the value of the final expression.
In Q2 instead the value of the block is the value of all
the sub-expressions, as a tuple or multiple values. For example:

(10-7; 2+2; 4+1)

The above is a block with 3 sub-expressions, which evaluates to 3 values: 3, 4, and 5. If this is a top-level expression then all 3 values are printed. Note that the parentheses are just for grouping, so are optional if this is the outermost expression.
The semi-colon is a special delimiter that cannot appear in identifiers,
but semantically it is a binary infix operator that concatenates
the values of the operands.
Note that multiple values are not the same as sequences or lists: for one thing multiple values do not nest:
((1;2); (3; 4))

This evaluates to 4 values, not 2.
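In other words, given the flattening just described, the nested form yields the same values as:

(1; 2; 3; 4)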
If we think of declarations, assignments, and other statements
as expressions that return zero results, then it all works out.
This example (which uses the old-style Scheme define for now)
evaluates to a vector with a single element valued 10:

vector (define v 5; v + v)
Newlines are roughly equivalent to semi-colons in that they concatenate multiple values. Here is an example of the top-level of a script, which is a block of 4 sub-expressions:

define x 10
x + 2
define y (3 * x)
x + y

This evaluates to 4 sub-results, where the two declarations yield zero results, and the additions yield 12 and 40, for a total of 2 values.
Multiple values become multiple arguments
If an argument expression evaluates to multiple values, these become multiple arguments in the function application:

fun2 (3;4) (5;6)

evaluates to the same result as:
fun2 3 4 5 6
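Continuing the sketch with the same hypothetical fun2, an argument block can mix computed values with ordinary arguments:

fun2 (1 + 2; 3 + 4) 10   # fun2 receives three arguments: 3, 7, and 10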
Indentation as blocks
Now let's slightly change a previous example to include an inner local variable definition:

f1 a b
  f2 e f
    define x (g + 10)
    2 * x
  f4 i j

This works nicely if we slightly change the rewrite rules to make use of blocks. Roughly, each newline becomes a semicolon. The result is:

f1 a b (f2 e f (define x (g + 10); 2 * x); f4 i j)

Each block can evaluate to zero, one, or multiple values. The inner block has two sub-expressions: the define
provides 0 values, and the other yields a single number.
Thus f2 is called with 3 arguments: e, f, and the value of 2 * x.
The other block consists of the calls to f2 and f4.
Assuming these each return a single value, as do a and b,
then f1 is called with 4 parameters.
Works well with a REPL
While the read-eval-print-loop will need more tuning, the current behavior is quite reasonable. Typing in commands consisting of one or more space-separated tokens on a line is convenient and familiar from shells. (In fact building a command shell on top of Q2 is a tempting project.)

In an interactive REPL there is an issue of whether to read ahead to check the indentation of the following line, or in general to see if the current command is terminated. Reading the following line before evaluating and printing the current command of course makes for a confusing experience, so the REPL uses the single-line mode of the parser: reading the following lines is avoided unless we're nested inside parentheses or the current command ends with a binary operator.
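As a sketch of what that means in practice at the prompt:

define x 10   # a complete command - evaluated as soon as the line is entered
x +
  2           # the first line ends with a binary operator, so the REPL reads the next line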
Scheme macro and special form compatibility
For now Q2 supports most of the syntax of Scheme tokens, and most Scheme builtins, including special forms. This is not necessarily a long-term goal, but it provides something we can use until Q2-specific replacements have been designed and implemented.
For example, here is a possible definition of the foo1 used above:

define counter 0
define (foo1 #!optional (x 10)) (
  counter := counter + 1
  format #t "foo called x=~s counter=~s!~%~!" x counter)
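A possible usage, matching the zero-or-one-argument calls sketched earlier (the printed text is my reading of the format string above):

foo1      # prints: foo called x=10 counter=1!
foo1 10   # prints: foo called x=10 counter=2!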