List Comprehension Patterns

Many programming languages have some variant of list comprehension syntax. For example Haskell:

[ 2 * x | x <- [2,3,5,7,11]]

The part after the | is the scan specifier x <- [2,3,5,7,11], where the scan variable x ranges over the elements of the scan sequence [2,3,5,7,11]. The result expression 2 * x is evaluated for each element of the resulting sequence, using the scan variables from the scan specifier(s).

Kawa's new (and experimental) List Comprehension Patterns separate these two parts by making the scan specifier part of the more general pattern-matching feature. The result expression can be widely separated from the scan specifier, as long is it within the lexical scope of any scan variable it uses. For example they can be in separate REPL commands (the #|kawa:N# is the REPL prompt).

First, some simple variable definions, using the Kawa-specific ! (which is like define extended to patterns):

#|kawa:1|# (! A [2 3 5 7 11])
#|kawa:2|# (! [A0 A1 A2 A3 A4] A)

The first line binds the variable A to the result of evaluating the sequence literal [2 3 5 7 11]. The second illustrates matching the pattern [A0 A1 A2 A3 A4] against the value of A, which causes A0...A4 to be matched against corresponding elements of A:

#|kawa:3|# (+ A1 A4)
14

Finally, here is a list comprehension (or scan) pattern:

#|kawa:4|# (! [a ...] A)

This declares a to be a scan variable that ranges over the elements of the scan sequence A. You can only use a scan variable inside a scan context:

#|kawa:5|# a
/dev/stdin:2:1: using scan variable a while not in scan context

A scan context is an expression that is followed by ellipsis. The expression is evaluated once for each value of the corresponding scan variable(s), and the result sequence is spliced into its context. Here the scan context is the expression (* 2 a), which doubles the corresponding element a:

#|kawa:6|# [(* 2 a) ...]
[4 6 10 14 22]

This use of ellipsis is inspired by the syntax-rules macro transformers of Scheme.

A scan variable can be used multiple times in the same scan context (ellipsis expression):

#|kawa:7|# [a ... a ...]
[2 3 5 7 11 2 3 5 7 11]
#|kawa:8|# [(* a a) ...]
[4 9 25 49 121]

Ellipses expressions are useful not just in sequence literals, but in the argument list of a procedure call, where the resulting sequence is spliced into the argument list. This is especially useful for functions that take a variable number of arguments, because that enables a convenient way to do fold/accumulate/reduce operations. For example:

#|kawa:9|# (+ a ...)
28

because 28 is the result of (+ 2 3 5 7 11).

An elegant way to implement dot product:

(define (dot-product [x ...] [y ...])
  (+ (* x y) ...))

When an ellipse expression references two or more distinct scan variables then they are scanned in parallel. That does not (necessarily) imply muliple threads, but that the first element of the scan result is evaluated using the first element of all the scan sequences, the second element of the result uses the second element of all the scan sequences, and so on.

Sub-patterns in ellipsis patterns

The form before ... in a pattern is itself a (sub-)pattern. For example, the sub-pattern may include a type specifier, which applies to each element:

#|kawa:11|# (define (isum [x::integer ...]) (+ x ...))
#|kawa:12|# (isum [4 5 6])
15
#|kawa:12|# (isum [4 5.1 6])
Argument #1 (null) to 'isum' has wrong type
	at gnu.mapping.CallContext.matchError(CallContext.java:189)
	at atInteractiveLevel-6.isum$check(stdin:11)
	...

(The stack trace line number stdin:11 is that of the isum definition.)

You can nest ellipsis patterns, allowing matching against sequences whose elements are sequences.

#|kawa:31|# (define (fun2 [[x ...] ...] [y ...])
#|.....32|#   [[(+ x y) ...] ...])
#|kawa:33|# (fun2 [[1 2 3] [10 11 12]] [100 200])
[[101 102 103] [210 211 212]]

Note that x is double-nested, while y is singly-nested.

The following does not work at time of writing, but will soon:

(! [[x y] ...] sequence-of-pairs)  ;;Not yet working

Filtering the scan sequence

Often you only want to use some of the elements from the scan sequence. For example SQL's select has a where with a boolean expression; only elements for which the expession is true are included. With scan patterns we have the option to filter the values either where the scan variables are introduced (the pattern), or when they are used.

(The examples in this section are not working - and the design is undecided.)

Kawa has a splice operator @expr that takes a sequence and spreads it out to multiple arguuments. (It is similar to JavaScripts's spread syntax ...expr.) It is useful to allow splice operators in scan expressions, and even more useful to allow them in one or more branches of an if. The following expressions selects the odd elements from a, doubling them, but drops even elements. It does the latter by returning a 0-length splice for each even element:

[(if (odd? a) (* 2 a) @[]) ...]

Some reasonable syntatic sugar would be to allow leaving out the else expression of an if, having it default to a zero-length splice:

[(if (odd? a) (* 2 a)) ...]

APL has an expand operator that repeats elements of an array based on another count array. This functionality can be achived using splices in scan expressions. For example, the following repeats each odd element, and skips even elements:

[(if (odd? a) @[a a] @[]) ...]

Equivalently (parsing issues aside):

[@(if (odd? a) [a a] []) ...]

It is also useful to filter at the pattern level. The following hypothetical syntax would create a scan variable a from those elements of the scan sequence A that satisfy the boolean expression (P a).

(! [a ...] A #!if (P a)) ;; hypothetical

Note the sequence of a values used to evaluate (P a) would be different than the values in scan expressions. That might be an issue.

Outer product

Instead of processing multiple sequences in parallel, sometimes you want the multiplicative combination of elements from all the sequences. This is called Cartesian product (or outer product in APL), or an cross join in database query languages.

This could be handled with a (not implemented) outer-product function:

(outer-product [10 20] [1 2 3]) ⇒
  [[10 1] [10 2] [10 3] [20 1] [20 2] [20 3] [30 1] [30 2] [30 3]]

We could use it like this:

(! [[x y] ...) (outer-product X Y))

The above syntax is a bit awkward. Some syntatic sugar would help. If so, it should include an option for filtering only the desired combinations based on a boolean expression.

Comprehension as looping

Scan expression can form the basis of a looping framework in a programming language. Assume a function ignore that ignores its arguments. Then this is one way to write a for loop, using its scan expression for its side effects:

(let (([i ...] [0 <: 10]))
  (ignore (do-something-with i) ...))

The scan valuable i range of the range expression [0 <: 10]. The scan expression (do-something-with i) is evaluated for each i.

A general looping construct needs a way to exit the loop based on a condition, as in a while loop. Than can be done by having a scan expression return a special #!done value.

This assumes the scan elements are evaluated sequentially, which is a reasonable default. It would be useful to have some way to request that evaluation be done in parallel. This could be implemented by translating the code to use Java 8 streams, though for performance reasons we only want to do that when parallel execution is wanted.

Implementation notes

A scan variable is implemented as either a java.util.List or a (Java) array. Using an array may be more efficient, partly because it easier to avoid boxing; on the other hand it may require more copying.. Scan patterns are expanded by the compiler to a loop which initializes the array. Scan expressions expand to map calls, which are generally inlined to loops.

Numerous optimizations and performance improvements are possible.

Tags: Kawa

Per Bothner's blog