Shell-style programming in Kawa

A number of programming languages have facilities to allow access to system processes (commands). For example Java has java.lang.Process and java.lang.ProcessBuilder. However, they're not as convenient as as the old Bourne shell, or as elegant in composing commands.

If we ignore syntax, the shell's basic model is that a command is a function that takes an input string (standard input) along with some string-valued command-line arguments, and whose primary result is a string (standard output). The command also has some secondary outputs (including standard error, and the exit code). However, the elegance comes from treating standard output as a string that can be passed to another command, or used in other contexts that require a string (e.g. command substitution). This article presents how Kawa solves this problem.

Process is auto-convertable to string

To run a command like date, you can use the run-process procedure:

#|kawa:2|# (define p1 (run-process "date --utc"))

Equivalently you can use the &`{command} syntactic sugar:

#|kawa:2|# (define p1 &`{date --utc})

But what kind of value is p1? It's an instance of gnu.kawa.functions.LProcess, which is a class that extends java.lang.Process. You can see this if you invoke toString or call the write procedure:

#|kawa:2|# (p1:toString)
gnu.kawa.functions.LProcess@377dca04
#|kawa:3|# (write p1)
gnu.kawa.functions.LProcess@377dca04

An LProcess is automatically converted to a string or a bytevector in a context that requires it. More precisely, an LProcess can be converted to a blob because it implements Lazy<Blob>. This means you can convert to a string (or bytevector):

#|kawa:9|# (define s1 ::string p1)
#|kawa:10|# (write s1)
"Wed Jan  1 01:18:21 UTC 2014\n"
#|kawa:11|# (define b1 ::bytevector p1)
(write b1)
#u8(87 101 100 32 74 97 110 ... 52 10)

However the display procedure prints it in "human" form, as a string:

#|kawa:4|# (display p1)
Wed Jan  1 01:18:21 UTC 2014

This is also the default REPL formatting:

#|kawa:5|# &`{date --utc}
Wed Jan  1 01:18:22 UTC 2014

Command arguments

The general form for run-process is:

(run-process keyword-argument... command)

The command is the process command-line. It can be an array of strings, in which case those are used as the command arguments directly:

(run-process ["ls" "-l"])

The command can also be a single string, which is split (tokenized) into command arguments separated by whitespace. Quotation groups words together just like traditional shells:

(run-process "cmd a\"b 'c\"d k'l m\"n'o")
   ⇒ (run-process ["cmd"   "ab 'cd"   "k'l m\"no"])

Using string templates is more readable as it avoids having to quote quotation marks:

(run-process &{cmd a"b 'c"d k'l m"n'o})

You can also use the abbreviated form:

&`{cmd a"b 'c"d k'l m"n'o}

This syntax is the same as of SRFI-108 named quasi-literals. In general, the following are roughly equivalent (the difference is that the former does smart quoting of embedded expressions, as discussed later):

&`{command}
(run-command &{command})

Similarly, the following are also roughly equivalent:

&`[keyword-argument...]{command}
(run-command keyword-argument... &{command})

A keyword-argument can specify various properties of the process. For example you can specify the working directory of the process:

(run-process directory: "/tmp" "ls -l")

You can use the shell keyword to specify that we want to use the shell to split the string into arguments. For example:

(run-process shell: #t "command line")

is equivalent to:

(run-process ["/bin/sh" "-c" "command line"])

You can can also use the abbreviation &sh:

&sh{rm *.class}

which is equivalent to:

&`{/bin/sh -c "rm *.class"}

In general, the abbreviated syntax:

&sh[args...]{command}

is equivalent to:

&`[shell: #t args...]{command}

Command and variable substitution

Traditional shells allow you to insert the output from a command into the command arguments of another command. For example:

echo The directory is: `pwd`

The equivalent Kawa syntax is:

&`{echo The directory is: &`{pwd}}

This is just a special case of substituting the result from evaluating an expression. The above is a short-hand for:

&`{echo The directory is: &[&`{pwd}]}

In general, the syntax:

...&[expression]...

evaluates the expression, converts the result to a string, and combines it with the literal string. (We'll see the details in the next section.) This general form subsumes command substitution, variable substitution, and arithmetic expansion.

Tokenization of substitution result

Things gets more interesting when considering the interaction between substitution and tokenization. This is not simple string interpolation. For example, if an interpolated value contains a quote character, we want to treat it as a literal quote, rather than a token delimiter. This matches the behavior of traditional shells. There are multiple cases, depending on whether the interpolation result is a string or a vector/list, and depending on whether the interpolation is inside a quotes.

If the value is a string, and we're not inside quotes, then all non-whitespace characters (including quotes) are literal, but whitespace still separates tokens:
```
(define v1 "a b'c ")
&`{cmd x y&[v1]z}  ⟾  (run-process ["cmd" "x" "ya" "b'c" "z"])
```
If the value is a string, and we are inside single quotes, all characters (including whitespace) are literal.
```
&`{cmd 'x y&[v1]z'}  ⟾  (run-process ["cmd" "x ya b'c z"])
```
Double quotes work the same except that newline is an argument separator. This is useful when you have one filename per line, and the filenames may contain spaces, as in the output from find:
```
&`{ls -l "&`{find . -name '*.pdf'}"}
```
If the string ends with one or more newlines, those are ignored. This rule (which also applies in the previous not-inside-quotes case) matches traditional shell behavior.
If the value is a vector or list (of strings), and we're not inside quotes, then each element of the array becomes its own argument, as-is:
```
(define v2 ["a b" "c\"d"])
&`{cmd &[v2]}  ⟾  (run-process ["cmd" "a b" "c\"d"])
```
However, if the enclosed expression is adjacent to non-space non-quote characters, those are prepended to the first element, or appended to the last element, respectively.
```
&`{cmd x&[v2]y}  ⟾  (run-process ["cmd" "xa b" "c\"dy"])
&`{cmd x&[[]]y}  ⟾  (run-process ["cmd" "xy"])
```
This behavior is similar to how shells handle "$@" (or "${name[@]}" for general arrays), though in Kawa you would leave off the quotes.
Note the equivalence:
```
&`{&[array]}  ⟾  (run-process array)
```
If the value is a vector or list (of strings), and we are inside quotes, it is equivalent to interpolating a single string resulting from concatenating the elements separated by a space:
```
&`{cmd "&[v2]"}  ⟾  (run-process ["cmd" "a b c\"d"])
```
This behavior is similar to how shells handle "$*" (or "${name[*]}" for general arrays).
If the value is the result of a call to unescaped-data then it is parsed as if it were literal. For example a quote in the unescaped data may match a quote in the literal:
```
(define vu (unescaped-data "b ' c d '"))
&`{cmd 'a &[vu]z'}  ⟾  (run-process ["cmd" "a b " "c" "d" "z"])
```
If we're using a shell to tokenize the command, then we add quotes or backslashes as needed so that the shell will tokenize as described above:
```
&sh{cmd x y&[v1]z}  ⟾  (run-process ["/bin/sh" "-c" "cmd x y'a' 'b'\\'''c' z'"])
```

Smart tokenization only happens when using the quasi-literal forms such as &`{command}. You can of course use string templates with run-process:

(run-process &{echo The directory is: &`{pwd}})

However, in that case there is no smart tokenization: The template is evaluated to a string, and then the resulting string is tokenized, with no knowledge of where expressions were substituted.

Input/output redirection

You can use various keyword arguments to specify standard input, output, and error streams. For example to lower-case the text in in.txt, writing the result to out.txt, you can do:

&`[in-from: "in.txt" out-to: "out.txt"]{tr A-Z a-z}

or:

(run-process in-from: "in.txt" out-to: "out.txt" "tr A-Z a-z")

These options are supported:

in: value

The value is evaluated, converted to a string (as if using display), and copied to the input file of the process. The following are equivalent:

&`[in: "text\n"]{command}
&`[in: &`{echo "text"}]{command}

You can pipe the output from command1 to the input of command2 as follows:

&`[in: &`{command1}]{command2}

in-from: path

The process reads its input from the specified path, which can be any value coercible to a filepath.

out-to: path

The process writes its output to the specified path.

err-to: path

Similarly for the error stream.

out-append-to: path

err-append-to: path

Similar to out-to and err-to, but append to the file specified by path, instead of replacing it.

in-from: 'pipe

out-to: 'pipe

err-to: 'pipe

Does not set up redirection. Instead, the specified stream is available using the methods getOutputStream, getInputStream, or getErrorStream, respectively, on the resulting Process object, just like Java's ProcessBuilder.Redirect.PIPE.

in-from: 'inherit

out-to: 'inherit

err-to: 'inherit

Inherits the standard input, output, or error stream from the current JVM process.

out-to: port

err-to: port

Redirects the standard output or error of the process to the specified port.

out-to: 'current

err-to: 'current

Same as out-to: (current-output-port), or err-to: (current-error-port), respectively.

in-from: port

in-from: 'current

Re-directs standard input to read from the port (or (current-input-port)). It is unspecified how much is read from the port. (The implementation is to use a thread that reads from the port, and sends it to the process, so it might read to the end of the port, even if the process doesn't read it all.)

err-to: 'out

Redirect the standard error of the process to be merged with the standard output.

The default for the error stream (if neither err-to or err-append-to is specifier) is equivalent to err-to: 'current.

Note: Writing to a port is implemented by copying the output or error stream of the process. This is done in a thread, which means we don't have any guarantees when the copying is finished. A possible approach is to have to process-exit-wait (discussed later) wait for not only the process to finish, but also for these helper threads to finish.

Here documents

A here document is a form a literal string, typically multi-line, and commonly used in shells for the standard input of a process. Kawa's string literals or string quasi-literals can be used for this. For example, this passes the string "line1\nline2\nline3\n" to the standard input of command:

(run-process [in: &{
    &|line1
    &|line2
    &|line3
    }] "command")

The &{...} delimits a string; the &| indicates the preceding indentation is ignored.

Pipe-lines

Writing a multi-stage pipe-line quickly gets ugly:

&`[in: &`[in: "My text\n"]{tr a-z A-Z}]{wc}

Aside: This would be nicer in a language with in-fix operators, assuming &` is treated as a left-associative infix operator, with the input as the operational left operand:
"My text\n" &`{tr a-z A-Z} &`{wc}

The convenience macro pipe-process makes this much nicer:

(pipe-process
  "My text\n"
  &`{tr a-z A-Z}
  &`{wc})

All but the first sub-expressions must be (optionally-sugared) run-process forms. The first sub-expression is an arbitrary expression which becomes the input to the second process expression; which becomes the input to the third process expression; and so on. The result of the pipe-process call is the result of the last sub-expression.

Copying the output of one process to the input of the next is optimized: it uses a copying loop in a separate thread. Thus you can safely pipe long-running processes that produce huge output. This isn't quite as efficient as using an operating system pipe, but is portable and works pretty well.

Setting the process environment

By default the new process inherits the system environment of the current (JVM) process as returned by System.getenv(), but you can override it.

env-name: value

In the process environment, set the "name" to the specified value. For example:

&`[env-CLASSPATH: ".:classes"]{java MyClass}

NAME: value

Same as using the env-name option, but only if the NAME is uppercase (i.e. if uppercasing NAME yields the same string). For example the previous example could be written:

(run-process CLASSPATH: ".:classes" "java MyClass")

environment: env

The env is evaluated and must yield a HashMap. This map is used as the system environment of the process.

Process-based control flow

Traditional shell provides logic control flow operations based on the exit code of a process: 0 is success (true), while non-zero is failure (false). Thus you might see:

if grep Version Makefile >/dev/null
then echo found Version
else echo no Version
fi

One idea to have a process be auto-convertible to a boolean, in addition to be auto-convertible to strings or bytevectors: In a boolean context, we'd wait for the process to finish, and return #t if the exit code is 0, and #f otherwise. This idea may be worth exploring later.

Currently Kawa provides process-exit-wait which waits for a process to exit, and then returns the exit code as an int. The convenience function process-exit-ok? returns true iff process-exit-wait returns 0.

(process-exit-wait (run-process "echo foo")) ⟾ 0

The previous sh example could be written:

(if (process-exit-ok? &`{grep Version Makefile})
  &`{echo found}
  &`{echo not found})

Note that unlike the sh, this ignores the output from the grep (because no-one has asked for it). To match the output from the shell, you can use out-to: 'inherit:

(if (process-exit-ok? &`[out-to: 'inherit]{grep Version Makefile})
  &`{echo found}
  &`{echo not found})

Tags: Scheme kawa

Links: text-and-binary-data

Per Bothner's blog