Drafts

Texinfo is a decent system for writing documentation. Its weakness is the “info” file format, which is an obsolete kludge, but replacing it is a multi-pronged task.

This paper attempts to specify a replacement format and associated tooling. I will call this format “hinfo” as it is a replacement for info format using html syntax. (However “hinfo” is not proposed as a file extension; hinfo files should use the .html or .xhtml extensions.)

Problems with info

  • Info is a non-standard format used by no-one else. Hence there is very little tooling.
  • Paragraphs are pre-split into lines, so they cannot adjust to different screen widths.
  • Info requires a monospace font, and so valuable semantic information is lost. Info cannot distinguish @samp and @code. It indicates @var using upper-case. Info-reading programs have limited ability to recover the lost semantic information. Being able to use proportional fonts for descriptive text and monospaced fonts for code and examples makes documentation easier to read.
  • Info documentation looks ugly. Using info as the publicly visible front-end for GNU documentation presents a bare-bones and behind-the-times image for texinfo and GNU generally. This is bad marketing.

It is possible to improve html/DocBook support without deprecating or dropping info format support. However, that has its own problems:

  • Installing both info files and html files wastes disk space.
  • Having two primary formats for documentation is likely to lead to inconsistencies and other problems. What if one package only installs html, another package only installs info, and a third installs both? We would need more complex and brittle installation directory and search path standards.
  • Which format should Emacs info prefer? If you accept what I wrote above, clearly “hinfo” rather than “info”. But in that case, why continue to install info files, or (longer-term) maintain tooling for them?

XHTML as an Info (format) replacement

The obvious replacement for Info is some variant of HTML or XML.

The format should follow the recommendations for Polyglot markup. This means documents are well-formed as both HTML and XML.

If hinfo is well-formed XML then various processing tools (such as XSLT) can be used to analyze or transform the output. For that to be useful, it is a goal that hinfo contain all or most of the semantic information from the texinfo file. Specifically, it should include all the information currently produced by makeinfo --xml or makeinfo --docbook. That can be done with class and other attributes. It is hoped hinfo can make at least makeinfo --xml obsolete.

EPUB uses the epub:type attribute to optionally indicate semantics. Texinfo could do the same.

Note the html currently generated by makeinfo --html is rather poorly structured, and it should be cleaned up. See this thread for discussion on this topic.

The Info UI in a plain browser

It would be useful to have info keyboard shortcuts when reading an hinfo file in a vanilla web browser. Most of the navigation would be trivial to implement using JavaScript. This message discusses a proof of concept.

JavaScript can also add a navigation sidebar, using the makeinfo-generated table-of-contents-file, which is just a nested HTML list. (This is the ToC format used by EPUB3.)

Below I describe a prototype that implements navigation with a “smart” sidebar, and which could be straightforwardly enhanced to support keyboard navigation and searching.

Emacs Info mode

For reading hinfo files in Emacs we want something with the user interface of traditional info mode, but able to read hinfo files. (It does not need to process the JavaScript in an hinfo file, since elisp can be used instead.)

For the html-reading it makes sense to use eww mode. An “hinfo mode” would be a hybrid of the existing info and eww modes, with the file handling and layout mostly using eww mode, while the keymap and user interface would come from info mode.

The standalone Info program

The standalone info program needs to be able to read hinfo files. There are multiple web-browsers that work in a terminal; one of these can be used.

However, it seems to make more sense to just use emacs in terminal (-nw) mode. We might add an option to Emacs to leave off the menubar and other undesired “decoration”.

There is a special case when the terminal emulator is DomTerm, since DomTerm is built on web technologies: In that case we could have DomTerm create an <iframe> and load the html file into it.

Installation of hinfo

There is a standard for installing info files (possibly compressed) in a central location so info mode can find a manual given just its name. The existing standard installs all info files in the same directory.

For hinfo, I think a better structure would be to have a separate directory for each manual. For example:

/usr/share/hinfo
  emacs
    index.html
    Search.html
    ...
  kawa
    index.html
    Tutorial.html
    screenshot-1.png
  ...

Using epub for packaging

To save disk space, it is desirable to compress each manual. The epub format is a standard for electronic books, and it satisfies texinfo’s needs pretty well. An epub file is essentially a zip archive with web pages (xhtml), images, a table of contents, and resources like css styling. There are many epub-reading devices, programs, and browser plugins.

For example the Kawa binary distribution (see final section) ships the texinfo-derived manual in epub format. Kawa includes a --browse-manual option that works by starting a mini-webserver that reads and uncompresses the epub manual, and then displays it in a browser.

Emacs can already process zip archives, including epub files. So it shouldn’t be difficult to enhance info mode to deal with epub files.

Recommendation: Change the preferred output format of makeinfo to be an epub file. “Installing” an hinfo file would involve copying the epub to a system location. For example:

/usr/share/hinfo
  emacs.epub
  kawa.epub
  ...

If someone wants to publish a manual on the web, they can just unzip the epub file into a server directory.

Prototype of a browser interface

I have implemented a JavaScript package that I believe to be a good baseline for a documentation browser. It has some missing features, most notably keyboard navigation and searching, which I will discuss in the next section. In this section I will focus on the design and features of the existing prototype.

You can try it out at http://per.bothner.com/kawa/invoke/. To download the manual, grab http://per.bothner.com/kawa/kawa-manual.epub. If you read the latter in an epub reader you will get the latter’s user interface, rather than the interface discussed here. However, you can unzip the epub file and browse OEBPS/index.html.

The prototype manual is generated from kawa.texi in a rather convoluted manner, using makeinfo --docbook, plus the DocBook XSLT stylesheets, plus some sed script kludges. I hope in the future we can just do makeinfo --epub, as discussed later.

Smart navigation bar

On startup, the code creates a navigation sidebar. It does this by loading the table-of-contents file into an internal frame (<iframe>). What makes it “smart” is that it only displays “interesting” links, rather than displaying the entire table of contents. An “interesting” link is the link to the current node, its ancestors, siblings, and immediate children. As you navigate the document, the sidebar is automatically updated (using JavaScript and CSS), rather than being re-loaded.

The sidebar does include a “Table of contents” link, which takes you to the full table-of-contents.
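
As a rough illustration of the “interesting link” rule, here is a minimal sketch in plain JavaScript. It is not the prototype's code: the ToC node shape ({ id, children }) and the function name interestingIds are assumptions made for this example, and “siblings” is read as the siblings of the current node only.

```javascript
// Hypothetical ToC node shape: { id: string, children: [...] }.
// Returns the set of node ids the sidebar would show for currentId.
function interestingIds(toc, currentId) {
  // Find the path of nodes from the root down to the current node.
  function findPath(node, path) {
    const next = path.concat(node);
    if (node.id === currentId) return next;
    for (const child of node.children || []) {
      const found = findPath(child, next);
      if (found) return found;
    }
    return null;
  }
  const path = findPath(toc, []);
  if (!path) return new Set();
  const result = new Set();
  // The current node and all of its ancestors.
  for (const node of path) result.add(node.id);
  // Immediate children of the current node.
  const current = path[path.length - 1];
  for (const child of current.children || []) result.add(child.id);
  // Siblings: the other children of the current node's parent.
  const parent = path[path.length - 2];
  if (parent) for (const sib of parent.children || []) result.add(sib.id);
  return result;
}

const toc = {
  id: "Top",
  children: [
    { id: "Intro", children: [] },
    { id: "Buffers", children: [
      { id: "Select Buffer", children: [] },
      { id: "List Buffers", children: [{ id: "Deep", children: [] }] },
    ] },
    { id: "Windows", children: [{ id: "Split Window", children: [] }] },
  ],
};
const visible = interestingIds(toc, "Buffers");
```

In a browser, the resulting set would presumably drive a CSS class on each sidebar link, so that navigation only toggles visibility instead of reloading the sidebar.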

Lazy loading of pages

Certain manuals can be very big, so it is desirable to split them into multiple web pages, and only load pages as needed.

However, this complicates (not-yet-implemented) whole-document search. Navigating back and forth may also cause wasteful re-loading of pages, depending on the browser’s caching strategy.

The implemented solution is to load each page into its own <iframe> (internal frame). The “master page” stays loaded as long as the document is being read (i.e. its window is open) and you don’t navigate from it. When it starts up, it creates an empty placeholder <div> element for each page (using the table-of-contents file loaded into the sidebar). When a page is first visited, an <iframe> is created for the page, and added as a child of the placeholder <div>.

A click event handler overrides the default action of internal (same-document) links: it creates the <iframe> if necessary, and then hides all the other pages, using CSS styling.

When a page is first loaded, same-document links are re-written (so “hover” shows the correct URL), while external links get a target="_blank" attribute, so they open in a fresh window/tab.
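
The placeholder scheme described in this section can be sketched without a DOM. The class name LazyPages and the injected createFrame callback are assumptions for illustration; in a browser the callback would create an <iframe> and append it to the placeholder <div>, and “hidden” would be a CSS property.

```javascript
// createFrame is injected so the sketch runs without a DOM.
class LazyPages {
  constructor(pageNames, createFrame) {
    this.createFrame = createFrame;
    // One empty placeholder per page, created up front from the ToC.
    this.placeholders = new Map(
      pageNames.map((name) => [name, { frame: null, hidden: true }]));
  }

  show(name) {
    const slot = this.placeholders.get(name);
    if (!slot) throw new Error("unknown page: " + name);
    // Create the frame only on the first visit.
    if (slot.frame === null) slot.frame = this.createFrame(name);
    // Hide every other page and reveal the requested one.
    for (const [other, s] of this.placeholders) s.hidden = other !== name;
    return slot.frame;
  }
}

let loads = 0;
const pages = new LazyPages(["Buffers", "Windows"], (name) => {
  loads += 1;
  return { src: name + ".xhtml" };
});
pages.show("Buffers");
pages.show("Windows");
pages.show("Buffers"); // already loaded, so no new frame is created
```

Because frames are kept once created, going back to an earlier page costs only a visibility change, whatever the browser's caching strategy.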

Clean fallback when JavaScript or CSS is missing/disabled

The initial “welcome” page is not loaded in an <iframe>, but is directly in the top-level index.html. The main reason for this is to have a clean fall-back if JavaScript is missing or disabled. A secondary reason is to speed up loading of the welcome page.

The initial contents of the welcome page are moved into a <div> element, to make it easy to hide it when navigating to another page.

Works using http or file

The same set of files should be readable either directly using file: URLs, when served by a web server (http: or https:), or through some other mechanism (such as packaged in an archive).

Browsers enforce a “same-origin” security policy, which limits interaction between frames. The Google Chrome browser views file: frames as having different origins, so communicating between frames is restricted. The solution is to use postMessage to communicate between frames.

Clean bookmarkable URLs

When a page is selected, we want to update the browser’s location bar so it contains a URL for that page. However, in general we can only update the hash part of the location, without causing the entire page to be re-loaded.

Thus when navigating to the Buffers page in the Emacs manual, we can’t update emacs/index.html to emacs/Buffers.html. However, we can change the location bar to end with emacs/index.html#Buffers. This is what the prototype does. Such URLs work in external links, because when the document is initially loaded, the prototype checks for a “hash” string. If one is specified, the corresponding page is loaded.
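
A minimal sketch of the hash handling, using the standard URL class. The function names initialPage and urlForPage, and the default page name "index", are assumptions for this example rather than names from the prototype.

```javascript
// Returns the page named by the URL's hash, or defaultPage if there is none.
function initialPage(url, defaultPage = "index") {
  const hash = new URL(url).hash; // includes the leading "#", or is ""
  return hash.length > 1 ? decodeURIComponent(hash.slice(1)) : defaultPage;
}

// Builds the bookmarkable URL for a page without changing the path,
// so the master page itself is never re-loaded.
function urlForPage(url, page) {
  const u = new URL(url);
  u.hash = page;
  return u.href;
}
```

On startup the master page would call something like initialPage(location.href) and show the page it names.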

The browser’s Back button does not yet work, but that can presumably be fixed with the history mechanism.

Newer browsers have a history feature which gives you more flexibility in updating the location bar. Thus you could update it to say (for example) emacs/Buffers.html. However, this is limited by the “same-origin” policy: Updating http: URLs works on both Firefox and Google Chrome; file: URLs work on Firefox; file: does not work on Google Chrome.

It may make sense to use the emacs/Buffers.html style for http: and https:, but use the emacs/index.html#Buffers style otherwise. Especially for existing websites that have established URLs using the emacs/Buffers.html style.

This would require some modest changes to startup code: Now any page can be the initial “master page”, so index.html is not special.
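
The choice between the two URL styles could look roughly like this. The function name locationFor is hypothetical, and the sketch assumes every page file is named after its node with an .html extension.

```javascript
// Picks the URL to show in the location bar for a given page.
function locationFor(masterUrl, page) {
  const u = new URL(masterUrl);
  if (u.protocol === "http:" || u.protocol === "https:") {
    // Replace the last path component with the page's own file name.
    u.pathname = u.pathname.replace(/[^/]*$/, page + ".html");
    u.hash = "";
  } else {
    // For file: and anything else, fall back to the hash style.
    u.hash = page;
  }
  return u.href;
}
```

In a browser the result would be handed to history.replaceState or history.pushState rather than assigned to location, since assignment would re-load the page.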

Use index.html for initial page

It is desirable that the initial pathname ends with index.html because web servers generally have that as the default. This initial page is the same as the “master page” mentioned before. Instead of http://example.com/docs/emacs/start.html use http://example.com/docs/emacs/index.html, which allows you to abbreviate it as http://example.com/docs/emacs. The Buffers page could be accessed as either http://example.com/docs/emacs/index.html#Buffers or http://example.com/docs/emacs/#Buffers (preferred).

Unimplemented features of the browser interface

Keyboard navigation

We should be able to read, navigate and search using the keyboard only. To the extent possible the key bindings (at least the default ones) should match those of the info program or mode.

Note that some commands will need to request a string to be typed by the user. Do not create a popup window for this. Instead create a temporary input field inside an absolutely-positioned <div> on top of the regular content. This <div> can contain an <input> element or just set contenteditable.
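
One possible shape for the key handling is a binding table plus a small dispatcher. The keys below are a few of Info's defaults; the command names, the needsInput flag, and the returned action objects are all assumptions for this sketch.

```javascript
// needsInput marks commands that must first prompt for a string
// in the in-page input field.
const bindings = new Map([
  ["n", { command: "next-node", needsInput: false }],
  ["p", { command: "previous-node", needsInput: false }],
  ["u", { command: "up-node", needsInput: false }],
  ["t", { command: "top-node", needsInput: false }],
  ["l", { command: "history-back", needsInput: false }],
  ["g", { command: "goto-node", needsInput: true }],
  ["m", { command: "menu-item", needsInput: true }],
  ["s", { command: "search", needsInput: true }],
]);

// Decide what to do with a key press. Keys typed while the input field
// is active belong to the field, not to navigation.
function dispatchKey(key, promptActive) {
  if (promptActive) return { action: "edit-input" };
  const binding = bindings.get(key);
  if (!binding) return { action: "ignore" };
  return binding.needsInput
    ? { action: "open-prompt", command: binding.command }
    : { action: "run", command: binding.command };
}
```

A keydown listener on the master page would pass event.key to dispatchKey and act on the result; keeping the table separate makes it easy to let users rebind keys.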

Allow whole-document search

Info mode and the info program allow you to “search for a sequence of characters throughout an entire Info file”.

Using an <iframe> per page, as in the prototype, enables searching the whole document. It works by having the master page load any pages that haven’t yet been loaded, and then sending a message to each page to have it search its own contents. Each page reports the search result to the master page (using another postMessage), which takes action based on those results.
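
The message exchange can be sketched with plain function calls standing in for postMessage. The message shapes ({ type: "search", query } and { type: "search-result", page, offset }) and the function names are assumptions for this example; a real implementation would send them with postMessage and receive them in "message" event handlers.

```javascript
// Each page answers a search request about its own text only.
function pageHandleMessage(page, message) {
  if (message.type !== "search") return null;
  const offset = page.text.indexOf(message.query);
  return { type: "search-result", page: page.name, offset };
}

// The master page asks every page in document order (loading unloaded
// pages would happen here) and returns the first hit, or null.
function searchDocument(pages, query) {
  const replies = pages.map((p) =>
    pageHandleMessage(p, { type: "search", query }));
  const hit = replies.find((r) => r.offset >= 0);
  return hit ? { page: hit.page, offset: hit.offset } : null;
}

const pages = [
  { name: "Intro", text: "Emacs is an editor." },
  { name: "Buffers", text: "A buffer holds text being edited." },
];
const result = searchDocument(pages, "buffer");
```

Since real replies arrive asynchronously, the master page would need to collect them (for example, counting outstanding replies) before deciding which page to show.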

A scrolling interface

A different style option (depending on user preference) could conceptually show all the pages as one big page, allowing scrolling between them. Pages that haven’t been loaded yet would initially be represented by an empty placeholder of approximately correct height. The page is automatically loaded if the placeholder is scrolled into view.

HTML or XHTML

The existing prototype uses files with the xhtml extension for most of the pages, with the exception of index.html. This is because ereaders expect xhtml files, and the DocBook stylesheets generate that.

Using xhtml files is mostly invisible when using the prototype, because the URLs it makes visible look like emacs/index.html#Buffers. However, if we use URLs of the form emacs/Buffers.html then it would be preferable for the actual files to have the html extension. And this would be less confusing in general.

The EPUB3 specification states content files SHOULD use the file extension .xhtml, but note it does not say MUST. What it does require is that the documents must meet the conformance constraints for XML documents and be an [HTML5] document that conforms to the XHTML syntax. With a little care, makeinfo can generate html files that conform to both HTML and XML, and that should be our goal.

Navigation with Back button

This should be implemented using the history mechanism.

Student Projects

The following two projects should be suitable for Google Summer of Code or similar.

Implement --xhtml and --epub output formats

The makeinfo program converts texinfo into a number of different output formats, including html and docbook. I see this project as having these parts:

  1. Create a new output format xhtml, similar to html. However, each output file has the xhtml extension, and must conform to the syntax for the xhtml variant of HTML5. Links should use the id attribute (not <a name="N">).

    This would be a hybrid of the existing html and xml output formats.

  2. Clean up the generated xhtml so it is well-structured, and the logical structure of the xhtml follows the structure of the texinfo.

    We also want xhtml format to preserve all the “interesting” information present in the texinfo source. This is so it can be used for further processing using xml tools such as xslt, and allow flexible styling with css. By preserving “interesting” information we mean whatever information is currently emitted by the existing xml or docbook output formats.

  3. Another new output format possibly called --phtml for “polyglot HTML”. This could be the same as --xhtml format, but using .html file extensions. The important thing is that each file be valid both as XML and HTML.

    At some point in the future, the --html option could be switched to act as --phtml.

  4. Create a new epub output format. This is essentially the same as the xhtml or phtml output format, but all the output files (including any image files) are packaged in an epub archive, which is essentially just a zip archive with a few extra files, such as a table of contents.

    Generating files with the extension .html (and thus using phtml format) has practical benefits, in that one can distribute documentation as epub, and then unzip it to yield html files. (A web server can of course do this on-the-fly.)

The makeinfo program is written in Perl, so this project will require some familiarity with Perl.

Improve JavaScript navigation

The main goal is to re-implement the user-interface features of the terminal-based info program (such as convenient keyboard navigation from a document), but in the context of a web browser displaying xhtml, as generated by the previous project.

The high-level summary:

  1. Start with the above-referenced prototype.
  2. Implement basic (non-search) keyboard navigation, similar to the info program.
  3. Implement search-based navigation, again similar to the info program.
  4. Implement an option for URLs to have the form emacs/Buffers.html, rather than emacs/index.html#Buffers. (This should only be used for http: or https:.)

For more details, see the above discussion.

If the previous project (to create xhtml format) has not been completed, you can use the above-referenced Kawa manual as a test-bed.

This project requires some familiarity with JavaScript, CSS, and DOM, or interest in learning more about these technologies.

Created 26 Jan 2017 18:45 PST. Last edited 26 Jan 2017 18:45 PST. Tags:
Smart string substitution in Kawa

Many tools are controlled by a text-based domain-specific language (DSL). These include SQL, JSON, various XML-based languages, and of course various shell languages. Sometimes you want to invoke tools from a programming language, and so you construct text commands in this DSL. These commands typically have a fixed (literal) template, which is filled in with context-dependent data. This data is commonly strings, which become string literals in the DSL, which means the data has to be quoted/escaped to have the appropriate syntax for the DSL. If you fail to quote, or do it wrong, you risk bugs; if the data comes from an untrusted source, you risk a code injection vulnerability.
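
To make the quoting problem concrete, here is a small sketch using JavaScript's tagged template literals as an analogy; it is not Kawa syntax. The tag name sql and the helper sqlQuote are hypothetical, and the quoting rule (doubling single quotes) is a simplifying assumption. Real code should prefer the database driver's parameterized queries.

```javascript
// Quote a value as an SQL string literal by doubling single quotes.
// (A simplifying assumption for illustration only.)
function sqlQuote(value) {
  return "'" + String(value).replace(/'/g, "''") + "'";
}

// Template tag: literal parts are copied as-is, substituted values
// are quoted, so the template author cannot forget to escape.
function sql(strings, ...values) {
  let out = strings[0];
  values.forEach((value, i) => {
    out += sqlQuote(value) + strings[i + 1];
  });
  return out;
}

const name = "O'Brien";
const query = sql`select * from person where lastname = ${name}`;
```

The point of the design is that the fixed template and the context-dependent data stay distinguishable until the DSL-specific tag combines them.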
Created 29 Nov 2013 20:05 PST. Last edited 30 Dec 2013 15:12 PST. Tags:

This note is a rationale and design discussion of a Kawa feature for more powerful string and other literals. This feature aims to satisfy a number of related needs.

  • Using Kawa string literals for long multi-line strings is awkward. One problem is that the same delimiter (double-quote) is used for both the start and end of the string. This is error-prone and not robust: adding or removing a single character changes the reading of the entire rest of the program. A related problem is that the delimiter needs to be quoted using an escape character, which can get hard to read.

    A common solution is a here document, where distinct multi-character start and end delimiters are used. For example the Unix shell uses << followed by an arbitrary token as the start delimiter, and then the same token as the end delimiter:

    tr a-z A-Z <<END_TEXT
    one two three
    uno dos tres
    END_TEXT
    

    This proposal would use:

    (string-upcase #&[
    one two three
    uno dos tres
    ])
    
  • Commonly one wants to construct a string as a concatenation of literal text with evaluated expressions. Using explicit string concatenation (Scheme string-append or Java's + operator) is verbose and can be error-prone. Using format is an alternative, but it is also a bit verbose, and has the problem that the format specifier in the string is widely separated from the expression. Nicer is to be able to use variable interpolation, as in Unix shells:
    echo "Hello ${name}!"
    

    This proposal uses the syntax:

    #&[Hello &{name}!]
    

    Note that & is used for two different related purposes: Part of the prefix #&[ to mark the entire string, and as an escape character for the variable interpolation. This will be justified shortly.

  • Going one step further, a template processor has many uses. Examples include BRL and JSP, which are both used to generate web pages.

    The simple solution is to allow general Kawa expressions in substitutions:

    #&[Hello &{(string-capitalize name)}!]
    

    You can also leave out the curly braces when the expression is a parenthesized expression:

    #&[Hello &(string-capitalize name)!]
    

    Note that this syntax for unquoted expressions matches that used in Kawa's XML literals.

  • The Scribble system defines a language for writing documents. It is a kind of template processor with embedded expressions in the Racket Scheme dialect. The general Racket syntax for an embedded expression is:
    @cmd[datum ...]{text-body}
    
    Kawa switches the roles of {} and [] to be compatible with XML-literals, and also because {} is more commonly used for anti-quotation. Kawa uses & as the special character, so it becomes:
    &cmd{datum ...}[text-body]
    

    This form translates to:

    (cmd datum ... #&[text-body])
    
  • Camp4 quotation SRFI-10 Example: #&sql[select * from person where firstname != ${ignore}]
Created 28 Aug 2012 17:32 PDT. Last edited 30 Aug 2012 16:12 PDT. Tags:

The cost of allocating and garbage-collecting heap objects is usually justified - but it is none-the-less often non-trivial. Furthermore, the concern about the cost (whether justified or not) causes programmers to write more convoluted code and design more complicated APIs. Some object-oriented languages (most notably C++ and C#) provide for struct types that may be stack-allocated and optimized in various ways. There has long been interest in such a mechanism for Java; here are some thoughts.

Examples

Co-ordinates, rectangles, and affine transforms in graphics programming.

Complex numbers.

Arbitrary-precision integers, as a pair of a 32-bit value plus a pointer to an extensions array. (The latter is non-null only when it is needed.)

Tagged immediate object types (as in unboxed fixnums), by using a pair of an int (or long) combined with a pointer.

A Swing Segment.

Document position and nodes (as in Swing or XML).

Iterators.

Design issues

A value type must be 'final'.

Mutable or immutable?

Equality and identity.

Boxing of structs.

Arrays of structs.

Backwards compatibility: Can we design the feature so it is "optional", in that older or more limited VMs can correctly execute bytecodes that will be efficiently executed on supporting VMs?

Created 6 Dec 2009 13:08 PST. Last edited 20 Jun 2010 23:16 PDT. Tags:

A programming language should have (at least) these two kinds of comments:

  • Comment extends to the end of the line.
  • Comment extends to an end-comment delimiter. Such comments should nest, unlike the Java /* ... */ comments.

An interesting option for nestable comments is for the start delimiter to be #!. The end delimiter could be !#. This allows:

#!/bin/sh
exec kawa --options "$0" "$@"
!#
(define ....)
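
A minimal sketch of how a reader might skip such nested comments, using a depth counter. The function name stripNestedComments is hypothetical, and the sketch deliberately ignores string literals, which a real reader would have to handle.

```javascript
// Removes nested comments delimited by open/close from source.
// Delimiters inside string literals are NOT treated specially here.
function stripNestedComments(source, open = "#!", close = "!#") {
  let depth = 0;
  let out = "";
  let i = 0;
  while (i < source.length) {
    if (source.startsWith(open, i)) {
      depth += 1;                 // entering a (possibly nested) comment
      i += open.length;
    } else if (depth > 0 && source.startsWith(close, i)) {
      depth -= 1;                 // leaving one nesting level
      i += close.length;
    } else {
      if (depth === 0) out += source[i]; // keep only text outside comments
      i += 1;
    }
  }
  if (depth !== 0) throw new Error("unterminated comment");
  return out;
}
```

With this rule the script header in the example above is simply one comment, so the same file is both a valid shell script and a valid program.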
Created 14 Mar 2007 17:36 PDT. Last edited 13 Mar 2009 20:11 PDT. Tags:

A pattern can be matched against a value. If it matches, one or more variables may be bound to some part of the matched value.

Patterns can be used in various declaration contexts, including variable declarations, parameter declarations, and cases of a switch expression.

Abstract pattern grammar

Here is a classification of patterns we should support. The concrete syntax is not fixed.

Variables

The simplest pattern is a variable. This declares that variable, and it is bound to the value being matched against. Question: It may make sense to use a special syntactic marker to indicate a variable being declared, as opposed to being used.

Type specification

pattern!type

Conjunction

pattern1&pattern2...
This matches pattern1 against the target, possibly binding some variables. Then pattern2 is matched against the same target. The pattern2 may use variables bound in pattern1. Commonly, pattern2 will be a predicate or a type-specifier. In fact, perhaps having a special syntax for conjunction may not be useful, since it can be expressed using a predicate.

Predicate

{boolean-expression}
This matches if the boolean-expression evaluates to true. Typically, boolean-expression may contain variables declared previously in a conjunction. In fact, we could combine the syntaxes:
{pattern|boolean-expression}

Constructor

constructor-name(pattern1, pattern2, ...)
Created 14 Mar 2007 17:36 PDT. Last edited 12 Aug 2010 14:41 PDT. Tags:

Up: Kawa

Running commands

(run command arg ...)
The command is an executable program or script, and each arg is a command line argument. (For now leave it open whether these are evaluated or quoted.)

The standard output of the command is effectively redirected to a temporary file, and the contents of this file, viewed as a string or text object, becomes the result of the run expression. If the output consumer for the run is an output port then the command's standard output is re-directed to that port. In the initial case, the output consumer is the standard output stream of the containing JVM, so no redirection is needed.

The standard error output of the command is piped to the current error port. If the error port matches the initial error port, no re-direction is needed.

The standard input of the command is connected to the current input port of the dynamic context.

Discussion: An alternative is to define run so the output from the command is written to the current output port. One could then re-ify the output from a command with some kind of a with-output-to-string macro.

File name expansion

(glob regexp)

Return a set of Path values that match the regexp, as multiple values. These can be interpolated in a run argument list.

Created 3 Feb 2007 11:49 PST. Last edited 13 Mar 2009 20:10 PDT. Tags:
Extending Qexo/Kawa for updates

A number of people are interested in extending XQuery for updates. Here are some useful notes.

Updating means at least two different things: Modifying an in-core node object, and modifying a node in a persistent XML database. They're very different. Let's start with the former.

Qexo's node model

You might want to read the gnu.lists package descriptor for an overview of the concepts of Kawa's sequence and node objects. A node, in the XML sense, is represented as a pair of an AbstractSequence and an index. The index (a position value) is just a unique number managed by the AbstractSequence. There are a number of implementation classes that extend AbstractSequence, and use different ways of managing position indexes. The one used for XML nodes is a NodeTree, which is an extension of TreeList. The nodes of a document or document fragment are all in a single NodeTree; each node is identified by a position index, which is basically an index in the TreeList's data array, but with the lower-order bit used as a special flag. (See the above-mentioned descriptor in gnu.lists.) When we need to create an object for a node, we use a KNode object. The idea is that most nodes aren't actively referenced, so we don't need an actual KNode object, which saves a lot of space.

Updating nodes in-place

To implement updating a node object in-memory we need to finish the update/insert/delete abilities in gnu.lists.TreeList. The latter class is basically a gap-vector (as used in emacs and Swing), but the data structures are more complicated because it stores a hierarchy, rather than just characters. Once we can update the TreeList, we will need an extra level of indirection. The reason is that node identity is tied to the position indexes, but editing a NodeTree causes the nodes in it to move around. The solution is to use either StableVector or something similar. Unfortunately, StableVector doesn't currently support TreeList. Perhaps TreeList should be changed to extend GapVector.

A more abstract way to think of it: A Node needs to be a pair of a NodeManager and an index that is managed by the NodeManager. The actual underlying storage is in a TreeList, but since indexes in a TreeList change on updates, the actual Node indexes are indexes in the NodeManager. Each time you read a property of a node, you use the node's index, which is an index in the NodeManager. You use that index in the NodeManager position array, which gives us an index in the TreeList, and get the value from the latter. To update a node, we have to similarly dereference the index in the NodeManager to get an index in the TreeList's data array, and update the latter. That may require things to move around in the TreeList, so the indexes in the NodeManager have to be updated.

Moving nodes from one document or fragment to another is tricky. The reason is that node indexes are relative to a TreeList. One solution is to use forwarding pointers. Another is a NodeManager that can handle multiple TreeLists.
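
The indirection can be illustrated with a toy model. This is written in JavaScript rather than Java, and it is not the gnu.lists API: a flat array stands in for the TreeList, and the method names (nodeAt, get, set, insert) are assumptions made for this sketch.

```javascript
class NodeManager {
  constructor(data) {
    this.data = data.slice(); // underlying storage (stands in for a TreeList)
    this.positions = [];      // stable node index -> current storage index
  }

  // Hand out a stable node index for the item currently at storageIndex.
  nodeAt(storageIndex) {
    this.positions.push(storageIndex);
    return this.positions.length - 1;
  }

  // Reading dereferences the stable index through the position array.
  get(node) {
    return this.data[this.positions[node]];
  }

  set(node, value) {
    this.data[this.positions[node]] = value;
  }

  // Inserting moves storage around, so every position at or after
  // the insertion point has to be shifted.
  insert(storageIndex, value) {
    this.data.splice(storageIndex, 0, value);
    this.positions = this.positions.map((p) =>
      (p >= storageIndex ? p + 1 : p));
  }
}

const m = new NodeManager(["a", "b", "c"]);
const nodeB = m.nodeAt(1);
m.insert(0, "z"); // storage is now z, a, b, c; nodeB still names "b"
```

The cost of the scheme is visible in insert: each edit touches the position array, which is the price of keeping node identity stable.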

Updating XML databases

Updating XML files or a database is more complicated. One approach is reading an XML document, updating nodes in-memory, and writing out the modified document. That is practical for modest-sized XML documents, but expensive for small changes to large documents. Another issue is that it is difficult (but not impossible) to maintain node identity between the original document and the updated version, even for nodes that are unmodified.

Ideally, one would like to modify individual nodes in-place in the database. This is doable in the Kawa node model. The basic idea is to create an AbstractSequence sub-class, which we might call DatabaseDocument. The DatabaseDocument would be a proxy for either the entire database, or an individual xml document. Each node has a database key. The DatabaseDocument object manages the mapping between position indexes and database keys.

Note there are portions of the Qexo run-time that assume nodes are implemented using NodeTree. They would have to be fixed to support general AbstractSequences.

Of course once one is updating a database we also have to deal with transactions and related ACID issues.

Created 3 Feb 2007 10:26 PST. Last edited 13 Mar 2009 20:10 PDT. Tags:

Use the standard \ to escape special characters, in both string literals, and outside. In general (outside string literals) a \ followed by a non-letter character makes that character be treated as a letter. E.g. \1\+2 is a 3-character identifier consisting of the characters 1, +, and 2, even if the language normally doesn't allow identifiers to start with digits or to contain +.

Letters don't need to be escaped, in either identifiers or names. So we're free to use \ followed by a letter for other purposes, including the standard C string escapes. I suggest at least the following:

\xNNNN - A Unicode escape. Terminated by the first character that is neither a digit nor a letter. If that character is a space, it is ignored. Only a single space is ignored.

\n - A newline.

...
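
A rough sketch of these rules as a decoder, written in JavaScript for illustration. The function name unescapeText is hypothetical, and several details are assumptions: letters are taken to be ASCII only, the collected digits and letters of a \x escape must all be hexadecimal, and any letter escape other than \n and \x is rejected because the list above is left open.

```javascript
function unescapeText(text) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const ch = text[i];
    if (ch !== "\\") { out += ch; i += 1; continue; }
    const next = text[i + 1];
    if (next === undefined) throw new Error("dangling backslash");
    if (next === "n") { out += "\n"; i += 2; continue; }
    if (next === "x") {
      // Collect digits and letters up to the first other character.
      let j = i + 2;
      while (j < text.length && /[0-9A-Za-z]/.test(text[j])) j += 1;
      const digits = text.slice(i + 2, j);
      if (!/^[0-9A-Fa-f]+$/.test(digits)) throw new Error("bad \\x escape");
      out += String.fromCodePoint(parseInt(digits, 16));
      // A single terminating space is consumed; anything else is kept.
      i = text[j] === " " ? j + 1 : j;
      continue;
    }
    if (/[A-Za-z]/.test(next)) throw new Error("unknown escape \\" + next);
    // Backslash before a non-letter: keep the character itself.
    out += next;
    i += 2;
  }
  return out;
}
```

The single-space rule is what lets an escape be followed directly by a letter or digit: the space ends the escape and then disappears.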

The string form of regular expressions should be compatible with this convention.

Created 3 Feb 2007 10:21 PST. Last edited 13 Mar 2009 20:11 PDT. Tags: