Translating Smalltalk to Java

November 1996

There are many Smalltalk applications that might benefit from being ported to Java. There are various approaches of interest, including automatic and manual translation from Smalltalk to Java, or hosting Smalltalk on top of Java.

The Smalltalk language itself is fairly simple, but Smalltalk includes an extensive set of standard classes. Converting a Smalltalk program includes providing an environment that contains converted versions of the standard Smalltalk classes. Among these classes are a number of "system" classes that interact with the implementation, such as Class, Behavior, CompiledMethod, and so on. So one big question is to what extent we want these to behave as in a Smalltalk environment, or as in a Java environment.

A Smalltalk "virtual machine" is typically implemented as a collection of "system" Smalltalk classes, plus various primitive methods implemented in a lower-level language. Smalltalk source for the system classes is freely available from various places, such as GNU Smalltalk. Since we must be able to compile Smalltalk user code, it is reasonable to implement our virtual machine by just compiling an existing set of system classes, and then writing the primitive methods in Java. Such an approach provides very good Smalltalk support, but may provide less integration of Java objects with Smalltalk objects.

Objects and Classes

Both Java and Smalltalk have a class hierarchy where everything inherits from Object. Unfortunately, the two Object classes are different, since they respond to different message protocols. Furthermore, other similar classes (such as Smalltalk's Boolean vs. Java's java.lang.Boolean) are also annoyingly incompatible. Thus for easiest automatic translation of Smalltalk system classes into Java, these Smalltalk classes must be distinct from similar Java classes. (Here I am talking about the first stage of the project. Later I will discuss refinements.) Thus Smalltalk class Foo will be translated into Java class ST.Foo. (This means class Foo in package ST.) The Smalltalk Object class is implemented using the Java class ST.Object. There is no relationship between java.lang.Object and ST.Object, except that ST.Object inherits from java.lang.Object. If a Smalltalk class Foo has fields a and b, and is a sub-class of Bar, the result is:

package ST;
public class Foo extends ST.Bar {
  public ST.Object a;
  public ST.Object b;
};

Note: I assume it is not necessary to support the become: primitive. Doing so would require either an extra level of indirection (which would make the result less efficient and less Java-like), or a very Java-implementation-specific native method implemented in C.
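
To make the cost of that extra level of indirection concrete, here is a minimal sketch. The names Handle, Body, and become are hypothetical and are not part of the proposed translation; the sketch only assumes that every Smalltalk reference would point at a handle, with the object's state held in a separate body.

```java
class BecomeDemo {
    public static void main(String[] args) {
        Handle a = new Handle(new Body("first"));
        Handle b = new Handle(new Body("second"));
        Handle alias = a;                  // a second reference to the same object
        a.become(b);
        System.out.println(alias.description());  // the alias observes the swap
        System.out.println(b.description());
    }
}

// Hypothetical: holds the real state of a Smalltalk object.
class Body {
    final String description;              // stands in for the object's fields
    Body(String description) { this.description = description; }
}

// Hypothetical: what every Smalltalk reference would actually point at.
class Handle {
    private Body body;
    Handle(Body body) { this.body = body; }

    // Every field access pays for this extra dereference.
    String description() { return body.description; }

    // Swap the two bodies, so every existing reference to either handle
    // sees the other object's state from now on.
    void become(Handle other) {
        Body tmp = this.body;
        this.body = other.body;
        other.body = tmp;
    }
}
```

The swap itself is cheap; the price is that every field access in the translated code goes through a handle, which is why I would rather not support become: at all.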

Methods

It is natural to implement a Smalltalk method by translating it to a Java method. Translating the body of a method is a simple matter of translating each sub-expression in turn, and returning the result of the last one. The tricky part is translating a method send, and figuring out how to organize the methods in classes. Translating Smalltalk methods has two major complications:

Smalltalk uses completely dynamic dispatch, using the method selector (a Symbol) to select the correct method. In Java, you can only call a method that is defined in the static type of the receiver; there is no such requirement in Smalltalk. Thus a message send involves searching for the correct method by name.
Smalltalk allows methods to be dynamically replaced and added to a class. This provides for incremental development, and a fast edit-compile-debug cycle. Replacing a method is seldom done for mature code, and so is probably not important for the sake of translating existing Smalltalk code. However, supporting it makes it easier to translate an existing VM.

Using separate method classes

The natural way to organize the methods of class Foo is that the instance methods of Foo become regular methods of ST.Foo, and the class methods of Foo become the static methods of ST.Foo.

Unfortunately, this does not work if you want to be able to dynamically change, add, or delete methods belonging to an existing class. In that case you have to separate out the data structures that implement the methods from the data structures that implement objects and classes. The natural way to do that is to have a separate class for each method. This class would have an internal name generated to be unique, and (in the case of a re-definition) would be loaded using a Java ClassLoader.

One scheme would use a standard CompiledMethod class:

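// Must be declared abstract, since doitN has no body here.
abstract class CompiledMethod extends ST.Object {
  ST.Symbol name;   // the selector
  ST.Class clas;    // the class the method belongs to
  abstract public ST.Object doitN (ST.Object receiver, ST.Object[] args);
};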

and each method would be compiled into a sub-class of CompiledMethod.

class Foo_foo extends CompiledMethod {
  public static ST.Object foo (ST.Foo receiver, ST.Object arg1)
  {
    ... compiled code for method foo ...
  }
  // Generic entry point: unpack the arguments and call the typed method.
  public ST.Object doitN (ST.Object receiver, ST.Object[] args)
  {
    return foo ((ST.Foo) receiver, args[0]);
  }
}
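
To show why separating methods from classes makes re-definition possible, here is a sketch of the lookup side. It assumes a per-class method dictionary keyed by selector, which the scheme above implies but does not spell out. The names StObject, StClass, install, and lookup are hypothetical stand-ins (the real classes would live in package ST), CompiledMethod is reduced to an interface so that lambdas can stand in for the generated sub-classes, and the code uses collection and lambda features from later Java versions.

```java
import java.util.HashMap;
import java.util.Map;

class MethodTableDemo {
    public static void main(String[] args) {
        StClass bar = new StClass("Bar", null);
        StClass foo = new StClass("Foo", bar);
        bar.install("describe", (receiver, a) -> new StObject(receiver.stClass, "Bar>>describe"));

        StObject obj = new StObject(foo, "obj");
        System.out.println(obj.send("describe").label);   // inherited from Bar

        // Re-defining a method only replaces a dictionary entry.
        foo.install("describe", (receiver, a) -> new StObject(receiver.stClass, "Foo>>describe"));
        System.out.println(obj.send("describe").label);   // now found in Foo
    }
}

// Simplified stand-in for the CompiledMethod class in the text.
interface CompiledMethod {
    StObject doitN(StObject receiver, StObject[] args);
}

// Hypothetical class object holding a replaceable method dictionary.
class StClass {
    final String name;
    final StClass superclass;              // null for the root class
    private final Map<String, CompiledMethod> methods = new HashMap<>();

    StClass(String name, StClass superclass) {
        this.name = name;
        this.superclass = superclass;
    }

    void install(String selector, CompiledMethod method) {
        methods.put(selector, method);
    }

    // Walk up the superclass chain until some class defines the selector.
    CompiledMethod lookup(String selector) {
        for (StClass c = this; c != null; c = c.superclass) {
            CompiledMethod m = c.methods.get(selector);
            if (m != null) return m;
        }
        return null;
    }
}

class StObject {
    final StClass stClass;
    final String label;                    // only here to make results visible

    StObject(StClass stClass, String label) {
        this.stClass = stClass;
        this.label = label;
    }

    StObject send(String selector, StObject... args) {
        CompiledMethod m = stClass.lookup(selector);
        // A real system would send doesNotUnderstand: here instead.
        if (m == null) throw new IllegalStateException("doesNotUnderstand: " + selector);
        return m.doitN(this, args);
    }
}
```

Nothing in StObject or StClass has to be recompiled when a method changes, which is the property the single-class organization cannot offer.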

Method selection using interfaces

One tempting method of handling dynamic dispatch uses Java interface classes, one for each selector. This works best with a static code base, where you can pre-compile the set of all selectors. For example the selector atAll:put: is associated with the interface:

package StSelector;
public interface Has_atAll_put {
  // Java requires parameter names; these two are placeholders.
  public ST.Object atAll_put (Object indexes, Object value);
};

Then any class that provides a method with the selector atAll:put: would implement the interface Has_atAll_put. The translation of:

R atAll: C put: O

is then straight-forward:

((StSelector.Has_atAll_put)R).atAll_put (C, O)
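
A self-contained version of this translation might look like the following sketch. StObject, StInteger, and StArray are hypothetical stand-ins for classes in package ST, the interface is declared at top level rather than in package StSelector, and the body of atAll_put (including its return value) is my own guess at reasonable behaviour, not code from an existing system.

```java
class InterfaceSendDemo {
    public static void main(String[] args) {
        StObject r = new StArray(new StInteger(0), new StInteger(0), new StInteger(0));
        StObject c = new StArray(new StInteger(1), new StInteger(3));
        StObject o = new StInteger(7);

        // Translation of:  R atAll: C put: O
        ((Has_atAll_put) r).atAll_put(c, o);

        for (StObject slot : ((StArray) r).slots) {
            System.out.println(((StInteger) slot).value);
        }
    }
}

class StObject { }

class StInteger extends StObject {
    final int value;
    StInteger(int value) { this.value = value; }
}

// One interface per selector, as described in the text.
interface Has_atAll_put {
    StObject atAll_put(StObject indexes, StObject value);
}

// Any class that understands atAll:put: implements the interface.
class StArray extends StObject implements Has_atAll_put {
    final StObject[] slots;
    StArray(StObject... slots) { this.slots = slots; }

    // Store value at each (1-based) index listed in indexes.
    public StObject atAll_put(StObject indexes, StObject value) {
        for (StObject index : ((StArray) indexes).slots) {
            slots[((StInteger) index).value - 1] = value;
        }
        return this;   // return value chosen for this sketch only
    }
}
```

The call site contains no lookup code at all; the cast and the interface call are everything the compiler has to emit.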

There are at least four problems with this approach:

This scheme is not appropriate if you want to be able to add or replace methods in a class. In general, code that manipulates meta-level data structures (such as ClassDescriptor or CompiledMethod) will not be well served.

This scheme works best if we do not need to support doesNotUnderstand. If the receiver does not support a method, a casting exception is thrown. The exception could be caught, and converted to a send of doesNotUnderstand, but that would bloat the generated code substantially.

Managing the interface classes in StSelector might be a hassle. Unless you have a fixed code-base (and can pre-generate the complete set of selectors), you have to generate the interface classes as you need them, being careful to not generate a duplicate.

This scheme requires an interface class for every selector, and requires that each class implement an interface class for every selector it understands. This may be quite a strain on some Java implementations, since they may not be optimized for this usage pattern. We can reduce the problem slightly using special handling for special selectors. For example, we know that the selectors defined in Object do not need an interface class, since such a method is defined in ST.Object. Also, if the compiler can do global analysis on the entire code base, and can deduce that a selector such as atAll:put: is only understood by classes that are sub-classes of Foo, then it can dispense with the interface class, and do a cast to ST.Foo instead:

((ST.Foo)R).atAll_put (C, O)

Initial implementation is complicated because we have to re-write many of the low-level implementation classes if we want to start with existing system classes.
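
The code bloat from supporting doesNotUnderstand, mentioned in the second problem above, can be seen in a sketch of a single guarded send. All names here (StObject, StString, StInteger, Has_size, sendSize) are hypothetical, and doesNotUnderstand is simplified to take the selector as a String rather than a Message object. The sketch guards only the cast, so that a casting exception raised inside the called method is not mistaken for a missing method.

```java
class DnuDemo {
    // Translation of "r size" when doesNotUnderstand must be supported.
    static StObject sendSize(StObject r) {
        Has_size target;
        try {
            target = (Has_size) r;          // fails if r lacks the selector
        } catch (ClassCastException e) {
            return r.doesNotUnderstand("size");
        }
        return target.size();
    }

    public static void main(String[] args) {
        StObject understood = sendSize(new StString("abc"));
        StObject notUnderstood = sendSize(new StObject());
        System.out.println(((StInteger) understood).value);
        System.out.println(((StString) notUnderstood).text);
    }
}

interface Has_size {
    StObject size();
}

class StObject {
    // Default reaction to an unknown selector; Smalltalk code may override it.
    StObject doesNotUnderstand(String selector) {
        return new StString("doesNotUnderstand: " + selector);
    }
}

class StInteger extends StObject {
    final int value;
    StInteger(int value) { this.value = value; }
}

class StString extends StObject implements Has_size {
    final String text;
    StString(String text) { this.text = text; }
    public StObject size() { return new StInteger(text.length()); }
}
```

Compare the seven lines of sendSize with the single cast-and-call expression needed when doesNotUnderstand is ignored; that difference is repeated at every send.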

Advantages of this scheme:

Does not need to do explicit method searching or caching. This is handled by the Java VM, which can probably do it faster and simpler than we can. (This may not be the case if the Java VM uses poor algorithms to implement interfaces.)

The resulting code is simple, compact, and Java-like. Thus it works well if you want to migrate your code base from Smalltalk to Java, using Java idioms.

Can use type annotations, declarations, or type inference to generate better code.

Method selection by comparing Symbols

One simple way to implement method searching is with a simple virtual "send" method in ST.Object:

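package ST;
public class Object {
  ...
  // Not abstract: sub-classes fall back to this with super.sendN, so it
  // is reached only when no class in the chain handled the selector.
  public ST.Object sendN(ST.Object receiver, String selector,
                         ST.Object[] args)
  {
    ... report that the message was not understood ...
  }
};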

Then for a class Foo that has methods foo and bar:

package ST;
public class Foo extends ST.Object {
  public ST.Object foo(Object arg) { ... }
  public ST.Object bar(Object arg) { ... }
  public ST.Object sendN(ST.Object receiver, String selector,
                         ST.Object[] args)
  {
    // We assume all selector Strings have been intern'd.
    if (selector == "foo")   return foo (args[0]);
    else if (selector == "bar")   return bar (args[0]);
    else return super.sendN (receiver, selector, args);
  }
};

This is simple. The problem is that it is difficult to make it efficient. There is no easy way to do method caching, since methods are not objects. Some optimizations are possible - for example we could use separate send methods for different argument lengths: send0 would be used for messages with no arguments; send1 would be used for messages with one argument; send2 would be used for messages with two arguments; and sendN would be used for messages with three or more arguments.
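
The arity-specific send methods could be sketched as follows. StObject, StInt, and StCounter are hypothetical stand-ins for classes in package ST, the receiver parameter is dropped because it is always the object being sent to, and doesNotUnderstand is simplified to throw an exception. As in the listing above, selectors are assumed to be intern'd Strings, which holds here because they are all literals.

```java
class AritySendDemo {
    public static void main(String[] args) {
        StObject counter = new StCounter();
        counter.send0("increment");              // no argument array is allocated
        counter.send1("add:", new StInt(5));
        System.out.println(((StInt) counter.send0("value")).value);
    }
}

class StObject {
    // Base cases: reached only when no sub-class handled the selector.
    StObject send0(String selector) { return doesNotUnderstand(selector); }
    StObject send1(String selector, StObject a) { return doesNotUnderstand(selector); }
    StObject send2(String selector, StObject a, StObject b) { return doesNotUnderstand(selector); }
    StObject sendN(String selector, StObject[] args) { return doesNotUnderstand(selector); }

    StObject doesNotUnderstand(String selector) {
        throw new IllegalArgumentException("doesNotUnderstand: " + selector);
    }
}

class StInt extends StObject {
    final int value;
    StInt(int value) { this.value = value; }
}

class StCounter extends StObject {
    private int count;

    // Only the zero-argument selectors are compared here.
    StObject send0(String selector) {
        if (selector == "increment") { count++; return this; }
        else if (selector == "value") return new StInt(count);
        else return super.send0(selector);
    }

    // Only the one-argument selectors are compared here.
    StObject send1(String selector, StObject a) {
        if (selector == "add:") { count += ((StInt) a).value; return this; }
        else return super.send1(selector, a);
    }
}
```

Besides avoiding the array allocation, splitting by argument count shortens each chain of comparisons, since a selector is only compared against others of the same arity.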

Method selection by comparing ints

If we can compare ints rather than Symbols, we can use efficient switch statements. The idea is that we number each method in a class, starting at the number of methods in the chain of super-classes. This index uniquely identifies each method in a class. Since we use consecutive indexes, selecting the right method can use an efficient switch statement. Thus if class T has methods foo and bar, and inherits from class S, and there are a total of 20 methods in S and its super-classes:

package ST;
public class T extends S {
  public ST.Object foo(Object arg) { ... }
  public ST.Object bar(Object arg) { ... }
  public ST.Object sendN(ST.Object receiver, int selector,
                         ST.Object[] args)
  {
    switch (selector)
    {
      case 20: return foo (args[0]);
      case 21: return bar (args[0]);
      default:
        // Note that super.sendN could be inlined into this switch.
        return super.sendN (receiver, selector, args);
    }
  }
};

Of course we will still need to map selector Symbols to ints, but we can cache the result.
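
One possible way to do that caching is sketched below; it is an assumption of mine, not something the scheme prescribes. Each class gets a table from selector name to method index, and each call site remembers the index it found for the last receiver class it saw. The names CallSite, indexOf, and slowLookups are hypothetical, StObject and T stand in for classes in package ST, and the root class here has just one method so that T starts numbering at 1.

```java
import java.util.HashMap;
import java.util.Map;

class IntSelectorDemo {
    public static void main(String[] args) {
        CallSite site = new CallSite("bar");
        StObject t = new T();
        System.out.println(((StText) site.send(t)).text);
        System.out.println(((StText) site.send(t)).text);   // reuses the cached index
        System.out.println(site.slowLookups);
    }
}

class StObject {
    private static final Map<String, Integer> INDEXES = new HashMap<>();
    static { INDEXES.put("yourself", 0); }

    // Slow path: map a selector name to a method index, or -1 if unknown.
    int indexOf(String selector) {
        Integer i = INDEXES.get(selector);
        return i == null ? -1 : i;
    }

    StObject sendN(int selector, StObject[] args) {
        switch (selector) {
            case 0: return this;                             // yourself
            default: throw new IllegalArgumentException("doesNotUnderstand");
        }
    }
}

class StText extends StObject {
    final String text;
    StText(String text) { this.text = text; }
}

class T extends StObject {
    private static final Map<String, Integer> INDEXES = new HashMap<>();
    static { INDEXES.put("foo", 1); INDEXES.put("bar", 2); }

    int indexOf(String selector) {
        Integer i = INDEXES.get(selector);
        return i != null ? i : super.indexOf(selector);
    }

    StObject sendN(int selector, StObject[] args) {
        switch (selector) {
            case 1: return new StText("T>>foo");
            case 2: return new StText("T>>bar");
            default: return super.sendN(selector, args);
        }
    }
}

// Hypothetical per-call-site cache of the last (class, index) pair.
class CallSite {
    private final String selector;
    private Class<?> cachedClass;
    private int cachedIndex;
    int slowLookups;                 // counts cache misses, for illustration

    CallSite(String selector) { this.selector = selector; }

    StObject send(StObject receiver, StObject... args) {
        if (receiver.getClass() != cachedClass) {
            cachedIndex = receiver.indexOf(selector);
            cachedClass = receiver.getClass();
            slowLookups++;
        }
        return receiver.sendN(cachedIndex, args);
    }
}
```

The cache has to be keyed by receiver class because, with this numbering, the same selector can have different indexes in unrelated classes.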

Using the JDK 1.1 meta-object features

JDK 1.1 will provide a class Method (java.lang.reflect.Method) that represents a Java method. A Method object is a first-class Object; it can be extracted from a class object, it can be cached, and it can be invoked. The basic idea is that a CompiledMethod object would contain a Java Method pointer, which would allow us to invoke it without any searching (once we have a CompiledMethod pointer). Of course we still need to find the correct CompiledMethod object, but we can use standard techniques and code from existing Smalltalk system classes. One problem is that we do not know when Java implementations that support this protocol will be generally available. Another problem is that this feature goes "beyond" the core of Java, and need not be very efficient.
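
A CompiledMethod that wraps a Method object might be sketched like this. Class.getMethod and Method.invoke are the actual reflection calls; everything else (StObject, StFoo, the forJavaMethod helper, and the way failures are wrapped) is hypothetical, and the exception handling relies on ReflectiveOperationException from later Java versions.

```java
import java.lang.reflect.Method;

class ReflectiveSendDemo {
    public static void main(String[] args) {
        CompiledMethod m =
            CompiledMethod.forJavaMethod("foo:", StFoo.class, "foo", StObject.class);
        StObject arg = new StObject();
        StObject result = m.doitN(new StFoo(), new StObject[] { arg });
        System.out.println(result == arg);
    }
}

class StObject { }

class StFoo extends StObject {
    // Translation of a Smalltalk method "foo: arg" that answers its argument.
    public StObject foo(StObject arg) { return arg; }
}

class CompiledMethod extends StObject {
    final String selector;
    final Method javaMethod;         // found once, then invoked without searching

    CompiledMethod(String selector, Method javaMethod) {
        this.selector = selector;
        this.javaMethod = javaMethod;
    }

    // Hypothetical helper: look up the Java method that implements a selector.
    static CompiledMethod forJavaMethod(String selector, Class<?> owner,
                                        String javaName, Class<?>... parameterTypes) {
        try {
            return new CompiledMethod(selector, owner.getMethod(javaName, parameterTypes));
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException("no Java method for " + selector, e);
        }
    }

    StObject doitN(StObject receiver, StObject[] args) {
        try {
            return (StObject) javaMethod.invoke(receiver, (Object[]) args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("send of " + selector + " failed", e);
        }
    }
}
```

With this arrangement no Foo_foo class has to be generated per method; the price is that every send goes through Method.invoke.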

Using native Java classes

We can remove the need for ST.Object and just use java.lang.Object if we re-map the methods in Object. For example the Smalltalk X ~~ Y can be directly compiled (with no method sends) to the Java Boolean.convert(!( X == Y)), where the static method convert returns the appropriate Boolean object given its argument. Similarly, X = Y can be translated into Boolean.convert(X.equals(Y)).

The tricky part of this is that we also have to translate definitions of = to be definitions of equals.
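
Both halves of this re-mapping can be sketched together. StBoolean plays the role of the Smalltalk Boolean class with its static convert method, and StPoint is a hypothetical user class whose Smalltalk definition of = has been translated into a definition of equals; neither name comes from an existing system.

```java
class NativeEqualityDemo {
    public static void main(String[] args) {
        Object x = new StPoint(1, 2);
        Object y = new StPoint(1, 2);

        // Smalltalk: X ~~ Y   (not the same object)
        StBoolean notIdentical = StBoolean.convert(!(x == y));
        // Smalltalk: X = Y    (equal)
        StBoolean equal = StBoolean.convert(x.equals(y));

        System.out.println(notIdentical + " " + equal);   // prints "true true"
    }
}

// Stand-in for the Smalltalk Boolean class: exactly two instances exist.
final class StBoolean {
    static final StBoolean TRUE = new StBoolean("true");
    static final StBoolean FALSE = new StBoolean("false");

    private final String printString;
    private StBoolean(String printString) { this.printString = printString; }

    // Returns the Boolean object that corresponds to a Java boolean.
    static StBoolean convert(boolean b) { return b ? TRUE : FALSE; }

    @Override public String toString() { return printString; }
}

// Receivers are plain java.lang.Object sub-classes; no ST.Object is needed.
class StPoint {
    final int x, y;
    StPoint(int x, int y) { this.x = x; this.y = y; }

    // Translation of the Smalltalk definition of "=" for this class.
    @Override public boolean equals(Object other) {
        return other instanceof StPoint
            && ((StPoint) other).x == x
            && ((StPoint) other).y == y;
    }

    // Java expects hashCode to agree with equals.
    @Override public int hashCode() { return 31 * x + y; }
}
```

Note that the translated equals must return a Java boolean, so the translator also has to unwrap whatever Boolean object the Smalltalk body of = computes.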

Similarly, we could replace Boolean with java.lang.Boolean, Symbol with interned Java Strings, and so on. The further we go in that direction, the more efficient and idiomatic the code is likely to be. But there are dangers. For example, if someone re-defines = to return a value that is not a Boolean (perhaps they like three-valued logics), then we will probably break their code. And if someone depends on Symbol being a sub-class of String, that will break if we represent Symbol using java.lang.String (since that only inherits from Object).