Saturday, May 25, 2013

Status update: Loyc tree rewritten

I have made and implemented the changes that I planned to make to the Loyc tree. The code was rewritten from scratch; the new code is not as fancy or complete, but it's easier to use.
  • Red nodes are gone (and GreenNode is now called LNode, for "Loyc Node")
  • Mutable nodes are gone, and the data type for Args and Attrs has changed to RVList<T> (Reverse V-List--since this tends to make people think of Recreational Vehicles, I'm thinking about changing the name to simply VList, with the forward VList still being named FVList). RVList provides a natural way to support the common pattern of gathering a list of statements and then freezing them into a read-only node; you can create a list with RWList<LNode>, mutate it however you like (similar performance to List<T>), and then when the list is complete, call ToRVList() to convert it to RVList<LNode> instantly.
  • There are now three kinds of nodes: symbols (now called Ids, i.e. identifiers), literals, and calls. I'm considering whether to add a fourth type for nodes that have both a name and a value, perhaps calling it the Token kind (StdTriviaNode already has both a Name and Value, but it is currently a subclass of StdIdNode.).
  • An empty Name is now allowed. A literal now has a blank name (instead of #literal) and a method that calls anything other than a simple symbol will also have a blank Name. Note: the property will still never return null.
  • I decided to represent expressions in parenthesis using a call with the empty target @``, that is, the empty identifier. But now I am considering whether to eliminate the representation of parenthesis completely, because it's a significant inconvenience for node printers: if parenthesis are represented explicitly, a printer cannot add parenthesis to clarify precedence without changing the syntax tree. So far the EC# printer has managed to avoid adding parenthesis in almost all cases, but there's still the issue of "tightly bound attributes": if the "f" part of "f(x)" has an attribute, it must be printed with the special "grouping parenthesis", ##(), as in ##([attr] f)(x). It's clunky, but it works*. But now I'm designing another language (more on that in my next post) and I haven't figured out what to do about this yet. Perhaps I should just define an exception where "the parenthesis don't count if there are attributes inside"; we'll see.
During the redesign I decided on some small changes to the representation of certain expressions in EC#:
  • The '.' operator is now treated more like a normal binary operator; a.b.c is now represented #.(#.(a, b), c) rather than #.(a, b, c) mainly because it's easier that way, and because the second representation doesn't buy anything significant other than a need for special-casing.
  • int x = 0 will now be represented #var(int, x = 0) rather than #var(int, x(0)). I chose the latter representation initially because it is slightly more convenient, because you can always learn the name of the declared variable by calling var.Args[1].Name. However, I decided that it was more important for the syntax tree to be predictable, with obvious connections between normal and prefix notations. Since I decided that alias X = Y; was to be represented #alias(X = Y, #()), it made sense for the syntax tree of a variable declaration to also resemble its C# syntax. There's another small reason: C++ has both styles Foo x(y) and Foo x = y; if Loyc were to ever support C++, it would make sense to use #var(Foo, x(y)) and #var(Foo, x = y) for these two cases, and I believe C#'s variable declarations are semantically closer to the latter. (Note: another possibility was #var(int, x) = 0, but I decided this wasn't an improvement, it would just shift the pain around.)
  • An constructor argument list is required on all types using the #new operator, e.g. new int[] { x } must have an empty set of arguments on int[], i.e. #new(#of(#[],int)(), x); this rule makes the different kinds of new expressions easier to interpret by making them consistent with each other.
  • A missing syntax element is now represented by an empty symbol instead of the symbol #missing.
  • I've decided to adopt the "in-expression" generics syntax from Nemerle as an unambiguous alternative to angle brackets: List.[int] means List<int> and the printer will use this syntax in cases where angle brackets are ambiguous.
  • By popular demand, constructors will be written this(...) instead of new(...), since both D and Nemerle use the latter notation.
  • The \ and $ characters have been swapped; \S now denotes a symbol S, while $S now denotes a substitution. Originally EC# was designed just as an extension of C#, so \ made sense as a substitution operator for string interpolation because it doesn't hurt backward compatibility: "Loaded '\(filename)' successfully". But now that my focus has shifted to multi-language interoperability, $ makes more sense, as it is used for string interpolation in at least five other languages and it makes sense to use the same character for both string substitution and code substitution.
LLLPG is also improved; I have implemented two more major features:
  • Implemented support for terminals of unknown value. I observed that if the inputs are integers (or enums) LLLPG shouldn't have to know the actual integer value of each integer. The user should be able to use a symbolic name like "NUMBER" or "DQString" without actually telling LLLPG what the value is. So I eliminated the code-snippet generator for Symbols and replaced it with GeneralCodeGenHelper, which supports inputs of any data type: integers, enums, Symbols, strings, whatever. Note that switch() statements require constant cases, so switch() must be disabled for non-constant possibilities such as Symbols. A constructor argument controls whether GeneralCodeGenHelper is allowed to generate switch() statements (ideally individual terminal symbols could be marked constant or non-constant, but a global switch will suffice for now.)
  • Replaced redundant Match() calls with Skip(), i.e. if prediction learns that "LA(0) is in the set ('0'..'9')" then a test like MatchRange('0', '9', '.', '.') is redundant and can be replaced by Skip(). Also, I eliminated redundant Check() calls. These two features together are called "prematch analysis".
Only two major features are left and LLLPG 1.0 will be ready:
  • Syntactic predicate support (only semantic predicates work right now)
  • Input from source code, and a command-line interface. Currently grammars have to be constructed with overloaded C# operators and printed at run-time, which of course is clumsy.

No comments: