Thursday, June 13, 2013

LLLPG, Loyc trees, and LES, oh my!

Background: LLLPG

I normally don't let these internal "planning" posts go to CodeProject, but my planning has reached an advanced stage and I think it would be fun to hear some feedback on my new parser generator and my plan for a new programming language called LES or LEL (depending on what you're doing with it). The parser generator is called LLLPG (Loyc LL(k) Parser Generator). "Loyc" means "Language of Your Choice"; it is my project to create a set of generic tools for building programming languages and for translating code between different languages*. Meanwhile, LL(k) is one of the three most popular types of parsers for computer languages (the others being PEG and LALR(1)... perhaps ANTLR's LL(*) is more popular than LL(k) nowadays, but I'm counting LL(*) as a mere variation on LL(k)).

LLLPG is a recursive-descent parser generator, with a feature set (and syntax) comparable to ANTLR version 2. LLLPG 1.0 is about 80% complete. All that's missing is a parser that can accept grammars in the form of text, and a bit of polishing, but the remaining 20% is the most important--no one will want to use LLLPG without a nice, friendly input language, and right now LLLPG cannot parse its own input.

And I won't be using a dedicated input language like ANTLR does; no, LLLPG is designed to be embedded inside another programming language. Initially you will use LLLPG the same way you would use ANTLR, by invoking an exe as a pre-build step at compile-time, to produce C# code which is then compiled. But eventually, LLLPG will merely be a syntax extension to another language, one of a hundred extensions that you might have installed. You will add LLLPG as a compile-time reference, the same way you might add "System.Core" as a run-time reference today. This means that using a parser generator will be no more difficult than using any other library. When I'm done, you'll be able to easily embed parsers anywhere you want them; creating a parser that is generated and optimized at compile-time will be no more difficult than making a "compiled Regex" is today. But of course, a "compiled Regex" is actually compiled at runtime, while LLLPG will work entirely at compile-time*.

LLLPG has a very different design philosophy from ANTLR. ANTLR is a "kitchen sink" parser generator: it tries to do everything for you, it requires its own runtime library, and if you really want to be good at ANTLR, you'll buy the book that explains how to use it. I can't compete with ANTLR on features; Terence Parr has been writing ANTLR and its predecessor for almost 20 years now, I think. Instead, LLLPG is designed for simplicity and speed, and its features tend to be orthogonal rather than intertwined; it is fairly easy to learn, it tends to generate fast parsers, the generated code is simple, easy to follow, and C#-specific, and it does not need a special runtime library, just a single base class that you can write yourself or copy into your project. Here is its feature set:
  • It generates a prediction-based LL(k) parser with no backtracking. The value of k is the maximum lookahead; it can be customized separately for each rule.
  • It warns you about ambiguity, and you can easily suppress irrelevant warnings (earlier alternatives take priority). Like ANTLR, LLLPG warns by example ("Alternatives (2, 4) are ambiguous for input such as «0x_»"). In case of ambiguity with an exit branch, you can mark loops as greedy or nongreedy to suppress the warning.
  • It supports semantic predicates, denoted &{expr}, which are used for disambiguation. If two branches are ambiguous, you can use a semantic predicate to put a condition on one or both branches that will be used to disambiguate; expr is simply a C# expression.
  • Syntactic predicates, denoted &(foo), are very similar, except that instead of running a C# expression, LLLPG tests whether the input matches the syntax "foo". Syntactic predicates, also known as zero-width assertions, can perform unlimited lookahead, so you can use them to escape the limitations of LL(k) grammars.
  • A gate, denoted p => m, is an advanced (but very simple) mechanism that separates "prediction" from "matching".
  • Performance optimizations include "prematch analysis", switch() statement generation, saving the lookahead terminal(s) in variables, and not using exceptions anywhere.
  • Alts (alternatives, i.e. branch points) have two error-handling modes for unexpected input. LLLPG can either (1) simply run the "default branch", or (2) produce an error-handling branch (automatically or manually). The default branch is normally the last branch, but this can be overridden (either to assist error handling or as a performance optimization).
  • Actions, denoted {code;}, are code snippets inserted into the parser.
  • The value of a terminal can be assigned to a variable using x:='a'..'z' to create a variable 'x', x='a'..'z' to assign the value to an existing variable 'x', or xs+='a'..'z' to add the value to a list 'xs'.
When LLLPG is formally released, I will repeat all of this in a CodeProject article, or a whole series of them, with more detail. For now, the above information is just background for the real subject of this post.

Bootstrapping LLLPG

When you're making a parser generator that accepts text input, you naturally want to express the parser generator's own syntax in its own language, and if possible, you'd like to use your own parser generator to parse that language. But without any parser, how do you begin? The process of making a parser that will be able to "parse itself" is called bootstrapping.

I started out by taking advantage of C# operator overloading. I am currently writing parsers using the following C# code:
Final syntax        C# syntax
'x'                 C('x')                    -- or Sym("foo") for the symbol \foo
'a'..'z'            Set("[a-z]")
'[' OtherRule ']'   C('[') + OtherRule + ']'
a | b | c           a | b | c
a / b / c           a / b / c                 -- but a b / c d needs parens: (a + b) / (c + d)
Foo*                Star(Foo)
Foo+                Plus(Foo)
Foo?                Opt(Foo)
&{i==0}             And("i==0")               -- using a special #rawText node to hold "i==0"
{Foo(bar);}         Stmt("Foo(bar)")
a => b              Gate(a, b)
Note that my unit tests have to use the C# syntax, since the tests have to be written before the big grammar that describes LLLPG's final input language.
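
To show how these mappings combine, here is a hedged example in the same style. The rule itself is invented for illustration; Rule, C, Set, Star and Token are the helpers from the table above and from the code-generation example near the end of this post:

// Hypothetical rule, for illustration only: a bracketed list of digits,
// which in the final syntax would be '[' '0'..'9' (',' '0'..'9')* ']'
Rule DigitList = Rule("DigitList",
    C('[') + Set("[0-9]") + Star(C(',') + Set("[0-9]")) + ']',
    Token);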

My original plan for bootstrapping was to make a parser for EC# in C# with LLLPG (EC# is Enhanced C#, and one of its features is a "token literal" that allows embedded DSLs), or at least a parser for just the parts of EC# I needed for bootstrapping. Once this partial parser was written using the above mappings, I could then write a second parser in "partial EC#", and this second parser would be able to parse full EC# as well as itself.

However, as I planned out the parser, I was rather annoyed at how complex the EC# language is (because of its progenitor C#, of course) and I am now planning to create a new syntax that is easier to parse, tentatively called LES, "Loyc Expression Syntax"*.

The syntax tree interchange format

A major part of my plans for Loyc is the concept of an "interchange format" for code. In three words, the concept is "XML for code"--a general representation of syntax trees for any language. The interchange format will
  • Assist tools that convert code between different languages, by providing a way to change a tree incrementally from one language until it conforms to the rules of another language.
  • Assist people writing compilers and compiler extensions: if a compiler uses Loyc trees internally, it can provide a simple textual representation of the compiler's intermediate output at various stages.
  • Promote compile-time metaprogramming to the entire software industry.
Let me be clear, I do not advocate XML as an interchange format. In fact, I hate XML! If I have to type < one more time I could scream. Rather, I want to create a standard language for describing syntax trees, just as XML or JSON is used to describe data. (In fact, my language should be useful for describing data too, just like s-expressions can describe data, although that is not its main purpose).

The interchange format will serialize and deserialize text into an in-memory form called a Loyc tree. Every node in a Loyc tree is one of three things:
  • An identifier, such as the name of a variable, type, method or keyword.
  • A literal, such as an integer, double, string or character.
  • A "call", which represents either a method call, or a construct like a "class" or a "for loop".
Unlike in most programming languages, Loyc identifiers can be any string--any string at all. Even identifiers like \n\0 (a linefeed and a null character) are supported. This design guarantees that a Loyc tree can represent an identifier from any programming language on Earth. Literals, similarly, can be any object whatsoever, but when I say that I am referring to a Loyc tree that exists in memory. When a Loyc tree is serialized to text, obviously it will be limited to certain kinds of literals (depending on the language used for serialization).

Each Loyc node also has a list of "attributes" (usually empty), and each attribute is itself a Loyc tree.
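
To make this concrete, here is a minimal sketch of the data model in C#. This is not the actual Loyc API (the real node type, LNode, appears later in this post); it only illustrates the three kinds of nodes plus the attribute list and source range:

using System.Collections.Generic;

// Sketch only: every node carries attributes and a source range.
abstract class SketchNode
{
    public List<SketchNode> Attrs = new List<SketchNode>(); // usually empty
    public string SourceFile;      // a dummy file if created programmatically
    public int StartIndex, Length; // zero-length if created programmatically
}
class IdNode : SketchNode      { public string Name; }  // any string at all
class LiteralNode : SketchNode { public object Value; } // any object whatsoever
class CallNode : SketchNode
{
    public SketchNode Target;      // e.g. an IdNode such as #for
    public List<SketchNode> Args = new List<SketchNode>();
}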

Loyc trees are closely related to, and inspired by, LISP trees. If you've heard of LISP, well, the Loyc tree is basically a 21st century version of the S-expression. The main differences between s-expressions and Loyc trees are
  1. Each "call" has a "target". Whereas LISP represents a method call with (method arg1 arg2), Loyc represents a method call with method(arg1, arg2). In LISP, the method name is simply the first item in a list; but most other programming languages separate the "target" from the argument list, hence target(arg1, arg2). If you want to create a simple list rather than a method call, the identifier "#" (tentatively) means "list" by convention, as in #(item1, item2, item3).
  2. Each node has a list of attributes. The concept of attributes was inspired by .NET attributes, so it will not surprise you to learn that a .NET attribute will be represented in a Loyc tree by a Loyc attribute. Also, modifiers like "public", "private", "override" and "static" will be represented by attaching attributes to a definition.
  3. A "call", which represents either a method call, or a construct like a "class" or a "for loop". By convention, constructs that are built into a language use a special identifier that starts with "#", such as #class or #for or #public.
  4. Each node has an associated source file and two integers that identify the range of characters that the original construct occupied in source code. If a Loyc tree is created programmatically, a dummy source file and a zero-length range will be used.
Obviously, a text format is needed for Loyc trees. However, I think I can do better than just an interchange format, so I have a plan to make LES into both an interchange format and a general-purpose programming language in its own right.

Since LES can represent syntax from any language, I thought it was sensible to design it with no keywords. So tentatively, LES has no reserved words whatsoever, and is made almost entirely of "expressions". But "expressions" support a type of syntactic sugar called "superexpressions", which resemble keyword-based statements in several other languages.

I've made a somewhat radical decision to make LES partially whitespace-sensitive. I do not make this decision lightly, because I generally prefer whitespace to be ignored inside expressions*, but the whitespace sensitivity is the key to making it keyword-free. More on that later.

My original plan was to use a subset of Enhanced C# as my "XML for code". However, since EC# is based on C# it inherits some very strange syntax elements. Consider the fact that (a<b>.c)(x) is classified as a "cast" while (a<b>+c)(x) is classified as a method call. (As I am writing this in HTML, the < made me scream just now. Sorry for the noise.) Features like that create unnecessary complication that should not exist in an AST interchange format.

Meet LES: tentative design

As I describe the design for Loyc Expression Syntax, keep in mind that this is all tentative and I could be persuaded to design it differently. I have the following goals:
  • Concision: LES is not XML! It should not be weighed down by its own syntax.
  • Familiarity: LES should resemble popular languages.
  • Utility: although LES can represent syntax trees from any language, its syntax should be powerful enough to be the basis for a general-purpose language.
  • Light on parentheses: I'm sick of typing lots of parentheses in C-family languages. LES is my chance to decrease the amount of Shift-9ing and Shift-0ing that programmers do every day.
  • Simplicity: the parser should be as simple as possible while still meeting the above goals, because (1) this will be the default parser and printer for Loyc trees and will be bundled with the Loyc tree code, so it should not be very large, and (2) this is the bootstrapping language for LLLPG so (for my own sake) it can't be too complex.
First let me describe a simple subset of LES, the "prefix notation". Prefix notation is nothing more than expressing everything as if it was made out of method calls. Basically it's LISP, but with Loyc trees instead of S-expressions.

For example, C# code like
   if (x > 0) {
      Console.WriteLine("{0} widgets were completed.", x);
      return x;
   } else
      throw new NoWidgetsException();
might be represented in prefix notation as
   #if(@#>(x, 0), @`#{}`(
      @#.(Console, WriteLine)("{0} widgets were completed.", x),
      #return(x)
   ),
      #throw(#new(NoWidgetsException()))
   );
The prefix "@" introduces an identifier that contains special characters, and then the backquotes are added for identifiers with very special characters. So in @`#{}`, the actual identifier name is #{}, which is the built-in "function" that represents a braced block. Similarly @#> and @#. represent the identifiers "#>" and "#.". Please note that '#' is not considered a special character, nor is it an operator; in LES, # is just an identifier character, just as an underscore _ is an identifier character in the C family. However, # is special by convention, because it is used to mark identifiers that represent keywords and operators in a programming language.

Pure prefix notation is a bit clunky, so it will not be used very much. LES will have built-in operators so that the above code can be represented in a more friendly way:
   #if x > 0 {
      Console.WriteLine("{0} widgets were completed.", x);
      #return x;
   }
      #throw #new(NoWidgetsException());
However, this notation might be confusing when you first see it. "Why is there no #else?" you might wonder. "How do I even begin to parse that," you might say.

Well, here's my plan:
  • Braces {...} are simply subexpressions that will be treated as calls to the special identifier #{}. Inside braces, each argument to the #{} pseudo-function will end with a semicolon (the semicolon on the final statement is optional).
  • To make an identifier that contains punctuation, use the @ prefix. Such identifiers can contain both punctuation and identifier characters, e.g. @you+me=win!. Note that when I talk about "punctuation", the characters , ; { } ( ) [ ] ` " # do not count as punctuation (# counts as an identifier character). If you need extra-special characters such as these, enclose the identifier in backquotes: @`{\n}`. Escape sequences are allowed within the backquotes, including \` for the backquote itself. Identifiers can contain apostrophes without using @, e.g. That'sJustGreat (there are several languages that use apostrophes in identifiers). However, an apostrophe cannot be the first character or it will be interpreted as a character literal (e.g. 'A' is the character 'A').
  • There will be an unlimited number of binary operators such as +, -, !@$*, >>, and so on. An operator such as &|^ will correspond to a call to #&|^, e.g. x != 0 is simply a friendly representation of @#!=(x, 0). Simplicity being an important design element of LES, the precedence of each operator will be determined automatically based on the name of the operator and a fixed set of rules (a code sketch of this idea follows this list). For example, the operator ~= will have a very low precedence based on a rule that says "an operator whose name ends in '=' will have the same precedence as the assignment operator '='". This design allows LES to be "context-free"; an LES parser is virtually stateless and does not need a "symbol table" of any kind. A person reading LES only needs to know the precedence rules in order to figure out the precedence of a new operator that he or she has never seen before.
  • Operators whose names do not start with # can be formed with the \ prefix. Such operators can contain alphanumerics and punctuation, e.g. \div and \++ are binary or prefix operators; a \div b is equivalent to div(a, b) and a \++ b is equivalent to @++(a, b). Again, if you need extra-special characters, enclose the operator in backquotes: \`/*\n*/`.
  • LES will have "super-expressions", which are multiple expressions juxtaposed together with no separator except a space. If a "super-expression" contains more than one expression, the first expression contains a call target and the other expressions are arguments. For example, a.b c + d e; parses as three separate expressions: a.b, c + d, and e. After gathering this expression list, the parser treats the final "tight" expression in the first expression as "calling" the others (explanation below). Thus, together, the three expressions will be shorthand for a.b(c + d, e);.*
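To illustrate the precedence-from-name rule mentioned above, here is a hedged sketch in C#; the specific levels and classifications are invented for illustration and are not a final specification:

static class LesPrecedence
{
    // Sketch only: derive a binary operator's precedence from its name.
    public static int Binary(string op)
    {
        // Assumed rule: an operator whose name ends in '=' (other than the
        // comparison operators) gets the same precedence as assignment.
        if (op.EndsWith("=") && op != "==" && op != "!=" && op != "<=" && op != ">=")
            return 0;
        switch (op[0]) {          // otherwise classify by the first character
            case '*': case '/': case '%': return 7;
            case '+': case '-':           return 6;
            case '<': case '>':           return 4;
            case '=': case '!':           return 3;  // == and !=
            case '&':                     return 2;
            case '|': case '^':           return 1;
            default:                      return 5;  // assumed default
        }
    }
}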
I realize that "superexpressions" might be confusing at first, but I think this design allows a rather elegant way to write "expressions" as if they were "statements". For example, the following is a valid expression:
if a == null {
  b();
} else {
  c();
};
and it actually means
if(a == null, { b(); }, else, { c(); });
which, in turn, is shorthand for
if(@#==(a, null), @`#{}`( b(); ), else, @`#{}`( c(); ));
All three inputs are parsed into the same Loyc tree.

Please note that none of this has any "built-in" meaning to LES. LES doesn't care if there is an "else" or not, it doesn't care if you use "#if" or "if", LES doesn't give any built-in semantics to anything. LES is merely a syntax--a text representation of a Loyc tree.

So the answer to questions like "why is there no #else" or "why did you use '#if' in one place, but 'if' in another place" is, first of all, that LES doesn't define any rules. Programming languages define these kinds of rules, but LES is not a programming language. It is just a textual representation of a data structure.

It is easy enough for the parser to parse a series of expressions. Basically, whenever you see two identifiers side-by-side (or other "atoms" such as numbers or braced blocks), like foo bar, that's the boundary between two expressions in a superexpression. When the parser sees a superexpression, it must decide how to join the "tail" expressions to the "head" expression. For example, this superexpression:
a = b.c {x = y} z;
consists of the expressions "a = b.c", "{x = y}" and "z". How will these expressions be combined? My plan is basically that the other expressions are added to the first expression as though you had written this instead:
a = b.c({x = y}, z);
So it's as if the first space character that separates the first and second expressions is replaced with '(', a ')' is added at the end, and commas are added between the other expressions. "b.c" is what I am calling a "tight expression" because the operators in the expression (e.g. ".") are tightly bound (high precedence), and the expression is normally written without space characters. Confused yet?

The point of a "superexpression" is that you can write statements that look as if they were built into the language, even though they are interpreted by a fixed-function parser that has no idea what they mean. For example, in LES you can write
try {
   file.Write("something");
} catch(e::Exception) {
   MessageBox.Show("EPIC I/O FAIL");
} finally {
   file.Dispose();
};
And the parser handles this without any clue what "try", "catch" and "finally" mean.

Superexpressions also allow you to embed statement-like constructs inside expressions, like this:
x = (y * if z > 0 z else 0) + 1;
and the LES parser will understand that "z > 0", "z", "else" and "0" are arguments in a call to "if".

A syntax highlighting engine for LES should highlight the "tight expression" to indicate how the superexpression is interpreted. This syntax highlighting rule would highlight words like "if" and "try" when they are used this way (but not "else" or "finally", which are merely arguments to "if" and "try"; the parser cannot know that they have any special meaning.)

It's not quite as simple as I implied, though. If the first expression already ends with a method call:
a = b.c(foo) {x = y} z;
I think it would be better to interpret that as
a = b.c(foo, {x = y}, z);
rather than
a = b.c(foo)({x = y}, z);
A motivation for this exception is explained later.
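
Here is a hedged sketch of the joining step in C#, covering both the ordinary case and this exception. The Expr type and its fields are stand-ins for the real Loyc node type, invented for illustration:

using System.Collections.Generic;

class Expr
{
    public string Name;           // identifier or operator name, if any
    public Expr Target;           // non-null if this node is a call
    public List<Expr> Args;       // non-null if this node is a call
    public bool HasParenArgList;  // true for "b.c(foo)", false for "b.c"
}

static class SuperExprSketch
{
    // 'tight' is the tight expression at the end of the head expression.
    public static Expr Join(Expr tight, List<Expr> tail)
    {
        if (tight.HasParenArgList) {
            // a = b.c(foo) {x = y} z  =>  a = b.c(foo, {x = y}, z)
            tight.Args.AddRange(tail);
            return tight;
        }
        // a = b.c {x = y} z  =>  a = b.c({x = y}, z)
        return new Expr { Target = tight, Args = new List<Expr>(tail),
                          HasParenArgList = true };
    }
}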

Only one superexpression can exist per subexpression. For example, an expression with two sets of parentheses, ... (...) ... (...) ... can contain at most three superexpressions: one at the outer level and one in each set of parentheses.

You cannot "nest superexpressions" without parenthesis. for example, an expression of the form if x if y a else b else c can be parsed as if(x, if, y, a, else, b, else, c), but it won't mean anything because the second "if" is merely one of 8 arguments to the first "if"; there is nothing to indicate that it is special.

Did you notice how this plan is space-sensitive?

In LES you normally separate expressions with a space. Operators--that is, sequences of punctuation--are assumed to be binary if possible and unary if not. So "x + y" and "- z" are understood to be single expressions with binary and unary operators, respectively. When you put two of these expressions side-by-side, such as "- z x + y", it is seen as two separate expressions. If you write "x + y - z", on the other hand, this is understood as a single expression because "-" is assumed to be a binary operator in the second case. "x + y (- z)" is parsed as two expressions again. The whitespace between operators and arguments doesn't matter much; "x+y (-z)" is also acceptable and means the same thing. But the space between the expressions is important; "x+y (-z)" is two separate expressions, while "x+y(-z)" is a single expression meaning "x + (y(-z))"; here, y is the target of a method call.

I know it's not ideal. But in my experience, most programmers write
 if (...) {...}
 for (...) {...}
 lock (...) {...}
 while (...) {...}
 switch (...) {...}
with a space between the keyword and the opening paren, while they write
 Console.WriteLine(...);
 var re = new Regex(...);
 F(x);
with no space between the identifier and the method being called. So in a language that has no real keywords, it seems reasonable, although not ideal, to use a spacing rule to identify whether the first word of an expression is to be treated as a "keyword" or not.

So the spacing rule exists to identify "keyword statements" in a language that has no keywords. LES can parse anything that looks like a keyword statement:
 if ... {...};
 for ... {...};
 lock ... {...};
 while ... {...};
 switch ... {...};
 unless ... {...};
 match ... {...};
But notice that a semicolon is needed at the end of each statement to separate it from the following statement. No parentheses are required around the "subject" (e.g. the "c" in "if c"), but if parentheses are present, the space character will ensure that the statement is still understood correctly. Statements like the above are automatically translated into an ordinary call,
 if(..., {...});
 for(..., {...});
 lock(..., {...});
 while(..., {...});
 switch(..., {...});
 unless(..., {...});
 match(..., {...});
There is another way that LES is whitespace-sensitive. Since LES contains an infinite number of operators, two adjacent operators must have spaces between them. For example, you cannot write x * -1 as x*-1 because *- will be seen as a single operator named #*-.

Attributes are expressed in LES with the square brackets:
[Flags] enum Emotion { 
   Happy = 1; Sad = 2; Angry = 4; Horndog = 65536;
};
There are some more details I could mention about LES, but you get the basic idea.

Meet LEL: the exact same language

LEL ("Loyc Expression Language") will be a programming language built on top of LES ("Loyc Expression Syntax"). It will use exactly the same parser as LES; in fact we can say that it "uses" LES, because LES consists of a parser and nothing else. LEL will then give meaning to the syntax. For example, in LES the word "if" is just an identifier, it has no built-in meaning. But in LEL, "if" will be defined as a standard macro that creates a standard "#if" expression, which will be part of some sort of common language that EC# and LEL will both use internally.

So in LEL,
if a > 0 {
  b();
} else {
  c();
};
means just what you would expect: call b() if a > 0, otherwise call c(). However, LEL will not be a typical programming language. "if" will not be a built-in statement like it is in C#, for example. Instead, LEL will be a macro-heavy language, and "if" will be a standard macro. A macro, if you didn't know, is a special kind of method that runs at compile-time. Of course, LEL macros will be unrelated to C/C++ macros, which give the word "macro" a bad name. Like C/C++ macros, it is quite possible to abuse LEL macros; but unlike C/C++ macros, LEL macros will be fantastically powerful.

I don't have the details worked out about the macro system yet; I'll be sure to read more about Nemerle for inspiration.

The semantics of LEL will be based on EC#, and at first it will compile down to plain C# so it will not be able to surpass the runtime capabilities of C#, although it will still allow crazy-named identifiers by mapping punctuation to letters; for example @Save&Close will be represented in C# as something like Save_and_Close.

Things like classes and using statements are not hard to support with LES superexpressions, and I'm thinking about using :: for variable declarations:
using System.Collections.Generic;
[pub] class Point(ICloneable.[Point])
{
   [pub] X::int; // pub for public, see?
   [pub] Y::int;
};
I'm not an abbreviation freak, but [public] requires brackets around it, which is unfortunate, so I'll compensate for this extra syntactic noise by letting you abbreviate the attributes: public to pub, private to priv, static to stat, virtual to virt, and so on. Also, as in C#, you can separate attributes with a comma or use multiple bracket groups: [pub,stat] and [pub][stat] are equivalent.

To allow :: for variable declarations, I'm giving :: a precedence below dot (.) even though :: has the same precedence as dot in C# and C++. This allows s::System.String to parse as s::(System.String). (You might wonder why I don't just use a single colon : for this; well, for now I'll keep it as my little secret.)

".[...]" is the syntax for generic type arguments; ICloneable.[Point] has the interpretation #of(ICloneable, Point) which means ICloneable<Point> in C#. I could consider using the D syntax instead (ICloneable!Point), it's not set in stone. However, the C++/Java/C# syntax ICloneable<Point> is out of the question. It would be hard to parse and it makes the "superexpression" concept highly ambiguous.

It may seem possible to support C-style variable declarations like
String x;
But it really isn't; String x just means String(x)--it's an ordinary function call.

The method declaration syntax is a bit of a head-scratcher. I'm thinking about using this syntax:
[pub] Square(x::int)::int { x * x };
which is parsed as
[pub] (Square(x::int)::int)({ x * x });
[pub] @#::(Square(x::int), int)({ x * x }); // equivalent
[pub] @#::(Square(@#::(x, int)), int)(@`#{}`(x * x)); // equivalent
I'm just not sure how to write the specification for LEL in such a way that the compiler can make sense out of this peculiar representation. I'd also, just this once, like to lift the restriction against spaces between the method name and its arguments, so that you can write
@pub Square (x::int)::int { x * x };
but this form has the rather different syntax tree
@pub Square((x::int)::int, { x * x });
And this would be essentially indistinguishable from a call to a macro called "Square", so I'm a bit stumped. Hmm...

Let's talk about this weird spacing rule again, because you might wonder what happens if the spacing is incorrect. Obviously, the mistakes will be (1) to forget a space before '(', and (2) to use a space before '(' where there should not be one.

First consider the following example:
if(c) { a(); } else { b(); };
This will actually still work most of the time, because if there is already an argument list on the end of the first expression, the other expressions are added to that existing argument list (as I mentioned earlier). So this example is interpreted the same way as the correct version,
if (c) { a(); } else { b(); };
which means
if((c), { a(); }, else, { b(); });
But there is another version of this mistake that looks like this:
if(x+y) > z { a(); } else { b(); };
And this can be parsed as
if(x+y) > z({ a(); }, else, { b(); });
The 'if' macro itself could produce a somewhat understandable error message like "'if' expects 2 or 3 arguments but received only one. It looks like you forgot a space after 'if'." This would be followed by a complaint about there not being any method 'z' that takes 3 arguments.

The second mistake, adding a space when there should not be one, has different effects depending on the situation. In very simple cases like this one,
foo (x);
The interpretation is foo((x)); so it works. But in other cases, such as
x = foo (x) + 1;
a = foo (b, c);
The interpretations are different:
x = foo((x) + 1);
a = foo((b, c));
In the second case, (b, c) is parsed as a tuple (I haven't decided on the representation for tuples in LES yet), that is passed as a single parameter to foo().

Unfortunately, the code can still be parsed, but changes meaning. The primary defense against both of these mistakes is syntax highlighting: foo will show up in a different color when it is followed by a space than when it is not. I only hope that this distinction will be enough.

Assuming that foo() is a method, I suppose that LEL could prefer the "superexpression" syntax for macros, and issue a warning if a superexpression is used to call an ordinary method (the parser records the fact that "superexpression" syntax was used in the NodeStyle).

My current thinking is that the standard macros of LEL will essentially convert LEL into "built-in" constructs which are then compiled, and these constructs will be identical or very similar to EC# constructs, so that LEL and EC# can share the same back-end. With this design, LEL can easily be converted to C#, since I am planning to write an EC#-to-C# compiler. So the "if" macro, for example, will take input such as
if (a) b else c;
and produce output such as
#if (a) b c;
or in other words,
#if((a), b, c);

And now back to LLLPG

Remember, besides designing two new programming languages (LEL and EC#) I'm also writing a parser generator that will be used to describe both of these languages. After writing an LES parser, I will use it to describe the full grammar of EC#.

Please note that I already wrote a complete node printer for the EC# language, and the parser generator uses this to generate plain C# code (EC# is a superset of C#, so the parser generator simply generates a Loyc tree representing C# code, and prints it out using the EC# node printer.)

Right now, I can define a grammar in plain C# like this:
Rule Int = Rule("Int", Plus(Set("[0-9]")), Token);
// represents the grammar '0'..'9'+, which in LES might be written
//    Int @[ '0'..'9'+ ];
// Here, @[...] is a "token literal". It is tokenized by LES, but
// otherwise uninterpreted. This allows the LLLPG macro that is 
// parsing the grammar to interpret it instead.
and then I can generate C# code like this:
LLLPG.AddRule(Int);
LNode result = LLLPG.GenerateCode();
string CSharpCode = result.Print();
Where "LNode" is a "Loyc node", the data type of a Loyc tree. Here is the generated code:
{
  private void Int()
  {
    int la0;
    MatchRange('0', '9');
    for (;;) {
      la0 = LA0;
      if (la0 >= '0' && la0 <= '9')
        Skip();
      else
        break;
    }
  }
}
My end goal is to let you write a grammar in EC# syntax, which will look something like this:
using System;
using System.Collections.Generic;

[[LLLPG(IntStream)]]
class MyLexer {
  private Number ==> @[ 
    &('0'..'9'|'.')
    '0'..'9'* ('.' '0'..'9'+)?
  ];
  private Id      ==> @[ IdStart IdCont* ];
  private IdStart ==> @[ 'a'..'z'|'A'..'Z'|'_' ];
  private IdCont  ==> @[ IdStart|'0'..'9' ];
  private Spaces  ==> @[ ' '|'\t'|'\n'|'\r' ];
  public  Token   ==> @[ Id | Spaces | Number ];

  public void NormalMethod()
  {
    NormalCSharpCode();
  }
}
"==>" is the new EC# "forwarding operator". Never mind what that means, because in this case I have hijacked it for expressing LLLPG rules. Again, @[...] is a token literal that LLLPG will parse. "IntStream" refers to a helper class for code generation; it means that the input is a series of integers (actually characters, plus -1 to represent EOF).

So how do I get from here to there? Here's my plan for the bootstrapping process.
  1. First, I'll write a grammar for LES in C#, using LLLPG for code generation.
  2. Next, I'll write a skeleton version of LEL that can run simple non-hygienic macros (hygienic macros I can add later, when more infrastructure exists).
  3. One of the macros will gather input for LLLPG. Keep in mind that at this point there is no parser for the token literals--no parser that can understand input like "0x" ('0'..'9'|'a'..'f')+. So instead I will use LES as an intermediate language. LES supports suffix operators that start with "\\", so I could express a grammar with a syntax like this:
    LLLPG' IntStream (
      [pub] class MyLexer() {
        [priv] rule Number
          &('0'::'9'|'.') +
          '0'::'9'\\* + ('.' + '0'::'9'\\+)\\?;
        [priv] rule Id      IdStart IdCont\\*;
        [priv] rule IdStart 'a'::'z'|'A'::'Z'|'_';
        [priv] rule IdCont  IdStart|'0'::'9';
        [priv] rule Spaces  ' '|'\t'|'\n'|'\r';
        [pub]  rule Token   Id | Spaces | Number;
      }
    );
    
    Here I have changed '..' to '::' because '::' has higher precedence than '|' and '+' ('..' has lower precedence than both of them). The backslashes "\\" inform the LES parser that the "\\*" and "\\?" operators are suffix operators, not binary operators. This syntax isn't entirely comfortable, but it's better than the C# representation.
  4. The LLLPG macro will take the Loyc tree that came from the LES parser and translate it into a series of "Rule" objects. These objects are the "real" input to LLLPG's core engine. This conversion process will be written in plain C#.
  5. Next, I'll use this syntax to write a parser for the token literals. This parser will simply translate the token literals into the form you see above, so
    LLLPG IntStream (
      [pub] class MyLexer() {
        [priv] Number @[ 
          &('0'..'9'|'.') 
          '0'..'9'* ('.' '0'..'9'+)?
        ];
        [priv] Id      @[ IdStart IdCont* ];
        [priv] IdStart @[ 'a'..'z'|'A'..'Z'|'_' ];
        [priv] IdCont  @[ IdStart|'0'..'9' ];
        [priv] Spaces  @[ ' '|'\t'|'\n'|'\r' ];
        [pub]  Token   @[ Id | Spaces | Number ];
      }
    );
    
    is translated to the form above. Thus, translating from LES to LLLPG Rules is a two-stage process with two separate interpreters. The reason for doing it this way is to allow third-party macros to run in between the two stages. My current thinking is that LEL could contain "high priority" and "low priority" lexical macros. First the high-priority macros run, then the low-priority ones run on a separate pass.

    Notice that I used "LLLPG'" in the first code listing but "LLLPG" in the second listing. LLLPG will be the high-priority macro and LLLPG' the low-priority second stage. This allows other lexical macros to process the code after the first stage, which will help end-users to do fancy things like generating grammars at compile-time.

    To tell you the truth, I have never seriously programmed in a language with a LISP-style macro system, so my approach may be flawed. I guess we'll find out.
  6. With both stages of LLLPG fully implemented in LES, I can finally write a parser for the entire EC# language, in LES. (And remember, the LES code is translated to C# for compilation.) Once that's done, LLLPG can also support EC# as a host language with minimal changes. Bootstrapping complete!
I hereby open the floor to your comments and questions.

Saturday, June 1, 2013

D-style ranges in C# (.NET)

Ranges are the most important concept in the design of modern D collection code, and I want them to be included under the Loyc umbrella (but for now, don't worry what Loyc is.)

As I wrote in an as-yet unpublished article about the design of my new language, Enhanced C#:
The really key thing I like about .NET is that it is specifically a multi-language, multi-OS platform with a standard binary format on all platforms--much like Java, but technically superior. It's only "multi-OS", of course, insofar as you avoid Windows-only libraries such as WPF and WCF, and I would encourage you to do so (if your boss will allow it). .NET solves the "integration gap" as long as the languages you want to mash up are available in .NET.

Without a multi-language runtime, interoperability is limited to the lowest common denominator, which is C, which is an extremely painful limitation. Modern languages should be able to interact on a much higher level.

A multi-language platform avoids a key problem with most new languages: they define their own new "standard library" from scratch. You know what they say about standards, that the nice thing about standards is that there are so many to choose from? I hate that! I don't want to learn a new standard library with every new language I learn. Nor do I want to be locked into a certain API based on the operating system (Windows APIs and Linux APIs). And all the engineering effort that goes into the plethora of "standard" libraries could be better spent on other things. The existence of many standard libraries increases learning curves and severely limits interoperability between languages.

In my opinion, this is the most important problem .NET solves; there is just one standard library for all languages and all operating systems (plus an optional "extra" standard library per language, if necessary). All languages and all operating systems are interoperable--it's really nice! The .NET standard library (called the BCL, base class library) definitely could be and should be designed better, but at least they have the Right Idea.
Unfortunately, while the .NET standard library has a lot of useful functionality, it has a lot of missing pieces, and pieces whose design was poorly thought out.

Since Loyc is all about interoperability between languages, it makes sense that there should be a "Loyc standard library" which provides similar functionality in a variety of programming languages. The .NET BCL is a good start, but not quite good enough. I certainly haven't had time to design a full-fledged standard library, but I'd like to talk today about some ideas I have about collections.

One of the most important elements of any standard library is the way it handles collections. Unfortunately, .NET's collection interfaces are a "bare minimum" design. There's IEnumerable(T), ICollection(T), IList(T), IDictionary(T) and that's about all.
  • Read-only lists must use the same interface as writable lists. This reduces type safety, and a read-only list is incompatible with .NET generics variance. Plus, implementing the entire IList(T) interface is a huge pain in the butt when you only want to provide read-only access.
  • Similarly for read-only collections: you can either implement just IEnumerable(T)--but then Count and Contains() are inaccessible--or the full ICollection(T) interface with Clear(), Add(), Remove(), and even implementing IsReadOnly and CopyTo is a chore.
  • There are no interfaces with AddRange() or InsertRange()
  • There is no support for slices (sections of a list or a sequence)--this is the single biggest problem in my opinion.
  • Although an enumerator can sometimes be restarted from the beginning with Reset(), you cannot save the current position and start from there later.
  • There is no specific support for backward enumeration, even though most collections could easily do it.
  • Various elements of the design are less efficient than they could be, such as the need to call two interface methods per iteration of IEnumerable, or the fact that two separate operations are required to perform operations such as "add value to dictionary if key does not already exist" (known as AddIfNotPresent in my new mutable dictionary, MMap<T>) or "get value and remove pair" (GetAndRemove). A code sketch of the dictionary example follows this list.
  • There's lots of other stuff one could ask for.
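To make the point about two-step operations concrete, here is what the BCL forces today. AddIfNotPresent and GetAndRemove are written here only as plain helper methods to illustrate the cost:

using System.Collections.Generic;

static class TwoLookupDemo
{
    // Each of these requires two hash lookups with a standard Dictionary;
    // a better-designed interface could do each job with one.
    public static void AddIfNotPresent<K, V>(Dictionary<K, V> dict, K key, V value)
    {
        if (!dict.ContainsKey(key)) // lookup #1
            dict[key] = value;      // lookup #2
    }
    public static bool GetAndRemove<K, V>(Dictionary<K, V> dict, K key, out V value)
    {
        if (dict.TryGetValue(key, out value)) // lookup #1
            return dict.Remove(key);          // lookup #2
        return false;
    }
}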
Here are some of the collection interfaces my Loyc.Essentials library includes already:
  • ISource(out T): IEnumerable(T) plus ICount, which represents the Count property alone (covariant)
  • IListSource(out T): a simple read-only list collection (covariant)
  • INegListSource(T), INegArray(T) and INegAutoSizeArray(T): interfaces that represent lists that do not start at index zero
  • IPush(T), IPop(T), IStack(T), IQueue(T): interfaces with "push" and "pop" actions; IStack(T) and IQueue(T) are identical, but represent different behaviors
  • ISetImm(T,SetT): interface that represents an immutable set, with sub-interfaces ISetOperations(T,SetT) and ISetTests(SetT). (SetT is the type that implements the interface. It's a bit inconvenient to use interfaces parameterized on the set type itself, but I found that actually implementing a set interface that is not parameterized on itself is a damned inconvenience, but I digress.)
  • IArray(T): interface that represents a collection that can be modified, but cannot be resized.
  • IAutoSizeArray(T): an IArray(T) that resizes itself automatically when you assign an item to a new index.
  • IAddRange(T), IListRangeMethods(T): interfaces that contain additional collection methods such as AddRange() and RemoveRange()
  • ISinkCollection(in T), ISinkArray(in T), ISinkList(in T): interfaces that represent collections you can write to, but not read from (contravariant).
  • ICollectionEx(T): combines ISource(T), ISinkCollection(T), IAddRange(T) and RemoveAll(Predicate(T)), and of course ICollection(T).
  • IListEx(T): combines ICollectionEx(T), IArray(T), ISinkList(T), and of course IList(T).
I have observed that there are never quite enough interfaces: some people want an interface that includes AddRange(...), others will want an interface with RemoveAll(...); somebody will implement a collection that can Add() but not Remove() items; and almost no one actually calls CopyTo(). To meet everyone's needs, Go-style interfaces, which .NET does not support, would really help. With these, the missing interfaces can be retrofitted onto existing classes. And certainly Go-style interfaces are the best workaround for a standard library like the .NET BCL, whose built-in interfaces are so terribly impoverished.

But without support for Go-style interfaces, the best alternative is to define a large number of interfaces, and to provide adapter types to coerce old collections to the new interfaces.

Anyway, all of this is just background for the real topic of today's post: Ranges.

Ranges are an improvement on the C++ concept of iterators. I don't exactly know how ranges were invented in D, but perhaps someone noticed that most of the C++ STL algorithms require pairs of iterators:
bool            all_of(InputIterator first, InputIterator last, UnaryPredicate pred);
Function      for_each(InputIterator first, InputIterator last, Function fn);
InputIterator     find(InputIterator first, InputIterator last, const T& val);
difference_type  count(InputIterator first, InputIterator last, const T& val);
ForwardIterator unique(ForwardIterator first, ForwardIterator last, BinaryPredicate pred);
void           reverse(BidirectionalIterator first, BidirectionalIterator last);
void              sort(RandomAccessIterator first, RandomAccessIterator last, Compare comp);
ForwardIterator lower_bound(ForwardIterator first, ForwardIterator last, const T& val);
In fact, most STL algorithms require exactly two iterators--a range--and none require only a single iterator (I think). Passing ranges as pairs of iterators tends to be highly cumbersome; you are constantly repeating yourself:
push_heap(currentNode().children.begin(), currentNode().children.end());
I once read an in-depth article that explained all of this in more detail--I can't find the article now, but it discussed a "corner case" in C++ that requires three iterators instead of two, as well as how STL algorithms that return a single iterator (within a range) can map to D by returning a range. Since D does not use the iterator concept, if a D algorithm wants to return a reference to a single element within a range, instead of returning an iterator, it will return a range whose "front" is the single element that the algorithm wanted to return. But this range can be moved forward as usual with "popFront". Thus, for example, if you search for "6" in a list of numbers from 1 to 10, the search algorithm would return a range from 6 to 10.

I frequently run into difficulty in .NET because it does not have a "range" concept. We have LINQ, but that's not really enough. For instance if I want to get a sequence representing the second half of an array, I could write
     array.Skip(array.Length/2)
But the Skip() operation just wasted array.Length/2 loop iterations skipping elements one at a time, wasting a lot of CPU time, and the return value is no longer array-like--it's just an IEnumerable that can only be read from beginning to end. A lot of programmers are willing to "solve" the problem by throwing away CPU time:
     array.Skip(array.Length/2).ToArray() // voilà!
but that's not good enough for me.

I'd like to solve many of .NET's limitations by bringing ranges from D to C#. D ranges are explained in a tutorial at this link. In summary, D has four kinds of ranges:
  • InputRange: requires the empty, front and popFront() member functions
  • ForwardRange: additionally requires the save member function
  • BidirectionalRange: additionally requires the back and popBack() member functions
  • RandomAccessRange: additionally requires the [] operator, and a length property if the range is finite
Now, there are some important differences between D and C#, so D-style ranges cannot be ported directly to C#:
  • D has templated methods that can examine an object at compile time to decide how to treat it. For example, to support output ranges, the put(range, element) operation can call range.put(element) if it exists, or if there is no put() method, it can modify range.front and then call range.popFront() instead.
  • D ranges are generally easier to copy (although this is not guaranteed).
The issue of how to copy a range is most irksome. To make this situation clear, imagine a C++ iterator. C++ iterators are modeled after C++ pointers, and if you copy a pointer:
   int* x = &array[i];
   int* y = x;
   x++;
Then obviously, the two pointers are independent: modifying x did not affect y. The simple '=' operator makes a copy, and copies are independent of each other.

So that's how C++ iterators work too. When you transfer an iterator with '=', you expect the copies to be independent, and they are (when implemented properly).

Since ranges are essentially just pairs of iterators, ideally, you would expect ranges to work the same way (although I hear that in D, it doesn't always work that way.)

Unfortunately, in .NET it is generally impossible to make ranges work that way. Certainly some ranges can work that way, such as a simple array range, which would naturally be a struct type:
struct ArraySlice<T> : IRange<T> {
  T[] _array;
  int _start, _count;
  ...
  public T this[int index] { 
    get { 
      /* TODO: check if index is out of range */ 
      return _array[_start + index];
    }
  }
  ...
}
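Given a complete ArraySlice(T) along these lines (the constructor shown is an assumption), the "second half of an array" problem from earlier becomes a constant-time operation:

int[] array = { 1, 2, 3, 4, 5, 6, 7, 8 };
// No elements are copied or skipped one at a time; the slice just stores
// a reference to the array plus a start index and a count.
var secondHalf = new ArraySlice<int>(array, array.Length / 2,
                                     array.Length - array.Length / 2);
int six = secondHalf[1]; // still O(1): returns _array[_start + 1]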
Unfortunately, using '=' to make a copy is impossible in general. First of all, if IRange(T) is an interface that represents a random-access range, simply casting ArraySlice(int) to IRange(int) will defeat the '=' operator, so that it no longer makes copies.

And even if every range type were a struct, that still doesn't mean that they could be safely copied with '='. Consider a forward range that walks a B+ tree. In order to walk the tree, it is necessary to have a stack--a list of parent nodes, which will be an array or perhaps a Stack(T):
struct TreeRange<T> : IFRange<T> { // forward range
  private Stack<Pair<InternalNode<T>, int>> _parents;
  private LeafNode<T> _curNode; // node being walked right now
  private int _index;           // the current item in _curNode
  ...
}
But C# does not allow any action to be taken when a struct is copied, so if the array or Stack is stored in a struct, the Stack will not be copied when the '=' operator is used, even though all the other fields are copied. The two copies will have independent _index and _curNode fields, but share the same Stack! Therefore, after you've copied the range, advancing one of the copies of the range will (after a few iterations, anyway) completely screw up the other copy, making it behave in some bizarre way.
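
The hazard is easy to demonstrate with a toy struct. Since Stack(T) is a reference type, '=' copies the reference, not the stack (this is just an illustration, not the real TreeRange):

using System;
using System.Collections.Generic;

struct BadRange
{
    public Stack<int> Parents; // reference type: shared between copies!
    public int Index;          // value type: copied independently
}
class CopyHazardDemo
{
    static void Main()
    {
        var a = new BadRange { Parents = new Stack<int>(), Index = 0 };
        var b = a;             // bit-for-bit copy
        b.Index++;             // a.Index is still 0: the fields diverged
        b.Parents.Push(42);    // ...but a.Parents now contains 42 too!
        Console.WriteLine(a.Index);         // prints 0
        Console.WriteLine(a.Parents.Count); // prints 1: the copies interfere
    }
}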

D mostly solves this problem with a "postblit constructor": like C#, D copies value types bit-for-bit; but unlike C#, D then calls the "postblit constructor", known as this(this) in code, on the copy. This allows the copy to properly "finish" the copy operation.

I cannot add postblit constructors to C#, but I still think that the range concept is important enough to support. Therefore, I am adding range interfaces to my Loyc.Essentials library in the Loyc.Collections namespace. I recommend that if a range type is not safe to copy with '=' (such as the TreeRange example), it should be a class instead. Luckily, the .NET garbage collector is much faster than the one in D, so using a class does not usually cause significant performance problems.

Besides, D does not solve the copying problem completely:
  1. D has a class of ranges called "input ranges" that cannot be copied, and
  2. D ranges could be implemented as class objects, which are very nearly the same as C# class objects; the '=' operator does not copy classes
For this reason* D has a save() function (actually it is a property, curiously enough) on ranges that duplicates the range safely. (* I speak as if I know what I'm talking about, but please note that I have not written any real programs with D.)

For the C# version of ranges, the interfaces will inherit ICloneable<R> so you can copy ranges explicitly with Clone(); this is equivalent to the save() method in D.

Here are the proposed range interfaces: IFRange (forward range), IBRange (bidirectional range), and IRange (finite random-access range). Note that it is not really necessary to define an "input range"; the standard IEnumerator(T) interface is already equivalent to an input range. Perhaps I'll add an IIRange (infinite random-access range) later, and maybe an output range, but for now I have defined these three:
public interface IFRange<out T> : IEnumerable<T>, ICloneable<IFRange<T>>
{
  bool IsEmpty { get; }
  T First { get; }
  bool DropFirst();
  T PopFirst(out bool empty);
}
public interface IBRange<out T> : IFRange<T>, ICloneable<IBRange<T>>
{
  T Last { get; }
  bool DropLast();
  T PopLast(out bool empty);
}
public interface IRange<out T> : IBRange<T>, IListSource<T>, ICloneable<IRange<T>>
{
}
// IListSource<T> is a read-only list (not a range) and ISource<T>
// is simply IEnumerable plus Count. I am wondering whether the naming 
// is appropriate; I chose the name "Source" as opposed to "Sink": a 
// Source provides data, while Sink accepts data.
public interface IListSource<out T> : ISource<T>
{
  T this[int index] { get; }
  T TryGet(int index, ref bool fail);
  IRange<T> Slice(int start, int count);
}
And in addition there are three mutable variants, IMFRange (allows you to modify First), IMBRange (allows you to modify First and Last) and IMRange (allows you to modify any element).

Tentatively I have used the names "First" and "Last" instead of "Front" and "Back" because the collection classes I've already written contain "First" and "Last" properties. I'm debating whether to change this for consistency with D. D's "empty" becomes "IsEmpty" in .NET, for consistency with other properties such as "IsReadOnly".

"DropFirst()" and "DropLast()" could just as easily be called "PopFirst()" and "PopLast()" for consistency with D, but I would like to eventually have extension methods called "PopFirst()" and "PopLast()" which not only chop off the first element of the range, but return that first element at the same time. These extension methods do not currently exist, however, due to a limitation of C#. Since a range could be a struct type, the extension method must take the range by reference to guarantee that it will actually be modified! Unfortunately, "(this ref Range self)" is an illegal argument list in C#. Nevertheless, I hope Enhanced C# will eventually support this type of extension method.

Finally, I've added special Pop operations that D does not have, and a Slice() method from IListSource:
  T PopFirst(out bool empty);
  T PopLast(out bool empty);
  IRange<T> Slice(int start, int count);
Come to think of it, I don't know how D solves the problem of slicing random-access ranges. Probably something involving the "$" and ".." operators.

Anyway, the thing I wanted to highlight was that PopFirst contains the functionality of IsEmpty, First, and DropFirst() all in one single method. It saves the value of IsEmpty in the 'empty' parameter, gets the value of First, drops the first element, and returns the old value of First (if !empty). I have complained loudly in the past about the fact that IEnumerator(T) requires two interface calls just to yield a single element*, and D's range interface is 50% worse, requiring three method calls per iteration! In D, this is rarely a problem because D ranges are used directly rather than through interfaces, so the compiler can inline the calls to all three methods. But in .NET land, interfaces are used far more often, so the three calls are fairly expensive.

The question, though, is whether it's worth requiring both approaches. The "chatty" approach with three methods is more convenient for the average caller who doesn't care much about performance, but the single-call method is crucial for high performance and should not be eliminated. So what should I do? Keep the four methods, drop all but the last one, or have some kind of compromise?

If extension methods were compatible with ranges, a very reasonable compromise would be to keep just IsEmpty and PopFirst, but eliminate First and DropFirst. Then, two extension methods would perform PopFirst without requiring a cumbersome "out" argument:
public static T PopFirst<R, T>(this ref R range, T defaultValue) where R : IFRange<T>
{
  bool fail;
  T next = range.PopFirst(out fail);
  if (!fail)
    return next;
  else
    return defaultValue;
}
public static T PopFirst<R,T>(this ref R range) where R:IFRange<T>
{
  bool fail;
  T next = range.PopFirst(out fail);
  if (fail) throw new EmptySequenceException();
  return next;
}
But as I explained, these extension methods cannot currently exist--not as extension methods, anyway. You can, however, call the non-extension method that does exist, Range.PopFirst(ref r). It's better than nothing, anyway.

Another compromise would be to eliminate DropFirst(), but keep First in case you want to "peek" at the next value without removing it (note that requiring a First property may increase the complexity or run-time size of a range type; and technically it is possible to implement First() as an extension method, so it is not strictly necessary for it to be part of the interface, although it is potentially faster when it is built-in.)

Finally, I am wondering whether to actually use the term "random-access range" or whether to use the term "slice" instead. As far as I can tell, in D the term "slice" means "array range", i.e. a section of an array, not some other data type. But "slice" rolls off the tongue nicely, and I am tempted to use it for all random-access ranges. In that case, IRange(T) would be renamed ISlice(T). Opinions, anyone?

In any case, I am in the process of adding range support to my collections library, Loyc.Collections. Note that the collection interfaces will go in Loyc.Essentials, a library of small methods, simple data structures and core interfaces, while most of the actual collection implementations will go in Loyc.Collections. I will also have to review the existing interfaces to understand how they relate to ranges and whether one or more of them should be eliminated or combined with ranges somehow. Perhaps I'll post again when it's done.

Loyc.Collections contains some data structures I've published articles about already, such as the AList, VList and CPTrie, but also some interesting unpublished data structures such as the BList, BMultiMap, Set/MSet, Map/MMap, InvertibleSet, and DList. Let me know if you would like me to prioritize writing articles about these things (I also have a parser generator and a new programming language in the works, you know, plus I have to find the time to set up a web site for all of this).

P.S. An ominous footnote:

The mutable ranges do not include a Clone method due to a limitation of C#. C# does not support return-type covariance, which means that every time a derived interface supports cloning, the implementing class is required to write a separate clone method. Read-only ranges already have to implement up to three clone methods: ICloneable(IFRange(T)), ICloneable(IBRange(T)), ICloneable(IRange(T))--and that's in addition to the Clone method for the concrete type! If mutable ranges also supported cloning, they would add up to three more clone methods, which is really getting out of hand.

Even without mutable clone methods, implementing all the ICloneables can be a chore. But you can just copy and paste these, inserting an appropriate constructor call for your range type:
  IFRange<T>  ICloneable<IFRange<T>>.Clone() { return Clone(); }
  IBRange<T>  ICloneable<IBRange<T>>.Clone() { return Clone(); }
  IRange<T>   ICloneable<IRange<T>> .Clone() { return Clone(); }
  public MyType Clone() { return new MyType(...); }
When complete, EC# will, of course, take care of this crap for you.

Update: it was pointed out to me that although algorithms usually need ranges, there is other code that has some need for ordinary iterators. In .NET we typically use an index when we need an iterator, but this only works for random-access data structures, and it is not as type-safe as an iterator (although it has the virtue of storage efficiency). For non-random-access data structures, the solutions are either ad-hoc (e.g. LinkedListNode inside a LinkedList) or non-existent (SortedDictionary has neither an iterator nor a range type; an iterator would have been one solution for this question I had). Comments are welcome about how to deal with the need for iterators in .NET.

I suppose the problem with C++ iterators is that they are useless without external context: you can't increment or decrement one without also comparing it to begin() or end() in its container, which implies that the caller must manually keep track of which container it came from. Thus, an iterator is hardly an improvement over a plain-old integer index, the only advantages being
  1. You can dereference it without reference to its container (but this is nothing special; you can dereference an object reference too)
  2. Unlike an index, it's compatible with non-random-access data structures (and huge data structures, for which a 32-bit integer index does not suffice).
Perhaps the iterator concept could be improved by being made self-aware: if the iterator "knew" and could tell you when it was at the beginning or end of its container. This would increase the storage requirement in some circumstances but not others. For example, an iterator to a doubly-linked list node can tell when it is at the beginning or end, but an iterator to a singly-linked list node can only tell when it is at the end (or rather, when there are no more elements after the current one. If the iterator follows the 'null' pointer at the end of a linked list, it becomes null itself, and can no longer increment or decrement).

A pointer inside an array may or may not be able to tell if it is at the end depending on how arrays work. Perhaps the way D heap arrays work would allow an array iterator, implemented as a simple pointer, to still know when it is safe to increment and decrement. The simplest possible .NET array iterator is an array reference plus an index, and again the iterator can know when it is at the beginning or end of the array--except that if the iterator came from inside a range, it would be unaware of the range's boundaries, which may be smaller than the array's boundaries.

The question is, what does an appropriate iterator type look like? Could it follow an interface that would make it more useful than a C++ iterator, but still more efficient than a range?
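
For what it's worth, here is one possible shape, sketched in C#; every name here is hypothetical, not an existing Loyc or BCL interface:
 public interface IIterator<T>
 {
     T Current { get; }
     bool AtStart { get; } // true at the first element (if the structure can tell)
     bool AtEnd { get; }   // true when nothing follows Current
     bool MoveNext();      // returns false instead of running off the end
     bool MovePrev();      // cheap only for bidirectional structures
 }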

Comments on Reddit, because Redditors don't like to comment on the actual blog.

Tuesday, May 28, 2013

Onward

After much soul-searching... actually, not that much... I've decided to continue working on Loyc and EC# even though I discovered a really nice language in Nemerle. Personally, I wouldn't mind switching development to Nemerle, but I didn't want to throw away all the work I'd already done on LLLPG, and I still think that the goals of the Loyc project cover a lot of ground that Nemerle is not intended to cover, so continuing to work on Loyc is not simply a waste of time.

Nemerle has a great macro system, and I have a lot to learn about how it works. Whatever I learn, I will document it and apply that knowledge to EC#. Assuming I'm smart enough to decipher how it works, EC# will end up with a great macro system like Nemerle's.

Loyc development will continue in C#, since (A) there is so much C# code already; (B) I'm going to write an EC# compiler and it should be self-hosting, which is impossible if I introduce Nemerle code; and (C) I want to attract new developers, and requiring them to use a new programming language might drive them away.

I wrote a big long blog post about LLLPG and a new syntax I'm developing called LES, but just as I was about to post it I discovered a serious design flaw in LES. I'll see if I can iron that out before the post... by which time LLLPG will be almost done! I just got simple syntactic predicates working this morning, so all that is left is the parser which will finally allow you to easily give LLLPG an input grammar.

Saturday, May 25, 2013

Status update: Loyc tree rewritten

I have made and implemented the changes that I planned to make to the Loyc tree. The code was rewritten from scratch; the new code is not as fancy or complete, but it's easier to use.
  • Red nodes are gone (and GreenNode is now called LNode, for "Loyc Node")
  • Mutable nodes are gone, and the data type for Args and Attrs has changed to RVList<T> (Reverse V-List--since this tends to make people think of Recreational Vehicles, I'm thinking about changing the name to simply VList, with the forward VList still being named FVList). RVList provides a natural way to support the common pattern of gathering a list of statements and then freezing them into a read-only node; you can create a list with RWList<LNode>, mutate it however you like (similar performance to List<T>), and then when the list is complete, call ToRVList() to convert it to RVList<LNode> instantly (see the sketch after this list).
  • There are now three kinds of nodes: symbols (now called Ids, i.e. identifiers), literals, and calls. I'm considering whether to add a fourth kind for nodes that have both a name and a value, perhaps calling it the Token kind (StdTriviaNode already has both a Name and a Value, but it is currently a subclass of StdIdNode).
  • An empty Name is now allowed. A literal now has a blank name (instead of #literal), and a call whose target is anything other than a simple identifier will also have a blank Name. Note: the property will still never return null.
  • I decided to represent expressions in parentheses using a call with the empty target @``, that is, the empty identifier. But now I am considering whether to eliminate the representation of parentheses completely, because it's a significant inconvenience for node printers: if parentheses are represented explicitly, a printer cannot add parentheses to clarify precedence without changing the syntax tree. So far the EC# printer has managed to avoid adding parentheses in almost all cases, but there's still the issue of "tightly bound attributes": if the "f" part of "f(x)" has an attribute, it must be printed with the special "grouping parentheses", ##(), as in ##([attr] f)(x). It's clunky, but it works*. But now I'm designing another language (more on that in my next post) and I haven't figured out what to do about this yet. Perhaps I should just define an exception where "the parentheses don't count if there are attributes inside"; we'll see.
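
Here's a minimal sketch of that gather-then-freeze pattern (using RWList, ToRVList() and LNode as described above; ParseStatements() is a hypothetical source of statements):
 RWList<LNode> stmts = new RWList<LNode>();
 foreach (LNode stmt in ParseStatements()) // hypothetical statement source
     stmts.Add(stmt);
 RVList<LNode> frozen = stmts.ToRVList();  // O(1), per the "instantly" above
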
During the redesign I decided on some small changes to the representation of certain expressions in EC#:
  • The '.' operator is now treated more like a normal binary operator; a.b.c is now represented #.(#.(a, b), c) rather than #.(a, b, c), mainly because it's easier that way, and because the flat representation doesn't buy anything significant; it just creates a need for special-casing.
  • int x = 0 will now be represented #var(int, x = 0) rather than #var(int, x(0)). I chose the latter representation initially because it is slightly more convenient, because you can always learn the name of the declared variable by calling var.Args[1].Name. However, I decided that it was more important for the syntax tree to be predictable, with obvious connections between normal and prefix notations. Since I decided that alias X = Y; was to be represented #alias(X = Y, #()), it made sense for the syntax tree of a variable declaration to also resemble its C# syntax. There's another small reason: C++ has both styles Foo x(y) and Foo x = y; if Loyc were to ever support C++, it would make sense to use #var(Foo, x(y)) and #var(Foo, x = y) for these two cases, and I believe C#'s variable declarations are semantically closer to the latter. (Note: another possibility was #var(int, x) = 0, but I decided this wasn't an improvement, it would just shift the pain around.)
  • A constructor argument list is required on all types using the #new operator, e.g. new int[] { x } must have an empty set of arguments on int[], i.e. #new(#of(#[],int)(), x); this rule makes the different kinds of new expressions easier to interpret by making them consistent with each other.
  • A missing syntax element is now represented by an empty symbol instead of the symbol #missing.
  • I've decided to adopt the "in-expression" generics syntax from Nemerle as an unambiguous alternative to angle brackets: List.[int] means List<int> and the printer will use this syntax in cases where angle brackets are ambiguous.
  • By popular demand, constructors will be written this(...) instead of new(...), since both D and Nemerle declare constructors with this(...).
  • The \ and $ characters have been swapped; \S now denotes a symbol S, while $S now denotes a substitution. Originally EC# was designed just as an extension of C#, so \ made sense as a substitution operator for string interpolation because it doesn't hurt backward compatibility: "Loaded '\(filename)' successfully". But now that my focus has shifted to multi-language interoperability, $ makes more sense, as it is used for string interpolation in at least five other languages and it makes sense to use the same character for both string substitution and code substitution.
LLLPG is also improved; I have implemented two more major features:
  • Implemented support for terminals of unknown value. I observed that if the inputs are integers (or enums) LLLPG shouldn't have to know the actual integer value of each integer. The user should be able to use a symbolic name like "NUMBER" or "DQString" without actually telling LLLPG what the value is. So I eliminated the code-snippet generator for Symbols and replaced it with GeneralCodeGenHelper, which supports inputs of any data type: integers, enums, Symbols, strings, whatever. Note that switch() statements require constant cases, so switch() must be disabled for non-constant possibilities such as Symbols. A constructor argument controls whether GeneralCodeGenHelper is allowed to generate switch() statements (ideally individual terminal symbols could be marked constant or non-constant, but a global switch will suffice for now.)
  • Replaced redundant Match() calls with Skip(), i.e. if prediction learns that "LA(0) is in the set ('0'..'9')" then a test like MatchRange('0', '9', '.', '.') is redundant and can be replaced by Skip(). Also, I eliminated redundant Check() calls. These two features together are called "prematch analysis".
Only two major features are left and LLLPG 1.0 will be ready:
  • Syntactic predicate support (only semantic predicates work right now)
  • Input from source code, and a command-line interface. Currently grammars have to be constructed with overloaded C# operators and printed at run-time, which of course is clumsy.

Tuesday, May 7, 2013

Redesigning the Loyc tree code

The original (and current) design for the classes that represent Loyc trees has some problematic properties:
  1. You can't tell if a node is mutable or immutable without a runtime check.
  2. Green nodes are hard to work with because they use relative positioning.
  3. Red nodes (the Node class) are hard to work with because they have parents.
  4. The "f" part of "f(x)" is not (in general) an independent node--although it could be.
First of all, it is a hybrid mutable-immutable design: you can create trees that are mutable, and then freeze them to get immutable trees. This way, they support multiple programming styles depending on what you prefer. If you'd like to modify trees in-place, you can; if you like to work with immutable trees and construct them bottom-up, you can. In theory, it sounds useful and flexible.

However, the original design does not benefit from type safety. You cannot say "this method requires a mutable tree" or "this method requires an immutable tree"; when a method takes or returns a "Node", it's non-obvious what exactly that means.

I created an optionally-persistent hash-set recently that overcomes this problem by having two separate data types for mutable and immutable sets: Set<T> (immutable) and MSet<T> (mutable). Both are wrappers around a helper data type, InternalSet<T>, which contains the algorithms, and the user can convert a set from mutable to immutable (or vice versa) in O(1) time. I've been considering doing something comparable for Loyc nodes.
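
As a sketch of how that pairing is used (the conversion method names here are assumptions, not necessarily the actual Loyc.Collections API):
 MSet<string> mutable = new MSet<string> { "a", "b" };
 Set<string> frozen = mutable.AsImmutable(); // O(1): shares the InternalSet
 MSet<string> thawed = frozen.AsMutable();   // O(1) in the other direction
 thawed.Add("c");                            // does not disturb 'frozen'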

Parenting is another problem. My current implementation is inspired by Roslyn and has two kinds of nodes, green and red, the meaning of which is explained by Eric Lippert on his blog. The green nodes are cachable, so that for example the subtree for "Console.WriteLine" can be re-used to save memory if it appears in multiple places in a source file; they also use relative positioning to help support incremental parsing if I ever get around to implementing that. These factors allow Loyc trees to be very lightweight, but the latter fact makes it very hard to manipulate green trees without losing all information about positioning (or worse, causing some calculated positions to be incorrect). Therefore, end-users are generally supposed to use red nodes instead of green ones, leaving green nodes mainly as carefully-constructed products of the parser.

However, each red node has a parent, and this turns out to be very inconvenient when transforming syntax trees. Each node can only have a single parent, and if you insert a Node that has a parent as a child of another Node, you get an InvalidOperationException. The compiler cannot help detect the problem, and correcting the problem requires that you first "detach" the Node from its parent. To make matters worse, detaching is an O(N) operation (in general) because the array of Nodes in the parent has to be shifted left (after the spot where the detached Node used to be); plus, if the Node's parent is frozen, detaching causes a ReadOnlyException. Now, you don't actually have to detach; you can clone the node instead, but cloning all over the place could also hurt performance.

The way I have defined Loyc trees also makes me slightly uncomfortable. Recall that a node essentially has three parts:
  1. The attribute list (may be empty)
  2. The head (one of: a Symbol, a literal Value, or a Head node)
  3. The argument list (optional, and if present, may be empty)
(There are also cosmetic and diagnostic parts, such as the node's location in a source file, but these three parts are the "semantically important" parts.)

A couple of things make me uncomfortable about #2, the head portion. First of all, a call such as "f(x)" normally consists of two nodes: the call "f(x)" is one Node, and the symbol "x" is another Node. "f" is not a Node but merely a Symbol. The advantage of this representation is that it saves memory, since a separate Node does not have to be allocated for "f", we only need memory for the Symbol "f", and this memory is shared every time "f" appears in any source file. But I can certainly imagine that you might want to do something with "just the head part" of a node, ignoring the argument list and the attributes; for example you might want to print "just the head part" as a string, and there is no convenient way to do this right now.

A second problem is that there is a weird oddity in the code right now, because the Head part can be a Node rather than a Symbol, and my code currently treats the Node "f" as equivalent to the Symbol "f". This makes me uncomfortable because Loyc trees are not fully round-trippable like they are supposed to be; if you take the tree "f(x)" where "f" is a Node, print it as text and re-parse it, the output is "f(x)" where "f" is a Symbol--a slightly different tree that prints out the same way.

An alternative way to interpret this case is that if a Node serves as a Head and has no attributes or arguments, it should be interpreted as being in parenthesis. In that case, when "f" is a Node serving as a Head, it must have parenthesis around it: "(f)(x)". The reason I did not take this route is because allowing "f" to be a Node (without parenthesis) allows it to have its own positioning information. So in the case of code like "2 + 3" (which is "#+(2, 3)" as a Loyc tree in prefix notation), if "#+" is a separate node then it can have its own positioning information, indicating that the "+" sign is two characters to the right of its parent node "2 + 3". On the other hand, this positioning information is perhaps not needed for all method calls, which seems to be an argument in favor of allowing simple Symbols to be heads; and we cannot simply remove the "Symbol" part of a Node, for how would we represent a Node that is just a Symbol?

All of the above issues are "design smells", but I am not yet confident about how to eliminate the stink.

Here are my thoughts so far. First of all, I am convinced that immutable nodes are more convenient to work with and should be the default choice. Mutable nodes--assuming that I keep support for mutable nodes at all--should have their own data type, but I'm not sure if it should be a completely separate data type or a derived class. Certainly, it's good if code that analyzes immutable nodes can work with mutable nodes "for free", although some code will want to know "for sure" that no other code will modify a tree. The common interface between immutable and mutable nodes could be placed in an INodeReader interface, which allows mutable and immutable nodes to be completely separate, but code that operates on INodeReader instead of a base class would be slower, and implementing INodeReader is a burden on me because the interface needs its own special argument and attribute lists (it must return lists of INodeReader instead of the actual node type).

I have an idea that I think would allow nodes to be cached and relocated, without the inconvenience that currently makes green nodes problematic. If this idea works, the green nodes will become the preferred form in which to manipulate syntax trees, while red nodes would only exist as a way to track parents.

Dealing with parenting has been annoying and I want the problem to go away. I'm thinking about getting rid of parents--and red nodes--completely, at least temporarily while I finish off LLLPG, the parser generator. Perhaps the parentable nodes (red nodes) could be added later as a special kind of node, derived from the same base class as the green nodes; the red nodes may be simple wrappers, consisting of a reference to a green node plus a reference to a parent. All the methods would simply be forwarded to the green node.

Changing the subject now, I need to find a new way to look at Loyc nodes. Clearly, all Loyc nodes can have attributes and position/width/style information, so it's natural to have a common base class with common data. Apart from this common information, nodes represent the following forms:
  • Just an identifier: simple_symbol
  • Call an identifier: simple_symbol()
  • Just a literal: "literal"
  • Call a literal: "literal"() (weird, but syntactically legal)
  • Call a complex head: Console.WriteLine(), Foo(x)(y), Foo<x>(y)
  • Node in parenthesis: (x + 1). I'd like to represent parens explicitly in order to more faithfully represent the input, even though Loyc trees are not designed to support everything the way Roslyn and NRefactory can; for example, there's no obvious way to support round-tripping of syntax errors.
If there were algebraic data types in C#, I could represent the head and arguments using something like
data NodeData = Symbol
              | Literal object
              | InParens Node
              | Call Node List<Node>
In this representation, 'Node' is position information, plus attributes, plus a NodeData. A "Call" consists of the thing being called, plus a list of arguments. Or we could use the LISP representation of a call:
data LispNode = Symbol
              | Literal object
              | InParens Node
              | Call List<Node>
In the LISP style, the thing being called is the first item in the List (and in LISP, the List is a linked list rather than an array.) The LISP style definitely has advantages, but I'm not sure I'm comfortable mixing the call target with the call's arguments like this.

So what should I do in C#? I can certainly map the ADT to C# with something like:
public class Node {
    /* position information and attributes go here */
}
public class SymbolNode : Node {
    public Symbol Name { get; }
    ...
}
public class LiteralNode : Node {
    public object Value { get; }
    ...
}
public class ParensNode : Node {
    public Node Child { get; }
    ...
}
public class CallNode : Node {
    public Node Head { get; }
    public IList<Node> Args { get; }
    ...
}
The issues with this mapping are:
  • C# doesn't have features for deconstructing nodes easily; there's no 'match' statement
  • Lots of run-time conversions are required: you'd be downcasting constantly, harming performance. You could use visitors, though, avoiding some casts (see the sketch after this list). I expect casting to have a significant penalty because the classes above would not be sealed--in fact they'd probably be abstract, to allow support for red nodes, repositioning to account for editing in an IDE, etc.--and casting is rumored to cost more when the target type does not exactly match the actual type. I've never been a fan of the visitor pattern, by the way, but it's grown on me recently as I finally grok it. The visitor pattern is brittle with respect to changes in the class hierarchy, but that's not really a problem for Loyc trees, because fundamentally new kinds of nodes should never be needed.
  • When you add red nodes and other wrapper classes, the total number of classes could be quite high.
  • A simple call such as "f()" will require two nodes, whereas the current implementation only needs a single node. It occurs to me, however, that the child node "f" could be implemented as a temporary object, constructed on-demand and immediately discarded (calling the Name property rather than Target would avoid constructing the temporary object).
  • Currently, a mutable node can be converted in-place between any of the possible node types, e.g. you can simply write node.IsCall = true to add an empty argument list. That kind of thing is not possible when we use these separate classes.
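
As mentioned above, visitors could avoid most of the downcasting. A bare-bones sketch for the hypothetical hierarchy above:
 public abstract class NodeVisitor
 {
     public abstract void Visit(SymbolNode node);
     public abstract void Visit(LiteralNode node);
     public abstract void Visit(ParensNode node);
     public abstract void Visit(CallNode node);
 }
 // Node would declare:      public abstract void Accept(NodeVisitor v);
 // Each subclass overrides: public override void Accept(NodeVisitor v) { v.Visit(this); }
 // Inside the override, 'this' has a concrete compile-time type, so the
 // right Visit overload is chosen with no downcast.
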
I guess I should keep in mind that EC# will support deconstruction and pattern-matching when it is complete, and the current mutation abilities are not very important to keep. The performance implications still matter, however, and I always love to save memory.

Saturday, April 20, 2013

The Loyc tree and prefix notation in EC#

Update: the Loyc tree code has been rewritten and the original concept has been modified from the concept described in this post (for example, the red/green tree concept has been dropped for the sake of simplicity and other reasons).

As I designed the macro system for Enhanced C#, I thought a lot about what kind of syntax tree would be appropriate--not just for EC# but for tools that could process multiple languages. One of Loyc's goals is conversion between languages, and another goal is to support multiple language syntaxes (perhaps each with a macro system!) in a single project. A third goal is to assist IDE tools (in particular, to be compatible with incremental parsing). And, as always, I wanted a data structure that is fast and uses memory efficiently.

I considered using LISP-style lists, but they didn't fit into C# very well. In particular I didn't feel that they were the right way to model complex declarations such as
 [Serializable] struct X<T> : IX<T> where T : class, Z { ... }
Besides, LISP lists don't hold any metadata such as the original source file and line number of a piece of code. And how would I address the needs of IDEs--incremental parsing, for example? So I fiddled around with different ideas until I found something that felt right for C# and other languages descended from Algol. Eventually I decided to call my idea the "Loyc node" or "Loyc tree". The Loyc tree is really two independent ideas that I developed concurrently: one idea for the concept of the tree as viewed by an end-user (i.e. other developers), and another idea for how the implementation works.

The concept of a Loyc node involves three main parts: a "head", an "argument list" and "attributes". The implementation involves two parallel trees, "red" and "green", inspired by a similar idea in Microsoft Roslyn.

A Loyc tree is basically "XML for code"--but wait a minute, I hate XML. Maybe "JSON for code"? Well, just as all XML elements have the same structure, all Loyc nodes likewise have the same structure. Just as there is a clear and obvious mapping from XML to the code representation of XML (such as an XML DOM or XElement), the mapping from EC# to a Loyc tree is similarly transparent, when the EC# code is written in prefix notation.

In EC# prefix notation, the above struct "X<T>" looks like this:
 [Serializable]
 #struct(#of(X, [#where(#class, Z)] T),
         #(#of(IX, T)),
         #{}( ... ));
But wait, let's start with some simpler examples:
 x += percent * 0.01;       // normal EC#
 #+=(x, #*(percent, 0.01)); // prefix notation
 
 public int X;              // normal EC#
 [#public] #var(#int, X);   // prefix notation
 
 using System.Collections.Generic;              // normal
 #import(#.(#.(System, Collections), Generic)); // prefix
     // (#using is reserved for the using(...) statement)

 if (condition) { y(); }    // normal EC#
 @#if(condition, #{}(y())); // prefix notation
 
 Foo(x, y, z);              // normal EC#
 Foo(x, y, z);              // prefix notation
In the pairs above, the EC# parser produces the same syntax tree for the two lines in each pair. EC# accepts prefix notation anywhere that a normal statement or expression is allowed. Prefix notation is based on function-call notation, as you can clearly see in the last example.

Note: the "@" in "@#if" avoids confusion between if-statements and the preprocessor directive "#if". The preprocessor in EC# is obsolete and not necessary, but it still exists. I wanted to use the "#" character to indicate a "special name" in prefix notation, but somehow we must avoid conflicts with the preprocessor. The parser converts the identifier "@#if" to "#if" internally, just as "@foo" actually means "foo" in plain C#.

The Loyc syntax tree

In most compilers, the syntax tree is very strongly typed, with separate classes or data structures for, say, variable declarations, binary operators, method calls, method declarations, unary operators, and so forth. Loyc, however, only has a single data type, Node, for all nodes*. There are several reasons for this:
  • Simplicity. Many projects have thousands of lines of code dedicated to the AST (abstract syntax tree) data structure itself, because each kind of AST node has its own class. Simplicity means I write less code and users learn to use it faster.
  • Serializability. Loyc nodes can always be serialized to a plain text "prefix tree" and deserialized back to objects, even by programs that are not designed to handle the language that the tree represents**. This makes it easy to visualize syntax trees or exchange them between programs.
  • Extensibility. Loyc nodes can represent any programming language imaginable, and they are suitable for embedded DSLs (domain-specific languages). Since nodes do not enforce a particular structure, they can be used in different ways than originally envisioned. For example, most languages only have "+" as a binary operator, that is, with two arguments. If Loyc had a separate class for each AST, there would probably be a PlusOperator class derived from BinaryOperator, with a LeftChild and a RightChild. But since there is only one node class, a "+" operator with three arguments is easy; this is denoted by #+(a, b, c) in EC# source code. The EC# compiler won't understand it, but it might be meaningful to another compiler or to a macro.
* In fact, there are a family of node classes, but this is just an implementation detail.
** Currently, the only supported syntax for plain-text Loyc trees is EC# syntax, either normal EC# or prefix-tree notation.

EC# syntax trees are stored in a universal format that I call a "Loyc tree". All nodes in a Loyc tree consist of up to four parts:
  1. An attribute list (the Attrs property)
  2. A Value
  3. A Head or a Name (if a node has a Head, Name refers to Head.Name)
  4. An argument list (the Args property)
The EC# language does not allow (2) and (3) to appear together (specifically, a Value can only be represented in source code if the Name is "#literal"). Therefore, you can think of Value, Head and Name as a discriminated union known as "the head part of a node". There is no easy and efficient way to represent a discriminated union in .NET, so all five properties (Attrs, Value, Head, Name, Args) are present on all nodes.

As you've seen, almost any Loyc node can be expressed in EC# using either "prefix notation" or ordinary code. The basic syntax of prefix notation is
 [attributes] head(argument_list)
where the [attributes] and (argument_list) are both optional, and the head part could be a simple name. For example, the EC# statement
 [Foo] Console.WriteLine("Hello");
is a single Node object with three children: Foo, Console.WriteLine, and "Hello". Foo is an attribute, Console.WriteLine is a Head, and "Hello" is an argument. Each of these children is a Node too, but neither Foo nor "Hello" have children of their own. The Head, Console.WriteLine, is a Node named "#." with two arguments, Console and WriteLine. The above statement could be expressed equivalently as
 [Foo] #.(Console, WriteLine)("Hello");
This makes its structure explicit, but the infix dot notation is preferred. Finally, Console and WriteLine are nodes that only have a Name (no Attrs, no Args, no Head, no Value).

Conceptually, Loyc trees have either a Head node or a Name symbol but not both. Foo, Console, WriteLine, and #. are all node names, while Console.WriteLine is a head node. However, you can always ask a node what its Name is; if the node has a Head rather than a Name, Name returns Head.Name. Thus, #. is the Name of the entire statement.
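
To make that concrete, here is a hedged sketch of inspecting that statement through the properties just described (the exact API details are illustrative):
 void Inspect(Node stmt) // stmt represents: [Foo] Console.WriteLine("Hello");
 {
     Node attr = stmt.Attrs[0]; // the attribute Foo
     Node head = stmt.Head;     // Console.WriteLine, itself a #. node
     Node arg  = stmt.Args[0];  // the literal "Hello"
     var name  = stmt.Name;     // #. -- forwarded from Head.Name
 }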

Attributes can only appear at the beginning of an expression or statement. Use parentheses to clarify your intention if necessary, but please note that parentheses are represented explicitly in the syntax tree, not discarded by the parser. Parentheses cause a node to be inserted into the head of another node, so
 (x())
is a node with no arguments whose Head points to another node representing x(). Attributes have lower precedence than everything else, so
 [Attr] x = y;
associates the attribute Attr with the "#=" node, not with the "x" node.

Unlike C# attributes, EC# attributes can be any list of expressions, and do not imply any particular semantics. You can attach any expression as an attribute to any other statement or expression, e.g.
 [4 * y << z()]
 Console.WriteLine("What is this attribute I see before me?");
When the time comes to generate code, the compiler will warn you that it does not understand what the hell "4 * y << z()" is supposed to mean, but otherwise this statement is legal. Attributes serve as an information side-channel, used for instructions to macros or to the compiler. Macros can use attributes to receive information from users, to store information in a syntax tree temporarily, or to communicate with other macros.

You can mix prefix notation with "normal" EC# in various ways. For example, all of the following versions of "MulDiv" are equivalent:
 int MulDiv(int a, int b, int c) { 
     return (int)((long)a * b / c); 
 }
 int MulDiv(int a, int b, int c) { 
     #return((int)(#cast(a, long) * b / c)); 
 }
 #def(#int, MulDiv, #(int a, int b, int c), { 
     return (int)((long)a * b / c); 
 });
 #def(#int, MulDiv, #(#var(#int, a), #var(#int, b), #var(#int, c)), 
     #{}(#return(
         #cast((#/(#*(#cast(a, #long), b), c)), #int)
     )) 
 );

Statement/expression equivalence

An important feature of EC# is that the difference between "statements" and "expressions" is only syntactic, not semantic, i.e. the difference is only skin-deep. Any EC# expression can be a statement, and any statement can be written in the form of an expression. The EC# parser needs to know whether to expect expressions or statements in a given location, but once a syntax tree is built, the distinction between statements and expressions disappears. Here is an example that mixes statements and expressions:
 if (foo.Bar > 0 && #{
     var stats = GatherStatistics(foo);
     stats.Avg > foo.Bar && stats.StdDev < 1.0
 })
 {
     Frobnicate(foo, stats);
 }
Roughly, "#{" causes the parser to switch to statement mode, allowing you to write statements where an expression is expected. Of course, plain C# doesn't work this way. Therefore, the conversion to plain C# can sometimes be messy and difficult.

The EC# parser also does not distinguish between "executable places" and "places for declarations". For example, this code can be parsed:
    Console.WriteLine("This statement is illegal!");
    using System.Collections.Generic;

    void foo() {
        WriteLine("This statement makes more sense!");

        int Ecks {
            int x = 0;
            get { x }
            set { x = value; }
            namespace WTF {
                Console.WriteLine("This is pretty weird!");
            }
        }
    }
Although the parser understands this code, the EC# compiler (as a whole) would reject it, because code cannot run outside of methods (except macros), and namespaces cannot be located inside methods or properties.

Note: EC# is not usable yet. This post exists because I want to publish the idea of Loyc trees. I'll update this when the parser is ready. But once the parser is ready, you will still not be able to write code in EC#. Making a programming language by myself is a long process!

Thursday, April 18, 2013

Loyc and Nemerle

I created this blog to be about Loyc, the Language of Your Choice project, which is about making general tools for creating and extending programming languages, converting code between languages, and creating tools for IDEs. However, quite simply, Loyc is too big a project for one person. I couldn't find the time for it, and I couldn't quite figure out how to make an extensible programming system.

Last August I decided to reboot with something much more modest: to design and implement a series of improvements to C#. That way I hoped to gain enough experience to do something bigger, later. To increase my chance of success, I went part-time at work, leaving more time for Loyc.

I was going to start by modifying NRefactory, and my "Enhanced C#" would simply compile down to plain C#. I couldn't figure out how to accomplish my end goal--a fully extensible language--so I decided to simply improve the existing C# language with a series of specific and modest features such as the "null dot" or safe navigation operator (now denoted ??.), the quick-binding operator (::), forwarding clauses, return value covariance, C++-style templates, "static if", blah blah blah.

However, once I was done drafting the new language, I noticed that despite all the work I'd put into merely the draft alone, the new language still didn't address one of the most requested features from C# users: "Provide a way for INotifyPropertyChanged to be implemented for you automatically on classes".

INotifyPropertyChanged is a simple interface for allowing code to subscribe to an object to find out when it changes. For example:
interface INotifyPropertyChanged {
 event PropertyChangedEventHandler PropertyChanged;
}
class Person : INotifyPropertyChanged {
 public event PropertyChangedEventHandler PropertyChanged;
 
 string _name;
 public string Name
 {
  get { return _name; }
  set {
   if (_name != value) {
    _name = value;
    Changed("Name");
   }
  }
 }
 string _address;
 public string Address
 {
  ... same thing ...
 }
 DateTime _dateOfBirth;
 public DateTime DateOfBirth
 {
  ... same thing ...
 }
 void Changed(string prop)
 {
  if (PropertyChanged != null)
   // The .NET "EventArgs" concept is stupid, but I digress
   PropertyChanged(this, new PropertyChangedEventArgs(prop));
 }
}
The problem is that you need 10 new lines of code for every new property in your class. Couldn't the language have some kind of feature to make this easier? But I didn't want to have a feature that was specifically designed for INotifyPropertyChanged and nothing else. Besides, there are different ways that users might want the feature to work: maybe some people want the event to fire even if the value is not changing. Maybe some people want to do a reference comparison while others want to do object.Equals(). Maybe it can't be fully automatic--some properties may need some additional processing besides just firing the event. How could a feature built into the language be flexible enough for all scenarios?

I decided at that point that it was time to learn LISP. After studying it briefly, I figured out that what would really help in this case is macros. Macros would allow users to provide a code "template" that is expanded for each property. To make a medium-length story short, I came up with an interchange format for syntax trees called "Loyc trees", and I dropped my EC# 1.0 draft--which was never made--to create EC# 2.0; many of the features of EC# 1.0 could be accomplished with macros anyhow. But since I wasn't satisfied with any of the readily-available parser generators, I also decided to make my own LL(k) parser generator. The initial version is about 75% complete right now (no, no, just the parser generator, sorry); the main missing features are syntactic "and" predicates (an optional feature that provides unlimited lookahead), and the ability for the parser generator to parse its own input (I'm using C# operator overloading for now.)

So the other day I finally started writing the parser for EC# 2.0, which would also be the parser generator's native language... when I rediscovered Nemerle.

I'd looked at Nemerle a few years back, but it didn't seem very impressive at the time; it just appeared to be a C#-like language with some type inference, algebraic data types, and pattern matching (and my memory is bad but I think it mainly targeted Linux (Mono)). Those are all good things, but my interest has always been in extensibility. I never knew until now that in fact Nemerle already has LISP-style macros (well, not quite LISP-style: a Nemerle macro cannot exist in the same project as the code to which it is applied; but that's okay because EC# wasn't going to support such an advanced feature either (not at first, although I want to support it eventually).)

I don't know whether Nemerle has always been extensible, and I simply didn't realize it, or whether macros are a new feature. In either case, this discovery will certainly impact my development of Loyc and EC#. In principle, Nemerle is so similar to EC# that I should perhaps even halt development of EC# and just use Nemerle.

My first impression is that Nemerle's macros seem to lack the simplicity of my design, but they are clearly more powerful. So far, I have only figured out how to support syntactic macros in EC#. I haven't figured out how to allow a macro to access semantic knowledge of a program--to look up types and members, or to introspect the "meaning" of some code and then to modify that code. Nemerle seems to have multiple "flavors" of macros that can accomplish different things, which is very exciting. Another neat fact is that both Nemerle's designers and I decided to use what I call a "tree parser"--the lexer output is grouped by parentheses, square brackets and braces, before it is sent to the main parser. This makes it easier to support extensibility (although EC#'s parser will not be extensible), and makes it easy for the parser to look beyond anything in parentheses.

So where do I go from here? I'm not sure yet. I'll see what I can learn about the Nemerle compiler and macro system, before I can decide on a course of action. I am particularly interested in the base language--the language before any macros are added--but I have found only one page on that topic, and the beginning of that article is hard to understand. Oh well. (Argh! They keep saying that basic features of the language such as the "if" expression are defined by macros, but since Nemerle has no "goto" statement, I can't imagine how the "if" expression could decompose into anything simpler.) I also found this "extended course" in macros, and other information.

Wikipedia's Nemerle page claims that "the core developers of Nemerle were hired by the Czech software development company JetBrains." However, the core developers appeared to be Michał Moskal and Kamil Skalski, and Google seemed to indicate that those two worked for Microsoft and Google respectively, not JetBrains. So I contacted Michał, who told me that neither of them were working on Nemerle! He said there "were a few people, mostly from Russia, still working on it." But who exactly is left? I have no idea.

The webpage is in disarray. The first time I tried to download Nemerle, I found the download link on the front page of the wiki, but this points to downloads that are 2 years old, making me wonder if Nemerle was dead (I found the correct download link later). I wanted to talk to Nemerle's developers, but the link to the "Forum" on Nemerle's front page was broken, so next I found this contact page, but the links to the mailing list archives do not work, and the subscription link requires login credentials. Finally I found this "news" page, which hasn't been updated since 2004! Finally, Michał pointed me to Nemerle's Google Group, but clicking "New Topic" caused the error "An error occurred while communicating with the server." Yikes!

Finally, I was able to ask a couple of questions about Nemerle--but so far, the only response is from someone that cannot understand my English. It turns out that the first language of Nemerle is Russian! Could it be that I won't be able to work with the Nemerle people due to a language barrier?

In conclusion, we should all learn Esperanto or something.
Por finiĝi, ni ĉiuj devus lerni Esperanton aŭ ion.

(I used to be very interested in the language boo, but the boo people wouldn't talk to me, and in general did not document boo's advanced features at all. I don't know if the Nemerle people will talk to me, but at least Nemerle offers far more articles about its features than boo ever did, even if you only count the English articles (note, it's been a couple of years now since I checked on boo).)

Tuesday, October 30, 2012

Lisp, here's why the mainstream ignores you.

While I currently don't know how you write a "hello world" in LISP, I am keenly aware that the LISP language, with its powerful macro system, could be very instructive on the topic of metaprogramming. However, it is proving difficult to quickly learn LISP because the community uses terminology, technology and notations that are far outside the mainstream.

The most obvious thing is the parenthesis. While a prefix notation such as
    (eqv? (* (+ 2 3) x) 20)
    meaning (2 + 3) * x == 20
gives LISP its power as a programming language, it's also a huge turnoff. I can't figure out why they haven't standardized on a syntactic sugar such as sweet expressions that maps in a predictable way back to S-expressions. Even sweet expressions look a lot different than popular languages, but it's clearly an improvement.

I went looking for information on the macro system and the first search result for "LISP macros" is some university lecture notes. It suggests "macroexpand-1" as a debugging tool so I Googled that next, and came upon this page from Common Lisp the Language, 2nd Edition, which states:
A form is considered to be a macro call only if it is a cons whose car is a symbol that names a macro.
Ahh... say what? Well, I guess I can handle learning what a "cons" is. (Wikipedia implies that a list like (1 2 3) is shorthand for the singly-linked list (cons 1 (cons 2 (cons 3 nil))), which is a bit surprising, since everything in LISP is made out of lists and the performance and memory consumption of singly-linked lists is known to be rather bad on modern 64-bit machines--but I digress...)
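
For the benefit of fellow C# programmers: a cons cell is just a singly-linked list node. A rough sketch, with field names chosen to match the LISP terms:
 class Cons
 {
     public object Car; // the first item ("contents of the address part")
     public Cons Cdr;   // the rest of the list; null plays the role of nil
     public Cons(object car, Cons cdr) { Car = car; Cdr = cdr; }
 }
 // (1 2 3) is then (cons 1 (cons 2 (cons 3 nil))):
 var list = new Cons(1, new Cons(2, new Cons(3, null)));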

Anyway it's slightly amazing that the terms "car" and "cdr" are still in use:
Lisp was originally implemented on the IBM 704 computer, in the late 1950s. The 704 hardware had special support for splitting a 36-bit machine word into four parts, an "address part" and "decrement part" of 15 bits each and a "prefix part" and "tag part" of three bits each.
...
  • car (short for "Contents of the Address part of Register number"),
  • cdr ("Contents of the Decrement part of Register number")
Considering that LISP was a potentially revolutionary idea, I can forgive it for using these terms... in 1959. But to be using them in the 21st century? It's a boneheaded way to alienate potential users.

I'm not totally new to car and cdr, I've seen it before, but I've had to look up car and cdr about three times now because I keep forgetting which one is which. I think I'll just remember that "a" is the first letter of the alphabet so "car" fetches the first item in a list, while "d"... well, "d" is a different letter of the alphabet. The Wikipedia article even seems to think these names are advantageous:
However, car and cdr have the advantage that short compositions of the functions can be given short and more or less pronounceable names of the same form. In Lisp, (cadr '(1 2 3)) is the equivalent of (car (cdr '(1 2 3))); its value is 2 (the first item of the rest of (1 2 3)). Similarly, (caar '((1 2) (3 4))) is the same as (car (car '((1 2) (3 4)))); its value is 1.
Riiiiight. So you could use "cadr" to mean "first of the rest" and you could use "caar" for "first of the first". Or, why not use something logical like "second" and "first.first". Or if brevity is really that important, "r.f" and "f.f" - but it's probably not that important.

So then I figured I'd download LISP and play around with it. I knew Common Lisp was a ... well, a common LISP, so I Googled that. The first four results were sites about Common Lisp without downloads; the third of them was the book Practical Common Lisp, which suggested Lisp in a Box, linklessly. Searching for Lisp in a Box finds "LispBox" instead, although it might be the same thing: the book says Lisp in a Box uses emacs as its editor, while Lispbox's page says the same thing.

But I don't want to use emacs! I used that bastard a little in university a bit but ugh, it's just an ugly text window with a weird interface that is different from every editor that isn't emacs. Almost every other editor is standardized on Ctrl+S for Save, Ctrl+F for Find, Tab for indenting text, Ctrl+C for copy, etc. But not ugly old emacs, oh no. (Yeah, okay, macs prefer the ⌘ key... if I had a mac I'd probably have to remap it to Ctrl.)

They even have a different notation for keyboard shortcuts than everybody else: "C-h t" where the rest of the world would probably say "Ctrl+H, T" (except that the rest of the world wouldn't use multiple-key shortcuts for anything important.) Even worse, referring to the Alt key as "M-"? It's crazy, like they're waiting for the 80s LISP machines to come back.

Of course, if you're used to emacs then it must be the rest of the world that seems weird. But there's an easy solution: on first startup, emacs could offer a choice of "Traditional emacs keyboard interface" and "Normal keyboard interface (recommended for new users)", the latter being based on Windows and Mac programs. To decide how to assign other keys, they could copy bindings from the most popular IDEs, Visual Studio and Eclipse.

The Lispbox downloads are oddly headlined "Lispbox test builds", but whatever. So I download the Windows one--90 MB compressed? Son of a bitch! That's what happens when you choose emacs as your editor--okay, okay, it's only half of the total size, and maybe I'm being unfair since some IDEs are bigger, but wasn't LISP supposed to be lightweight?

Thankfully, emacs offers a "C-x/C-c/C-v Cut and Paste" option, but "Delete" still doesn't delete selected text, "Ctrl+F" doesn't work and "Ctrl+Z" minimizes the damn window, and maybe worst of all, "Tab" either doesn't do anything or indents by a seemingly random amount, depending on the context (I'll figure that out later).

It's getting late, maybe I'll finish this later... if anyone can name a non-emacs LISP environment, I'd like to know.
----

The key problem I'm facing right now in my exploratory design of macros for EC# is the question of how to order macro expansions with respect to each other and with respect to substitution operations (which, for all I know, should themselves be implemented with macros) and what the rules for name resolution should be. However, reading about LISP has provided little insight so far, because (1) I would have to understand LISP fairly well in order to understand the documentation about macros, or extremely well in order to understand any technical specifications; and (2) in tutorial-level material, the ordering of operations, the name resolution rules, and the compilation model are implicit and assumed to be of secondary importance. Quite reasonably they assume you want to use LISP, whereas I just want to shamelessly copy the ideas behind LISP and the lessons that were learned in the course of developing modern LISPs.

Perhaps this will help me. Since much of LISP is defined with macros, I had also been wondering which parts of LISP were built-in and which were macros. The key to finding this out seems to be the term "special operator" aka "special form", which is like a keyword or built-in statement in other languages. So far I haven't found a complete list of them though.

Wednesday, September 12, 2012

Opus, the codec to end all codecs

Opus is a new all-purpose low-latency audio codec with better quality than MP3, and better quality than Vorbis at low bitrates. Plus it's an open standard, open source, and patent-free! Not only that, it operates over bitrates from 6 kbps (for low-quality speech) up to 510 kbps. Thus it is suitable to replace virtually every audio codec in the world!* As a bonus, it's an official IETF standard and it has been chosen as a mandatory-to-implement (MTI) codec for WebRTC, an upcoming standard for real-time communication on the web.

Well, I had a look at the RFC. The technical details are largely impenetrable… for example the introduction to Range Coding should really say something about what fl[k], fh[k], and ft represent (and what’s [k], an array subscript?), and they really should not have waited until section 4.1.3.3 to introduce the concept of a Probability Density Function, since it appears to be a fundamental idea...

...Anyway, Opus is basically two codecs in one: a linear prediction (LP) codec for speech (based on a standalone codec called ‘SILK’) and a MDCT-based codec for music (based on a standalone codec called ‘CELT’). An Opus encoder can instantly and seamlessly switch between the two codecs at any time according to the bitrate and the dominant character of the input, and there is a hybrid mode where SILK is used for low frequencies and CELT for high frequencies.

CELT is built on the same psychoacoustic principles as other modern codecs (e.g. Vorbis), but adds innovations to permit low latency without hurting overall audio quality. SILK is designed for low bandwidth and low-latency speech (though the principles behind it remain totally opaque to me), while CELT is better for high bandwidth audio (music or speech or anything else) but can still operate at very low bitrates when the sound is not speech or speech-like and therefore unsuitable for SILK.

The operating modes of Opus are designed so that the two codecs "fit together" nicely, and it can switch between the two on 10ms boundaries (the sampling rate can similarly change among 8/12/16/24/48 kHz, ditto for mono/stereo). Opus also offers other real-time and VOIP features, such as CBR and forward error correction and other tricks to compensate for packet loss.

Some final niceties of Opus: the encoder and decoder can operate at independent sampling rates (or one can be mono and the other stereo), and the (continuously variable) bitrate is independent of the operating mode (e.g. want stereo at 48 kHz at 8 kbps? It’s probably a bad idea, but it’s not outright prohibited). So basically Opus covers virtually all scenarios. The codec to end all codecs, indeed. (correct me if I’m wrong about any of this, dear experts).

Now can I just get a codec that achieves very low bitrates by observing long-term audio patterns and making backward references to similar audio many seconds in the past? kthxbye :) Codec2 (presentation) is really fascinating too. It was developed for radio applications, and is not really useful for VOIP because Codec2 packets are much smaller than UDP headers, i.e. the overhead of the Internet itself outweighs the space savings of Codec2. But what I find interesting about it is the potential for cheap archiving. For instance, with Codec2 you could record your entire life: at 2400 bits/second it requires just 25.9 MB (non-MiB) per day and 9.46 GB per year or under 1 TB for a lifetime (or much less, assuming you eliminate regions of silence and reduce the bitrate in nearly-silent situations.) Coupled with voice recognition and a search index, you'd finally be able to resolve any arguments with your wife about who said what last week. I bet some sociologists and linguists would find interesting uses for such a corpus...
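
(Checking the arithmetic: 2400 bits/s is 300 bytes/s, and 300 × 86,400 seconds/day = 25,920,000 bytes ≈ 25.9 MB/day; 25.9 MB × 365 ≈ 9.46 GB/year.)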

Tuesday, July 17, 2012

Constructors Considered Harmful

Ahh, constructors, constructors. In my opinion the constructor design in most languages including C++, C#, D and Java is flawed, because it exposes an implementation detail that should not be exposed, namely, which class gets allocated and when.

I offer you as "exhibit L" the Lazy<T> class in the .NET framework. This class's main purpose is to compute a value the first time you access it. For example:
 int x;
 Lazy<int> lazy = new Lazy<int>(() => 7 * x);
 x = 3;
 x = lazy.Value; // 21
 x = lazy.Value; // still 21 (Value is initialized only once)
There is an annoying issue, though: Lazy<T> operates in some different modes, and it also contains a member and extra code to aid debugging. So in addition to holding the value itself and a reference to the initializer delegate, it's got a couple of other member variables and the Value property has to check a couple of things before it returns the value. Thus, Lazy<T> is less efficient than it should be.

So what if, in the future, MS decided to optimize Lazy<T> for its default mode of operation, and factor out the other modes into derived class(es)? Well, they can't, ever. If MS wants to return a LazyThreadSafe<T> object (derived from Lazy<T>) when the user requests thread-safe mode, it's impossible, because all the clients are saying "new Lazy", which can only return the exact class Lazy<T> and nothing else.

MS could add a static function, Lazy.New(...), but it's too late now that the interface is already defined with all those public constructors.
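
To illustrate what a factory-based design could have looked like (a hypothetical sketch--LazyThreadSafe<T> does not exist in the BCL):
 public static class Lazy
 {
     public static Lazy<T> New<T>(Func<T> valueFactory, bool threadSafe)
     {
         // A factory is free to pick the cheapest class for the job:
         return threadSafe ? new LazyThreadSafe<T>(valueFactory)
                           : new Lazy<T>(valueFactory);
     }
 }
 public class LazyThreadSafe<T> : Lazy<T> // hypothetical subclass
 {
     public LazyThreadSafe(Func<T> valueFactory) : base(valueFactory) { }
 }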

As "exhibit 2", a.k.a. "exhibit Foo", I recently wrote a library where I needed to provide a constructor that does a bunch of initialization work (that may fail) before it actually creates the object. By far the most natural implementation was a static member function:
 // Constructor with dependency injection
 public Foo(A arg1, B arg2, C arg3)
 {
     // (initialize fields from arg1, arg2, arg3)
 }
 // Static helper method provides an easier way to create Foo
 public static Foo LoadFrom(string filename, ...)
 {
     ... // do some work
     A arg1 = ...; B arg2 = ...; C arg3 = ...;
     return new Foo(arg1, arg2, arg3);
 }
However, the client didn't like that, and insisted that LoadFrom() should be a constructor "for consistency" with the other constructors. I was able to rearrange things to make LoadFrom() into a constructor, but my code was a little clunkier that way.

So what I'm saying is, when clients directly allocate memory themselves with new(), it constrains the class implementation to work a certain way, and this constraint is unnecessary.

A final problem with constructors is that they can't have names.

For D I suggested a simple solution, which in principle could be added to any language that uses "new". The solution is to treat constructors as equivalent to static methods named "new". So when the user writes "new Foo(...)", this call might go to a real constructor, or it could go to a static method named "new" that returns a Foo. Likewise, if the user writes "Foo.new(...)", this call could go to a static method or to an actual constructor.

This tweak would allow the class Foo to change the implementation of a constructor to return an object derived from Foo, instead of Foo itself. It would also allow Foo to do some work before actually creating a Foo object. Finally, this proposal doesn't allow named constructors per se, but it does provide a similar syntax for named and unnamed creation methods, because the syntax to call an unnamed constructor, "Foo.new(...)" is similar to the syntax to call a named creation method, "Foo.newFromFile(...)".
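
Here's a sketch of what the proposal would enable, written with an ordinary static method since "new" is not a legal method name in today's C# (BigFoo is a hypothetical subclass):
 public class Foo
 {
     protected Foo(int size) { /* ... */ }

     // Under the proposal, "new Foo(n)" could bind here, so Foo is free
     // to do work first or to substitute a derived type:
     public static Foo New(int size)
     {
         return size > 1000 ? new BigFoo(size) : new Foo(size);
     }
 }
 public class BigFoo : Foo
 {
     public BigFoo(int size) : base(size) { }
 }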

I don't expect D or any other language to accept my proposal, but it's a good idea so I'm blogging it anyway. In .NET, this feature would be more practical if .NET compilers were to automatically create a static "new" method for every real constructor, and to prefer to emit calls to the static method (if present) instead of the real constructor. This rule would provide for binary compatibility, which improves upon the source-level compatibility provided by my syntax proposal.