Thursday, June 13, 2013

LLLPG, Loyc trees, and LES, oh my!

Background: LLLPG —

I normally don't let these internal "planning" posts go to CodeProject, but my planning has reached an advanced stage and I think it would be fun to hear some feedback on my new parser generator and my plan for a new programming language called LES or LEL (depending on what you're doing with it). The parser generator is called LLLPG (Loyc LL(k) Parser Generator). "Loyc" means "Language of Your Choice" and it is my project for creating a set of generic tools for creating programming languages and for translating code between different languages* ; meanwhile, LL(k) is one of the three most popular types of parsers for computer languages (the others being PEG and LALR(1)... perhaps ANTLR's LL(*) is more popular than LL(k) nowadays, but I'm counting LL(*) as a mere variation on LL(k).)

LLLPG is a recursive-decent parser generator, with a feature set (and syntax) comparable to ANTLR version 2. LLLPG 1.0 is about 80% complete. All that's missing is a parser that can accept grammars in the form of text, and a bit of polishing, but the last 15% is the most important--no one will want to use LLLPG without a nice friendly input language, and right now LLLPG cannot parse its own input.

And I won't be using a dedicated input language like ANTLR does; no, LLLPG is designed to be embedded inside another programming language. Initially you will use LLLPG the same way you would use ANTLR, by invoking an exe as a pre-build step at compile-time, to produce C# code which is then compiled. But eventually, LLLPG will merely be a syntax extension to another language, one of a hundred extensions that you might have installed. You will add LLLPG as a compile-time reference, the same way you might add "System.Core" as a run-time reference today. This means that using a parser generator will be no more difficult than using any other library. When I'm done, you'll be able to easily embed parsers anywhere you want them; creating a parser that is generated and optimized at compile-time will be no more difficult than making a "compiled Regex" is today. But of course, a "compiled Regex" is actually compiled at runtime, while LLLPG will work entirely at compile-time*.

LLLPG has a very different design philosophy than ANTLR. ANTLR is a "kitchen sink" parser generator; it tries to do everything for you, it requires its own runtime library, and if you really want to be good at ANTLR, you'll buy the book that explains how to use it. I can't compete with ANTLR on features; Terrance Parr has been writing ANTLR and its predecessor for almost 20 years now, I think. Instead, LLLPG is designed for simplicity and speed, and its features tend to be orthogonal rather than intertwined; it is fairly easy to learn, it tends to generate fast parsers, the generated code is simple, easy-to-follow, and C#-specific, and it does not need a special runtime library, just a single base class that you can write yourself or copy into your project. Here is its feature set:
  • It generates a prediction-based LL(k) parser with no backtracking. The value of k is the maximum lookahead; it can be customized separately for each rule.
  • It warns you about ambiguity, and you can easily suppress irrelevant warnings (earlier alternatives take priority). Like ANTLR, LLLPG warns by example ("Alternatives (2, 4) are ambiguous for input such as «0x_»"). In case of ambiguity with an exit branch, you can mark loops as greedy or nongreedy to suppress the warning.
  • It supports semantic predicates, denoted &{expr}, which are used for disambiguation. If two branches are ambiguous, you can use a semantic predicate to put a condition on one or both branches that will be used to disambiguate; expr is simply a C# expression.
  • Syntactic predicates, denoted &(foo), are very similar, except that instead of running a C# expression, LLLPG tests whether the input matches the syntax "foo". Syntactic predicates, also known as zero-width assersions, can perform unlimited lookahead, so you can use them to escape the limitations of LL(k) grammars.
  • A gate, denoted p => m, is an advanced (but very simple) mechanism that separates "prediction" from "matching".
  • Performance optimizations include "prematch analysis", switch() statement generation, saving the lookahead terminal(s) in variables, and not using exceptions anywhere.
  • Alts (alternatives, i.e. branch points) have two error-handling modes for unexpected input. LLLPG can either (1) simply run the "default branch", or (2) produce an error-handling branch (automatically or manually). The default branch is normally the last branch, but this can be overridden (either to assist error handling or as a performance optimization.)
  • Actions, denoted {code;}, are code snippets inserted into the parser.
  • The value of a terminal can be assigned to a variable using x:='a'..'z' to create a variable 'x', x='a'..'z' to assign the value to an existing variable 'x', or xs+='a'..'z' to add the value to a list 'xs'.
When LLLPG is formally released, I will repeat all of this in a CodeProject article, or a whole series of them, with more detail. For now, the above information is just background for the real subject of this post.

Bootstrapping LLLPG

When you're making a parser generator that accepts text input, you naturally want to express the parser generator's own syntax in its own language, and if possible, you'd like to use your own parser generator to parse that language. But without any parser, how do you begin? The process of making a parser that will be able to "parse itself" is called bootstrapping.

I started out by taking advantage of C# operator overloading. I am currently writing parsers using the following C# code:
Final syntax C# syntax
'x' C('x') -- or Sym("foo") for the symbol \foo
'a'..'z' Set("[a-z]")
'[' OtherRule ']' C('[') + OtherRule + ']'
a | b | c a | b | c
a / b / c a / b / c -- but a b / c d needs parens: (a + b) / (c + d)
Foo* Star(Foo)
Foo+ Plus(Foo)
Foo? Opt(Foo)
&{i==0} And("i==0") -- using a special #rawText node to hold "i==0"
{Foo(bar);} Stmt("Foo(bar)")
a => b Gate(a, b)
Note that my unit tests have to use the C# syntax, since the tests have to be written before the big grammar that describes LLLPG's final input language.

My original plan for bootstrapping was to make a parser for EC# in C# with LLLPG (EC# is Enhanced C#, and one of its features is a "token literal" that allows embedded DSLs), or at least a parser for just the parts of EC# I needed for bootstrapping. Once this partial parser was written using the above mappings, I could then write a second parser in "partial EC#", and this second parser would be able to parse full EC# as well as itself.

However, as I planned out the parser, I was rather annoyed at how complex the EC# language is (because of its progenitor C#, of course) and I am now planning to create a new syntax that is easier to parse, tentatively called LES, "Loyc Expression Syntax"*.

The syntax tree interchange format

A major part of my plans for Loyc is the concept of an "interchange format" for code. In three words, the concept is "XML for code"--a general representation of syntax trees for any language. The interchange format will
  • Assist tools that convert code between different languages, by providing a way to change a tree incrementally from one language until it conforms to the rules of another language.
  • If a compiler uses Loyc trees internally, it can assist people writing compilers and compiler extensions by providing a simple textual representation of the compiler's intermediate output at various stages.
  • Promote compile-time metaprogramming to the entire software industry.
Let me be clear, I do not advocate XML as an interchange format. In fact, I hate XML! If I have to type < one more time I could scream. Rather, I want to create a standard language for describing syntax trees, just as XML or JSON is used to describe data. (In fact, my language should be useful for describing data too, just like s-expressions can describe data, although that is not its main purpose).

The interchange format will serialize and deserialize text into an in-memory form called a Loyc tree. Every node in a Loyc tree is one of three things:
  • An identifier, such as the name of a variable, type, method or keyword.
  • A literal, such as an integer, double, string or character.
  • A "call", which represents either a method call, or a construct like a "class" or a "for loop".
Unlike in most programming languages, Loyc identifiers can be any string--any string at all. Even identifiers like \n\0 (a linefeed and a null character) are supported. This design guarantees that a Loyc tree can represent an identifier from any programming language on Earth. Literals, similarly, can be any object whatsoever, but when I say that I am referring to a Loyc tree that exists in memory. When a Loyc tree is serialized to text, obviously it will be limited to certain kinds of literals (depending on the language used for serialization).

Each Loyc node also has a list of "attributes" (usually empty), and each attribute is itself a Loyc tree.

Loyc trees are closely related to, and inspired by, LISP trees. If you've heard of LISP, well, the Loyc tree is basically a 21st century version of the S-expression. The main differences between s-expressions and Loyc trees are
  1. Each "call" has a "target". Whereas LISP represents a method call with (method arg1 arg2), Loyc represents a method call with method(arg1, arg2). In LISP, the method name is simply the first item in a list; but most other programming languages separate the "target" from the argument list, hence target(arg1, arg2). If you want to create a simple list rather than a method call, the identifier "#" (tentatively) means "list" by convention, as in #(item1, item2, item3).
  2. Each node has a list of attributes. The concept of attributes was inspired by .NET attributes, so it will not surprise you to learn that a .NET attribute will be represented in a Loyc tree by a Loyc attribute. Also, modifiers like "public", "private", "override" and "static" will be represented by attaching attributes to a definition.
  3. A "call", which represents either a method call, or a construct like a "class" or a "for loop". By convention, constructs that are built into a language use a special identifier that starts with "#", such as #class or #for or #public.
  4. Each node has an associated source file and two integers that identify the range of characters that the original construct occupied in source code. If a Loyc tree is created programmatically, a dummy source file and a zero-length range will be used.
Obviously, a text format is needed for Loyc trees. However, I think I can do better than just an interchange format, so I have a plan to make LES into both an interchange format and a general-purpose programming language in its own right.

Since LES can represent syntax from any language, I thought it was sensible to design it with no keywords. So tentatively, LES has no reserved words whatsoever, and is made almost entirely of "expressions". But "expressions" support a type of syntactic sugar called "superexpressions", which resemble keyword-based statements in several other languages.

I've made a somewhat radical decision to make LES partially whitespace-sensitive. I do not make this decision lightly, because I generally prefer whitespace to be ignored inside expressions*, but the whitespace sensitivity is the key to making it keyword-free. More on that later.

My original plan was to use a subset of Enhanced C# as my "XML for code". However, since EC# is based on C# it inherits some very strange syntax elements. Consider the fact that (a<b>.c)(x) is classified a "cast" while (a<b>+c)(x) is classified as a method call (As I am writing this in HTML, the < made me scream just now. Sorry for the noise.) Features like that create unnecessary complication that should not exist in an AST interchange format.

Meet LES: tentative design

As I describe the design for Loyc Expression Syntax, keep in mind that this is all tentative and I could be persuaded to design it differently. I have the following goals:
  • Concision: LES is not XML! It should not be weighed down by its own syntax.
  • Familiarity: LES should resemble popular languages.
  • Utility: although LES can represent syntax trees from any language, its syntax should be powerful enough to be the basis for a general-purpose language.
  • Light on parenthesis: I'm sick of typing lots of parenthesis in C-family languages. LES is my chance to decrease the amount of Shift-9ing and Shift-0ing that programmers do every day.
  • Simplicity: the parser should be as simple as possible while still meeting the above goals, because (1) this will be the default parser and printer for Loyc trees and will be bundled with the Loyc tree code, so it should not be very large, and (2) this is the bootstrapping language for LLLPG so (for my own sake) it can't be too complex.
First let me describe a simple subset of LES, the "prefix notation". Prefix notation is nothing more than expressing everything as if it was made out of method calls. Basically it's LISP, but with Loyc trees instead of S-expressions.

For example, C# code like
   if (x > 0) {
      Console.WriteLine("{0} widgets were completed.", x);
      return x;
   } else
      throw new NoWidgetsException();
might be represented in prefix notation as
   #if(@#>(x, 0), @`#{}`(
      @#.(Console, WriteLine)("{0} widgets were completed.", x),
      #return(x)
   ),
      #throw(#new(NoWidgetsException()))
   );
The prefix "@" introduces an identifier that contains special characters, and then the backquotes are added for identifiers with very special characters. So in @`#{}`, the actual identifier name is #{}, which is the built-in "function" that represents a braced block. Similarly @#> and @#. represent the identifiers "#>" and "#.". Please note that '#' is not considered a special character, nor is it an operator; in LES, # is just an identifier character, just as an underscore _ is an identifier character in the C family. However, # is special by convention, because it is used to mark identifiers that represent keywords and operators in a programming language.

Pure prefix notation is a bit clunky, so it will not be used very much. LES will have built-in operators so that the above code can be represented in a more friendly way:
   #if x > 0 {
      Console.WriteLine("{0} widgets were completed.", x);
      #return x;
   }
      #throw #new(NoWidgetsException());
However, this notation might be confusing when you first see it. "Why is there no #else?" you might wonder. "How do I even begin to parse that," you might say.

Well, here's my plan:
  • Braces {...} are simply subexpressions that will be treated as calls to the special identifier #{}. Inside braces, each argument to the #{} pseudo-function will end with a semicolon (the semicolon on the final statement is optional).
  • To make an identifier that contains punctuation, use the @ prefix. Such identifiers can contain both punctuation and identifier characters, e.g. @you+me=win!. Note that when I talk about "punctuation", the characters , ; { } ( ) [ ] ` " # do not count as punctuation (# counts as an identifier character). If you need extra-special characters such as these, enclose the identifier in backquotes: @`{\n}`. Escape sequences are allowed within the backquotes, including \` for the backquote itself. Identifiers can contain apostrophes without using @, e.g. That'sJustGreat (there are several languages that use apostrophes in identifiers). However, an apostrophe cannot be the first character or it will be interpreted as a character literal (e.g. 'A' is the character 'A').
  • There will be an unlimited number of binary operators such as +, -, !@$*, >>, and so on. An operator such as &|^ will correspond to a call to #&|^, e.g. x != 0 is simply a friendly representation of @#!=(x, 0). Simplicity being an important design element of LES, the precedence of each operator will be determined automatically based on the name of the operator and a fixed set of rules. For example, the operator ~= will have a very low precedence based on a rule that says "an operator whose name ends in '=' will have the same precedence as the assignment operator '='". This design allows LES to be "context-free"; an LES parser is virtually stateless and does not need a "symbol table" of any kind. A person reading LES only needs to know the precedence rules in order to figure out the precedence of a new operator that he or she has never seen before.
  • Operators whose names do not start with # can be formed with the \ prefix. Such operators can contain alphanumerics and punctuation, e.g. \div and \++ are binary or prefix operators; a \div b is equivalent to div(a, b) and a \++ b is equivalent to @++(a, b). Again, if you need extra-special characters, enclose the operator in backquotes: \`/*\n*/`.
  • LES will have "super-expressions", which are multiple expressions juxtaposed together with no separator except a space. If a "super-expression" contains more than one expression, the first expression contains a call target and the other expressions are arguments. For example, a.b c + d e; parses as three separate expressions: a.b, c + d, and e. After gathering this expression list, the parser treats the final "tight" expression in the first expression as "calling" the others (explanation below). Thus, together, the three expressions will be shorthand for a.b(c + d, e);.*
I realize that "superexpressions" might be confusing at first, but I think this design allows a rather elegant way to write "expressions" as if they were "statements". For example, the following is a valid expression:
if a == null {
  b();
} else {
  c();
};
and it actually means
if(a == null, { b(); }, else, { c(); });
which, in turn, is shorthand for
if(@#==(a, null), @`#{}`( b(); ), else, @`#{}`( c(); ));
All three inputs are parsed into the same Loyc tree.

Please note that none of this has any "built-in" meaning to LES. LES doesn't care if there is an "else" or not, it doesn't care if you use "#if" or "if", LES doesn't give any built-in semantics to anything. LES is merely a syntax--a text representation of a Loyc tree.

So the answer to questions like "why is there no #else" or "why did you use '#if' in one place, but 'if' in another place" is, first of all, that LES doesn't define any rules. Programming languages define these kinds of rules, but LES is not a programming language. It is just a textual representation of a data structure.

It is easy enough for the parser to parse a series of expressions. Basically, whenever you see two identifiers side-by-side (or other "atoms" such as numbers or braced blocks), like foo bar, that's the boundary between two expressions in a superexpression. When the parser sees a superexpression, it must decide how to join the "tail" expressions to the "head" expression. For example, this superexpression:
a = b.c {x = y} z;
Consists of the expressions "a = b.c", "{x = y}" and "z". How will these expressions be combined? My plan is basically that the other expressions are added to the first expression as though you had written this instead:
a = b.c({x = y}, z);
So it's as if the first space character that separates the first and second expressions is replaced with '(', a ')' is added at the end, and commas are added between the other expressions. "b.c" is what I am calling a "tight expression" because the operators in the expression (e.g. ".") are tightly bound (high precedence), and the expression is normally written without space characters. Confused yet?

The point of a "superexpression" is that you can write statements that look as if they were built into the language, even though they are interpreted by a fixed-function parser that has no idea what they mean. For example, in LES you can write
try {
   file.Write("something");
} catch(e::Exception) {
   MessageBox.Show("EPIC I/O FAIL");
} finally {
   file.Dispose();
};
And the parser handles this without any clue what "try", "catch" and "finally" mean.

Superexpressions also allow you to embed statement-like constructs inside expressions, like this:
x = (y * if z > 0 z else 0) + 1;
and the LES parser will understand that "z > 0", "z", "else" and "0" are arguments in a call to "if".

A syntax highlighting engine for LES should highlight the "tight expression" to indicate how the superexpression is interpreted. This syntax highlighting rule would highlight words like "if" and "try" when they are used this way (but not "else" or "finally", which are merely arguments to "if" and "try"; the parser cannot know that they have any special meaning.)

It's not quite as simple as I implied, though. If the first expression already ends with a method call:
a = b.c(foo) {x = y} z;
I think it would be better to interpret that as
a = b.c(foo, {x = y}, z);
rather than
a = b.c(foo)({x = y}, z);
A motivation for this exception is explained later.

Only one superexpression can exist per subexpression. For example, an expression with two sets of parenthesis, ... (...) ... (...) ... can contain at most three superexpressions: one at the outer level and one in each set of parenthesis.

You cannot "nest superexpressions" without parenthesis. for example, an expression of the form if x if y a else b else c can be parsed as if(x, if, y, a, else, b, else, c), but it won't mean anything because the second "if" is merely one of 8 arguments to the first "if"; there is nothing to indicate that it is special.

Did you notice how this plan is space-sensitive?

In LES you normally separate expressions with a space. Operators--that is, sequences of punctuation--are assumed to be binary if possible and unary if not. So "x + y" and "- z" are understood to be single expressions with binary and unary operators, respectively. When you put two of these expressions side-by-side, such as "- z x + y", it is seen as two separate expressions. If you write "x + y - z", on the other hand, this is understood as a single expression because "-" is assumed to be a binary operator in the second case. "x + y (- z)" is parsed as two expressions again. The whitespace between operators and arguments doesn't matter much; "x+y (-z)" is also acceptable and means the same thing. But the space between the expressions is important; "x+y (-z)" is two separate expressions, while "x+y(-z)" is a single expression meaning "x + (y(-z))"; here, y is the target of a method call.

I know it's not ideal. But in my experience, most programmers write
 if (...) {...}
 for (...) {...}
 lock (...) {...}
 while (...) {...}
 switch (...) {...}
with a space between the keyword and the opening paren, while they write
 Console.WriteLine(...);
 var re = new Regex(...);
 F(x);
with no space between the identifier and the method being called. So in a language that has no real keywords, it seems reasonable, although not ideal, to use a spacing rule to identify whether the first word of an expression is to be treated as a "keyword" or not.

So the spacing rule exists to identify "keyword statements" in a language that has no keywords. LEL can parse anything that looks like a keyword statement:
 if ... {...};
 for ... {...};
 lock ... {...};
 while ... {...};
 switch ... {...};
 unless ... {...};
 match ... {...};
But notice that a semicolon is needed at the end of each statement to separate it from the following statement. No parenthesis are required around the "subject" (e.g. the "c" in "if c"), but if parenthesis are present, the space character will ensure that the statement is still understood correctly. Statements like the above are automatically translated into an ordinary call,
 if(..., {...});
 for(..., {...});
 lock(..., {...});
 while(..., {...});
 switch(..., {...});
 unless(..., {...});
 match(..., {...});
There is another way that LES is whitespace-sensitive. Since LES contains an infinite number of operators, two adjacent operators must have spaces between them. For example, you cannot write x * -1 as x*-1 because *- will be seen as a single operator named #*-.

Attributes are expressed in LES with the square brackets:
[Flags] enum Emotion { 
   Happy = 1; Sad = 2; Angry = 4; Horndog = 65536;
};
There are some more details I could mention about LES, but you get the basic idea.

Meet LEL: the exact same language

LEL ("Loyc Expression Language") will be a programming language built on top of LES ("Loyc Expression Syntax"). It will use exactly the same parser as LES; in fact we can say that it "uses" LES, because LES consists of a parser and nothing else. LEL will then give meaning to the syntax. For example, in LES the word "if" is just an identifier, it has no built-in meaning. But in LEL, "if" will be defined as a standard macro that creates a standard "#if" expression, which will be part of some sort of common language that EC# and LEL will both use internally.

So in LEL,
if a > 0 {
  b();
} else {
  c();
};
means just what you would expect: call b() if a>0, otherwise call b(). However, LEL will not be a typical programming language. "if" will not be a built-in statement like it is in C#, for example. Instead, LEL will be a macro-heavy language, and "if" will be a standard macro. A macro, if you didn't know, is a special kind of method that runs at compile-time. Of course, LEL macros will be unrelated to C/C++ macros, which give the word "macro" a bad name. Like C/C++ macros, it is quite possible to abuse LEL macros; but unlike C/C++ macros, LEL macros will be fantastically powerful.

I don't have the details worked out about the macro system yet; I'll be sure to read more about Nemerle for inspiration.

The semantics of LEL will be based on EC#, and at first it will compile down to plain C# so it will not be able to surpass the runtime capabilities of C#, although it will still allow crazy-named identifiers by mapping punctuation to letters; for example @Save&Close will be represented in C# as something like Save_and_Close.

Things like classes and using statements are not hard to support with LES superexpressions, and I'm thinking about using :: for variable declarations:
using System.Collections.Generic;
[pub] class Point(ICloneable.[Point])
{
   [pub] X::int; // pub for public, see?
   [pub] Y::int;
};
I'm not an abbreviation freak, but [public] requires brackets around it, which is unfortunate, so I'll compensate for this extra syntactic noise by letting you abbreviate the attributes: public to pub, private to priv, static to stat, virtual to virt, and so on. Also, as in C#, you can separate attributes with a comma or use multiple bracket groups: [pub,stat] and [pub][stat] are equivalent.

To allow :: for variable declarations, I'm giving :: a precedence below dot (.) even though :: has the same precedence as dot in C# and C++. This allows s::System.String to parse as s::(System.String). (You might wonder why I don't just use a single colon : for this; well, for now I'll keep it as my little secret.)

".[...]" is the syntax for generic type arguments; ICloneable.[Point] has the interpretation #of(ICloneable, Point) which means ICloneable<Point> in C#. I could consider using the D syntax instead (ICloneable!Point), it's not set in stone. However, the C++/Java/C# syntax ICloneable<Point> is out of the question. It would be hard to parse and it makes the "superexpression" concept highly ambiguous.

It may seem possible to support C-style variable declarations like
String x;
But it really isn't; String x just means String(x)--it's an ordinary function call.

The method declaration syntax is a bit of a head-scratcher. I'm thinking about using this syntax:
[pub] Square(x::int)::int { x * x };
which is parsed as
[pub] (Square(x::int)::int)({ x * x });
[pub] @#::(Square(x::int), int)({ x * x }); // equivalent
[pub] @#::(Square(@#::(x, int)), int)(@`#{}`(x * x)); // equivalent
I'm just not sure how to write the specification for LEL in such a way that the compiler can make sense out of this peculiar representation. I'd also, just this once, like to lift the restriction against spaces between the method name and its arguments, so that you can write
@pub Square (x::int)::int { x * x };
but this form has the rather different syntax tree
@pub Square((x::int)::int, { x * x });
And this would be essentially indistinguishable from a call to a macro called "Square", so I'm a bit stumped. Hmm...

Let's talk about this weird spacing rule again, because you might wonder what happens if the spacing is incorrect. Obviously, the mistakes will be (1) to forget a space before '(', and (2) to use a space before '(' where there should not be one.

First consider the following example:
if(c) { a(); } else { b(); };
This will actually still work most of the time, because if there is already an argument list on the end of the first expression, the other expressions are added to that existing argument list (as I mentioned earlier). So this example is interpreted the same way as the correct version,
if (c) { a(); } else { b(); };
which means
if((c), { a(); }, else, { b(); });
But there is another version of this mistake that looks like this:
if(x+y) > z { a(); } else { b(); };
And this can be parsed as
if(x+y) > z({ a(); }, else, { b(); });
The 'if' macro itself could produce a somewhat understandable error message like "'if' expects 2 or 3 arguments but received only one. It looks like you forgot a space after 'if'." This would be followed by a complaint about there not being any method 'z' that takes 3 arguments.

The second mistake, adding a space when there should not be one, has different effects depending on the situation. In very simple cases like this one,
foo (x);
The interpretation is foo((x)); so it works. But in other cases, such as
x = foo (x) + 1;
a = foo (b, c);
The interpretations are different:
x = foo((x) + 1);
a = foo((b, c));
In the second case, (b, c) is parsed as a tuple (I haven't decided on the representation for tuples in LES yet), that is passed as a single parameter to foo().

Unfortunately, the code can still be parsed, but changes meaning. The primary defense against both of these mistakes is syntax highlighting: foo will show up in a different color when it is followed by a space than when it is not. I only hope that this distinction will be enough.

Assuming that foo() is a method, I suppose that LEL could prefer the "superexpression" syntax for macros, and issue a warning if a superexpression is used to call an ordinary method (the parser records the fact that "superexpression" syntax was used in the NodeStyle).

My current thinking is that the standard macros of LEL will essentially convert LEL into "built-in" constructs which are then compiled, and these constructs will be identical or very similar to EC# constructs, so that LEL and EC# can share the same back-end. With this design, LEL can easily be converted to C#, since I am planning to write an EC#-to-C# compiler. So the "if" macro, for example, will take input such as
if (a) b else c;
and produce output such as
#if (a) b c;
or in other words,
#if((a), b, c);

And now back to LLLPG

Remember, besides designing two new programming languages (LEL and EC#) I'm also writing a parser generator that will be used to describe both of these languages. After writing an LES parser, I will use it to describe the full grammar of EC#.

Please note that I already wrote a complete node printer for the EC# language, and the parser generator uses this to generate plain C# code (EC# is a superset of C#, so the parser generator simply generates a Loyc tree representing C# code, and prints it out using the EC# node printer.)

Right now, I can define a grammar in plain C# like this:
Rule Int = Rule("Int", Plus(Set("[0-9]")), Token);
// represents the grammar '0'..'9'+, which in LES might be written
//    Int @[ '0'..'9'+ ];
// Here, @[...] is a "token literal". It is tokenized by LES, but
// otherwise uninterpreted. This allows the LLLPG macro that is 
// parsing the grammar to interpret it instead.
and then I can generate C# code like this:
LLLPG.AddRule(Int);
LNode result = LLLPG.GenerateCode();
string CSharpCode = result.Print();
Where "LNode" is a "Loyc node", the data type of a Loyc tree. Here is the generated code:
{
  private void Int()
  {
    int la0;
    MatchRange('0', '9');
    for (;;) {
      la0 = LA0;
      if (la0 >= '0' && la0 <= '9')
        Skip();
      else
        break;
    }
  }
}
My end goal is to let you write a grammar in EC# syntax, which will look something like this:
using System;
using System.Collections.Generic;

[[LLLPG(IntStream)]]
class MyLexer {
  private Number ==> @[ 
    &('0'..'9'|'.')
    '0'..'9'* ('.' '0'..'9'+)?
  ];
  private Id      ==> @[ IdStart IdCont* ];
  private IdStart ==> @[ 'a'..'z'|'A'..'Z'|'_' ];
  private IdCont  ==> @[ IdStart|'0'..'9' ];
  private Spaces  ==> @[ ' '|'\t'|'\n'|'\r' ];
  public  Token   ==> @[ Id | Spaces | Number ];

  public void NormalMethod()
  {
    NormalCSharpCode();
  }
}
"==>" is the new EC# "forwarding operator". Never mind what that means, because in this case I have hijacked it for expressing LLLPG rules. Again, @[...] is a token literal that LLLPG will parse. "IntStream" refers to a helper class for code generation; it means that the input is a series of integers (actually characters, plus -1 to represent EOF).

So how do I get from here to there? Here's my plan for the bootstrapping process.
  1. First, I'll write a grammar for LES in C#, using LLLPG for code generation.
  2. Next, I'll write a skeleton version of LEL that can run simple non-hygienic macros (hygienic macros I can add later, when more infrastructure exists).
  3. One of the macros will gather input for LLLPG. Keep in mind that at this point there is no parser for the token literals--no parser that can understand input like "0x" ('0'..'9'|'a'..'f')+. So instead I will use LES as an intermediate language. LES supports suffix operators that start with "\\", so I could express a grammar with a syntax like this:
    LLLPG' IntStream (
      [pub] class MyLexer() {
        [priv] rule Number
          &('0'::'9'|'.') +
          '0'::'9'\\* + ('.' + '0'::'9'\\+)\\?;
        [priv] rule Id      IdStart IdCont\\*;
        [priv] rule IdStart 'a'::'z'|'A'::'Z'|'_';
        [priv] rule IdCont  IdStart|'0'::'9';
        [priv] rule Spaces  ' '|'\t'|'\n'|'\r';
        [pub]  rule Token   Id | Spaces | Number;
      }
    });
    
    Here I have changed '..' to '::' because '::' has higher precedence than '|' and '+' ('..' has lower precedence than both of them). The backslashes "\\" inform the LES parser that the "\\*" and "\\?" operators are suffix operators, not binary operators. This syntax isn't entirely comfortable, but it's better than the C# representation.
  4. The LLLPG macro will take the Loyc tree that came from the LES parser and translate it into a series of "Rule" objects. These objects are the "real" input to LLLPG's core engine. This conversion process will be written in plain C#.
  5. Next, I'll use this syntax to write a parser for the token literals. This parser will simply translate the token literals into the form you see above, so
    LLLPG IntStream (
      [pub] class MyLexer() {
        [priv] Number @[ 
          &('0'..'9'|'.') 
          '0'..'9'* ('.' '0'..'9'+)?
        ];
        [priv] Id      @[ IdStart IdCont* ];
        [priv] IdStart @[ 'a'..'z'|'A'..'Z'|'_' ];
        [priv] IdCont  @[ IdStart|'0'..'9' ];
        [priv] Spaces  @[ ' '|'\t'|'\n'|'\r' ];
        [pub]  Token   @[ Id | Spaces | Number ];
      }
    });
    
    Is translated to the form above. Thus, translating from LES to LLLPG Rules is a two-stage process with two separate interpreters. The reason for doing it this way is to allow third-party macros to run in between the two stages. My current thinking is that LEL could contain "high priority" and "low priority" lexical macros. First the high-priority macros run, then the low-priority ones run on a separate pass.

    Notice that I used "LLLPG'" in the first code listing but "LLLPG" in the second listing. LLLPG will be the high-priority macro and LLLPG' the low-priority second stage. This allows other lexical macros to process the code after the first stage, which will help end-users to do fancy things like generating grammars at compile-time.

    To tell you the truth, I have never seriously programmed in a language with a LISP-style macro system, so my approach may be flawed. I guess we'll find out.
  6. With both LLLPG fully implemented in LES, I can finally write a parser for the entire EC# language, in LES. (and remember, the LES code is translated to C# for compilation.) Once that's done, LLLPG can also support EC# as a host language with minimal changes. Bootstrapping complete!
I hereby open the floor to your comments and questions.

< Published on >

No comments: