Tuesday, December 17, 2013

Bogus ambiguity warnings in LLLPG

Sometimes LLLPG gives me a warning that I just can't fathom. I'm currently writing the parser for Enhanced C#, and I got a very puzzling warning:
using TT = TokenType;

partial class EcsParser {
[DefaultK(3)]
LLLPG parser(laType(TokenType), matchType(int), setType(HashSet!int))
{
   alias("(" = TT.LParen);
   alias(")" = TT.RParen);
   alias("[" = TT.LBrack);
   alias("]" = TT.RBrack);
   alias("." = TT.Dot);
   alias("!" = TT.Not);
   alias("<" = TT.LT);
   alias(">" = TT.GT);
   alias("," = TT.Comma);

   // Complex identifier, e.g. accepts x.y or x but rejects x
   token ComplexId() @[
      IdAtom 
      RestOfId()
   ];
   rule IdAtom @[
      (TT.Id|TT.TypeKeyword)
   ];
   rule RestOfId() @[
      // Warning: Optional branch is ambiguous for input such as «TT.Dot TT.Id _»
      // ((TT.Dot|TT.Not|TT.LT), (TT.Id|TT.TypeKeyword|TT.LParen|TT.LBrack|TT.GT), ~()) 
      TParams()?
      DotRestOfId()?
   ];
   rule DotRestOfId() @[
      "." IdAtom
      RestOfId()
   ];
   rule TParams() @[ // type parameters
      ( "<" (ComplexId ("," ComplexId)*)? ">"
      // Also allow Nemerle-style type parameters (e.g. Dictionary.[int,string])
      | "." "[" "]"
      // D-style type parameters (Dictionary!(int, string))
      //| "!" "(" ")"
      )
   ];
}};
I actually started with a 600-line parser and slowly reduced it to this minimal test case. In fact the grammar is unambiguous, so why does this warning occur? (and by the way, LLLPG's example case varies depending on small details of the grammar; at first it always said «TT.LT TT.GT _» but would give some other example if the aliases were removed.) If you can't figure it out, I don't blame you; I actually attached a debugger to see LLLPG's internal state at the time of the warning, and I couldn't figure it out either.

The generated code is wrong for RestOfId(), but correct for the other rules. How is it wrong?

Notice that the code for RestOfId(), as written, accepts ".>" or "<[" as a way of starting TParams(); this isn't really why the rule is wrong though, because ".>" or "<[" are invalid inputs, so it isn't super important which path is taken for those cases. This illustrates what LLLPG is "thinking" though: it has "mixed together" the two arms of TParams, "<" (ComplexId ("," ComplexId)*)? ">" and "." "[" "]". If it sees "<" (from arm 1 of TParams) and then "[" (from arm 2 of TParams), it will call TParams.
void RestOfId()
{
   TokenType la0, la1;
   la0 = LA0;
   if (la0 == TT.Dot || la0 == TT.LT) {
      switch (LA(1)) {
      case TT.LBrack:
      case TT.GT:
      case TT.Id:
      case TT.TypeKeyword:
         TParams();
         break;
      }
   }
   la0 = LA0;
   if (la0 == TT.Dot) {
      la1 = LA(1);
      if (la1 == TT.Id || la1 == TT.TypeKeyword)
         DotRestOfId();
   }
}
The real problem happens when it sees "." (from arm 2 of TParams) followed by TT.Id (from arm 1 of TParams). In that case RestOfId() should skip TParams() and call DotRestOfId instead. But no, it calls TParams.

This explains why the warning occurs. If, inside RestOfId, LLLPG does not separate the logic for the two arms of TParams, LLLPG gets confused and thinks that "." followed by an identifier is a valid input for TParams as well as DotRestOfId. Therefore it can't decide which rule to call, prints a warning, and (because it is greedy by default) calls TParams.

This is exactly the same sort of problem that you used to get in ANTLR 2 with ANTLR's linear approximate lookahead. Although Terrance Parr (author of ANTLR) clearly documented how it works and what it means, users (even me!) find its behavior unexpected and confusing.

By default, LLLPG's lookahead is more powerful than linear approximate lookahead; the example given by Terrance, (A B | C D) | A D, is no problem for LLLPG. I could go into some detail about how my lookahead differs from ANTLR 2, but I think most readers just want to know "how do I fix this" and the answer is simple: add the [FullLLk] attribute, either just to the rule that is giving you trouble, or the entire grammar (the "LLLPG" statement):
partial class EcsParser {
[DefaultK(3)]
LLLPG parser(laType(TokenType), matchType(int), setType(HashSet!int))
{
   ...
   [FullLLk] // eliminates the bogus warning, and fixes the generated code
   rule RestOfId() @[
      TParams()?
      DotRestOfId()?
   ];
   ...
}};
The FullLLk option enables "true" LL(k) prediction, and forces LLLPG to analyze sub-decisions independently. So the decision of whether to call TParams requires not one, but two separate sequences of tests, for the two branches inside TParams:
void RestOfId()
{
   TokenType la0, la1, la2;
   la0 = LA0;
   if (la0 == TT.LT) { // branch 1 of TParams
      la1 = LA(1);
      if (la1 == TT.Id || la1 == TT.TypeKeyword) {
         switch (LA(2)) {
         case TT.GT:
         case TT.Comma:
         case TT.Dot:
         case TT.LT:
            TParams();
            break;
         }
      } else if (la1 == TT.GT)
         TParams();
   } else if (la0 == TT.Dot) { // branch 2 of TParams
      la1 = LA(1);
      if (la1 == TT.LBrack) {
         la2 = LA(2);
         if (la2 == TT.RBrack)
            TParams();
      }
   }
   la0 = LA0;
   if (la0 == TT.Dot) {
      la1 = LA(1);
      if (la1 == TT.Id || la1 == TT.TypeKeyword)
         DotRestOfId();
   }
}
Full LL(k) mode doesn't always work perfectly, and may make LLLPG run slower, which is why it is not enabled by default. But usually, it works fine. If you still get the same ambiguity warning after enabling Full LL(k), check over your grammar carefully, because it is probably genuinely ambiguous.

Update: By the way, I later noticed that ComplexId, RestOfId and DotRestOfId can be combined into a single rule:
[FullLLk]
token ComplexId() @[
    IdAtom
    ( "::" IdAtom )? // omitted from simplified grammar
    TParams()?
    (   "." IdAtom
        TParams()?
    )*
];

Saturday, November 23, 2013

The Loyc LL(k) Parser Generator: part 2

This article is Published on CodeProject.

This article assumes you've read part 1. LLLPG still requires you to write your code in LES rather than C#, but LES is a little friendlier now that I've written a basic syntax-highlighting extension for Visual Studio 2013. Plus, I wrote a single-file generator, so it's more convenient to use LLLPG in your Visual Studio projects. More on that later.

LLLPG-0.91.zip contains LLLPG 0.91, the single-file generator (run LLLPG\LllpgForVisualStudio.exe to install) and the "Calculator" sample project from the first article. The sample project is unchanged, except that it uses the single-file generator to generate code. The sample project can be opened in VS2010 (any edition), and the single-file generator supports any Visual Studio from 2008 to 2013, including Express editions. I just noticed that some machines (and not others), require LllpgForVisualStudio.exe to be run in Administrator mode (in Windows 7, right-click LllpgForVisualStudio.exe and click "Run as administrator"). Otherwise the "Register" button gives an error about permissions.

LesSyntaxHighlightingForVS2013.vsix is the syntax highlighter. Note that although it offers to install in VS 2013 Express, Microsoft won't allow it to work in Express. It also offers to intall in VS 2012 just in case it might work, but I don't think it will.

Table of contents for today:
  1. Do you really need a parser generator?
  2. Parsing terminology: if you already know about stuff like terminals and nullability, skip this.
  3. LL(k) versus the competition: in part one you got a sense of what LL(k) means (choosing a path through a grammar based only on k tokens of lookahead, k >= 1). This time I will compare LL(k) with alternatives such as LALR(1) and PEGs. If you already know or don't care about alternative parsing models, skip this.
  4. Learning LLLPG, with numbers: this section introduces syntactic predicates, semantic predicates, gates, underscore and tilde, and the [k] attribute.
  5. Wrapping Up.

Do you really need a parser generator?

One of the most common introductory examples for any parser generator is an expression parser or calculator, like this calculator bundled with the previous article:
LLLPG parser(laType(int))
{
 rule Atom()::double @[
  { result::double; }
  ( t:=id           { result = Vars[t.Value -> string]; }
  | t:=num          { result = t.Value -> double; } 
  | '-' result=Atom { result = -result; }
  | '(' result=Expr ')'
  | error           { result = 0; Error(InputPosition, "Expected identifer, number, or (stuff)"); }
  )
  { return result; }
 ];
 rule MulExpr()::double @[ 
  result:=Atom
  (op:=(mul|div) rhs:=Atom { result = Do(result, op, rhs); })*
  { return result; }
 ];
 rule AddExpr()::double @[
  result:=MulExpr
  (op:=(add|sub) rhs:=MulExpr { result = Do(result, op, rhs); })*
  { return result; }
 ];
 rule Expr()::double @[
  { result::double; }
  ( t:=id set result=Expr { Vars[t.Value.ToString()] = result; }
  | result=AddExpr )
  { return result; }
 ];
};

def Do(left::double, op::Token, right::double)::double
{
 switch op.Type {
  case add; return left + right;
  case sub; return left - right;
  case mul; return left * right;
  case div; return left / right;
 };
 return double.NaN; // unreachable
};
But if expression parsing is all you need, you don't really need a parser generator; there are simpler options for parsing, such as using a Pratt Parser like this one. If you only need to parse simple text fields like phone numbers, you can use regular expressions. And even if you need an entire programming language, you don't necessarily need to create your own; for example you could re-use the LES parser included in Loyc.Syntax.dll, which comes with LLLPG (it's not 100% complete yet, but it's usable enough for LLLPG), and if you don't re-use the parser you might still want to re-use the lexer.

So before you go writing a parser, especially if it's for something important rather than "for fun", seriously consider whether an existing parser would be good enough. Let me know if you have any questions about LES, Loyc trees, or using the Loyc libraries (Loyc.Essentials.dll, Loyc.Collections.dll, and so on).

Parsing terminology

I'll start with a short glossary of standard parsing terminology.

First of all, grammars consist of terminals and nonterminals.

A terminal is an item from the input; when you are defining a lexer, a terminal is a single character, and when you are defining a parser, a terminal is a token from the lexer. More specifically, the grammar is concerned only with the type of the token, not its value. For example one of your token types might be Number, and a parser cannot treat a particular number specially; so if you ever need to treat the number "0" differently than other numbers, then your lexer would have to create a special token type for that number (e.g. Zero), because the grammar cannot make decisions based on a token's value.

A nonterminal is a rule in the grammar. So in this grammar:
 token Spaces @[ (' '|'\t')+ ];
 token Id @[
  ('a'..'z'|'A'..'Z'|'_')
  ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*
 ];
 token Int @[ '0'..'9'+ ];
 token Token @[ Spaces | Id | Int ];
The nonterminals are Spaces, Id, Int and Token, while the terminals are inputs like 's', '\t', '9', and so forth.

Traditional literature about parsing assumes that there is a single "Start Rule" that reprsents the entire language. For example, if you want to parse C#, then the start rule would say something like "zero or more using statements, followed by zero or more namespaces and/or classes":
  rule Start @[ UsingStmt* TopLevelDecl* ];
  rule TopLevelDecl @[ NamespaceStmt | TypeDecl ];
  ...
However, LLLPG is more flexible than that. It doesn't limit you to one start rule; instead, LLLPG assumes you might start with any rule that isn't marked "private". After all, you might want to parse just a subset of the language, such as a method declaration, a single statement, or a single expression.

LLLPG rules express grammars in a somewhat non-standard way. Firstly, LLLPG notation is based on ANTLR notation, which itself is mildly different from EBNF (Extended Backus–Naur Form) notation which, in some circles, is considered more standard. Also, because LLLPG is embedded inside another programming language (LES or EC#), it uses the @[...] notation which means "token literal". A token literal is a sequence of tokens that is not interpreted by the host language; the host language merely figures out the boundaries of the "token literal" and saves the tokens so that LLLPG can use them later. So while a rule in EBNF might look like this:
 Atom = ['-'], (num | id);
the same rule in LLLPG-in-LES looks like this:
 rule Atom @[ ('-')? (num | id) ];
In EBNF, [foo] means foo is optional, while LLLPG uses foo? to express the same idea. Some people use {foo} to represent a list of zero or more foos, while LLLPG uses foo* for the same thing. On the plus side, alternatives and grouping work the same way in LLLPG and EBNF (e.g. (a | b)), although LLLPG also has the notation (a / b) which I'll explain later.

In addition, some grammar representations do not allow loops, or even optional items. For example, formal "four-tuple" grammars are defined this way. I discuss this further in my blog post, Grammars: theory vs practice.

A language is a different concept than a grammar. A grammar represents some kind of language, but generally there are many possible grammars that could represent the same language. The word "language" refers to the set of sentences that are considered valid; two different grammars represent the same language if they accept or reject the same input (or, looking at the matter in reverse, if you can generate the same set of "sentences" from both grammars). For example, the following four rules all represent a list of digits:
 rule Digits1 @[ '0'..'9'+ ];
 rule Digits2 @[ '0'..'9'* '0'..'9' ];
 rule Digits3 @[ '0'..'9' | Digits3 '0'..'9' ];
 rule Digits4 @[ ('0'..'9'+)* ];
If we consider each rule to be a separate grammar, the four grammars all represent the same language (a list of digits). But the grammars are of different types: Digits1 is an LL(1) grammar, Digits2 is LL(2), Digits3 is LALR(1), and I don't know what the heck to call Digits4 (it's highly ambiguous, and weird). Since LLLPG is an LL(k) parser generator, it supports the first two grammars, but can't handle the other two; it will print warnings about "ambiguity" for Digits3 and Digits4, then generate code that doesn't work properly. Actually, while Digits4 is truly ambiguous, Digits3 is actually unambiguous. However, Digits3 is ambiguous "in LL(k)", meaning that it is ambiguous from the LL(k) perspective, i.e. from LLLPG's perspective.

The word "nullable" means "can match nothing". For example, @[ '0'..'9'* ] is nullable because it successfully "matches" an input like "hello, world" by doing nothing; but @[ '0'..'9'+ ] is not nullable, and will only match something that starts with at least one digit.

LL(k) versus the competition

LLLPG is in the LL(k) family of parser generators. It is suitable for writing both lexers (also known as tokenizers or scanners) and parsers, but not for writing one-stage parsers that combine lexing and parsing into one step. It is more powerful than LL(1) parser generators such as Coco/R.

LL(k) parsers, both generated and hand-written, are very popular. Personally, I like the LL(k) class of parsers because I find them intuitive, and they are intuitive because when you write a parser by hand, it is natural to end up with a mostly-LL(k) parser. But there are two other popular families of parser generators for computer languages, based on LALR(1) and PEGs:
  • LALR(1) parsers are simplified LR(1) parsers which--well, it would take a lot of space to explain what LALR(1) is. Suffice it to say that LALR(1) parsers support left-recursion and use table-based lookups that are impractical to write by hand, i.e. a parser generator or similar infrastructure is always required, unlike LL(k) grammars which are straightforward to write by hand and therefore straightforward to understand. LALR(1) supports neither a superset nor a subset of LL(k) grammars; While I'm familiar with LALR(1), I admittedly don't know anything about its more general cousin LR(k), and I am not aware of any parser generators for LR(k) or projects using LR(k). Regardless of merit, LR(k) for k>1 just isn't very popular. I believe the main advantage of the LR family of parsers is that LR parsers support left- recursion while LL parsers do not (more on that later).
  • PEGs are recursive-descent parsers with syntactic predicates, so PEG grammars are potentially very similar to grammars that you would use with LLLPG. However, PEG parsers traditionally do not use prediction (although prediction could probably be added as an optimization), which means that they don't try to figure out in advance what the input means; for example, if the input is "42", a PEG parser does not look at the '4' and decide "oh, that looks like a number, so I'll call the number sub-parser". Instead, a PEG parser simply "tries out" each option starting with the first (is it spaces? is it an identifier?), until one of them successfully parses the input. Without a prediction step, PEG parsers apparently require memoization for efficiency (that is, a memory of failed and successful matches at each input character, to avoid repeating the same work over and over in different contexts). PEGs typically combine lexing (tokenization) and parsing into a single grammar (although it is not required), while other kinds of parsers separate lexing and parsing into independent stages.
Of course, I should also mention regular expressions, which probably the most popular parsing tool of all. However, while you can use regular expressions for simple parsing tasks, they are worthless for "full-scale" parsing, such as parsing an entire source file. The reason for this is that regular expressions do not support recursion; for example, the following rule is impossible to represent with a regular expression:
 rule PairsOfParens @[ '(' PairsOfParens? ')' ]; 
Because of this limitation, I don't think of regexes as a "serious" parsing tool.

Other kinds of parser generators also exist, but are less popular. Note that I'm only talking about parsers for computer languages; Natural Language Processing (e.g. to parse English) typically relies on different kinds of parsers that can handle ambiguity in more flexible ways. LLLPG is not really suitable for NLP.

As I was saying, the main difference between LL(k) and its closest cousin, the PEG, is that LL(k) parsers use prediction and LL(k) grammars usually suffer from ambiguity, while PEGs do not use prediction and the definition of PEGs pretends that ambiguity does not exist because it has a clearly-defined system of prioritization.

"Prediction" means figuring out which branch to take before it is taken. In a "plain" LL(k) parser (without and-predicates), the parser makes a decision and "never looks back". For example, when parsing the following LL(1) grammar:
public rule Tokens @[ Token* ];
public rule Token  @[ Float | Id | ' ' ];
token Float        @[ '0'..'9'* '.' '0'..'9'+ ];
token Id           @[ IdStart IdCont* ];
rule  IdStart      @[ 'a'..'z' | 'A'..'Z' | '_' ];
rule  IdCont       @[ IdStart | '0'..'9' ];
The Token method will get the next input character (known as LA0 or lookahead zero), check if it is a digit or '.', and call Float if so or Id (or consume a space) otherwise. If the input is something like "42", which does not match the definition of Float, the problem will be detected by the Float method, not by Token, and the parser cannot back up and try something else. If you add a new Int rule:
...
public rule Token @[ Float | Int | Id ];
token Float       @[ '0'..'9'* '.' '0'..'9'+ ];
token Int         @[ '0'..'9'+ ];
token Id          @[ IdStart IdCont* ];
...
Now you have a problem, because the parser potentially requires infinite lookahead to distinguish between Float and Int. By default, LLLPG uses LL(2), meaning it allows at most two characters of lookahead. With two characters of lookahead, it is possible to tell that input like "1.5" is Float, but it is not possible to tell whether "42" is a Float or an Int without looking at the third character. Thus, this grammar is ambiguous in LL(2), even though it is unambiguous when you have infinite lookahead. The parser will handle single-digit integers fine, but given a two-digit integer it will call Float and then produce an error because the expected '.' was missing.

A PEG parser does not have this problem; it will "try out" Float first and if that fails, the parser backs up and tries Int next. There's a performance tradeoff, though, as the input will be scanned twice for the two rules.

Although LLLPG is designed to parse LL(k) grammars, it handles ambiguity similarly to a PEG: if A|B is ambiguous, the parser will choose A by default because it came first, but it will also warn you about the ambiguity.

Since the number of leading digits is unlimited, LLLPG will consider this grammar ambiguous no matter how high your maximum lookahead k (as in LL(k)) is. You can resolve the conflict by combining Float and Int into a single rule:
public rule Tokens @[ Token* ];
public rule Token  @[ Number | Id ];
token Number       @[ '.' '0'..'9'+
                    | '0'..'9'+ ('.' '0'..'9'+)? ];
token Id           @[ IdStart IdCont* ];
...
Unfortunately, it's a little tricky sometimes to merge rules correctly. In this case, the problem is that Int always starts with a digit but Float does not. My solution here was to separate out the case of "no leading digits" into a separate "alternative" from the "has leading digits" case. There are a few other solutions you could use, which I'll discuss later in this article.

I mentioned that PEGs can combine lexing and parsing in a single grammar because they effectively support unlimited lookahead. To demonstrate why LL(k) parsers usually can't combine lexing and parsing, imagine that you want to parse a program that supports variable assignments like "x = 0" and function calls like x(0), something like this:
// "Id" means identifier, LParen means left parenthesis '(', etc.
rule Expr    @[ Assign | Call | ... ];
rule Assign  @[ Id Equals Expr ];
rule Call    @[ Id LParen ArgList ];
rule ArgList ...
...
If the input is received in the form of tokens, then this grammar only requires LL(2): the Expr parser just has to look at the second token to find out whether it is Equals ('=') or LParen ('(') to decide whether to call Assign or Call. However, if the input is received in the form of characters, no amount of lookahead is enough! The input could be something like
this_name_is_31_characters_long = 42;
To parse this directly from characters, 33 characters of lookahead would be required (LL(33)), and of course, in principle, there is no limit to the amount of lookahead. Besides, LLLPG is designed for small amounts of lookahead like LL(2) or maybe LL(4); a double-digit value is almost always a mistake. LL(33) could produce a ridiculously large and inefficient parser (I'm too afraid to even try it.)

In summary, LL(k) parsers are not as flexible as PEG parsers, because they are normally limited to k characters or tokens of lookahead, and k is usually small. PEGs, in contrast, can always "back up" and try another alternative when parsing fails. LLLPG makes up for this problem with "syntactic predicates", which allow unlimited lookahead, but you must insert them yourself, so there is slightly more work involved and you have to pay some attention to the lookahead issue. In exchange for this extra effort, though, your parsers are likely to have better performance, because you are explicitly aware of when you are doing something expensive (large lookahead). I say "likely" because I haven't been able to find any benchmarks comparing LL(k) parsers to PEG parsers, but I've heard rumors that PEGs are slower, and intuitively it seems to me that the memoization and retrying required by PEGs must have some cost, it can't be free. Prediction is not free either, but since lookahead has a strict limit, the costs usually don't get very high. Please note, however, that using syntactic predicates excessively could certainly cause an LLLPG parser to be slower than a PEG parser, especially given that LLLPG does not use memoization.

Let's talk briefly about LL(k) versus LALR(1). Sadly, my memory of LALR(1) has faded since university, and while Googling around, I didn't find any explanations of LALR(1) that I found really satisfying. Basically, LALR(1) is neither "better" nor "worse" than LL(k), it's just different. As wikipedia says:
The LALR(k) parsers are incomparable with LL(k) parsers – for any j and k both greater than 0, there are LALR(j) grammars that are not LL(k) grammars and conversely. In fact, it is undecidable whether a given LL(1) grammar is LALR(k) for any k >= 0.
But I have a sense that most LALR parser generators in widespread use support only LALR(1), not LALR(k); thus, roughly speaking, LLLPG is more powerful than an LALR(1) parser generator because it can use multiple lookahead tokens. However, since LALR(k) and LL(k) are incompatible, you would have to alter an LALR(1) grammar to work in an LL(k) parser generator and vice versa.

Here are some key points about the three classes of parser generators.

LL(k):
  • Straightforward to learn: just look at the generated code, you can see how it works directly.
  • Straightforward to add custom behavior with action code ({...}).
  • Straightforward default error handling. When unexpected input arrives, an LL(k) parser can always report what input it was expecting at the point of failure.
  • Predictably good performance. Performance depends on how many lookahead characters are actually needed, not by the maximum k specified in the grammar definition. Performance of LL(k) parsers can be harmed by nesting rules too deeply, since each rule requires a method call (and often a prediction step); specifically, expression grammars often have this problem. But in LLLPG you can use a trick to handle large expression grammars without deep nesting (I suppose I'll talk about this in a future article; my technique is demonstrated in my LES grammar.)
  • LL(k) parser generators, including LLLPG, may support valuable extra features outside the realm of LL(k) grammars, such as zero-width assertions (a.k.a. and-predicates), and multiple techniques for dealing with ambiguity.
  • No left recursion allowed (direct or indirect)
  • Strictly limited lookahead. It's easy to write grammars that require unlimited lookahead, so such grammars will have to be modified to work in LL(k). By the way, ANTLR overcomes this problem with techniques such as LL(*).
  • Low memory requirements.
  • Usually requires a separate parser and lexer.
LALR(1):
  • Challenging to learn if you can't find a good tutorial; the generated code may be impossible to follow
  • Supports left recursion (direct and indirect)
  • Some people feel that LALR(1) grammars are more elegant. For example, consider a rule "List" that represents a comma-separated list of numbers. The LALR(1) form of this rule would be num | List ',' num, while the LL(1) form of the rule is num (',' num)*. Which is better? You decide.
  • Low memory requirements.
  • Usually requires a separate parser and lexer.
  • LALR parser generators often have explicit mechanisms for specifying operator precedence.
PEGs:
  • New! (formalized in 2004)
  • Unlimited lookahead
  • "No ambiguity" (wink, wink) - or rather, any ambiguity is simply ignored, "first rule wins"
  • Supports zero-width assertions as a standard feature
  • Grammars are composable. It is easier to merge different PEG parsers than different LL/LR parsers.
  • Performance characteristics are not well-documented (as far as I've seen), but my intuition tells me to expect a naive memoization-based PEG parser generator to produce generally slow parsers. That said, I think all three parser types offer roughly O(N) performance to parse a file of size N.
  • High memory requirements (due to memoization).
  • It's non-obvious how to support "custom actions". Although a rule may parse successfully, its caller can fail (and often does), so I guess any work that a rule performs must be transactional, i.e. it must be possible to undo the action.
  • Supports unified grammars: a parser and lexer in a single grammar.
Regexes:
  • Very short, but often very cryptic, syntax.
  • Matches characters directly; not usable for token-based parsing.
  • Incapable of parsing languages of any significant size, because they do not support recursion or multiple "rules".
  • Most regex libraries have special shorthand directives like \b that require significantly more code to express when using a parser generator.
  • Regexes are traditionally interpreted, but may be compiled. Although regexes do a kind of parsing, regex engines are not called "parser generators" even if they generate code.
  • Regexes are closely related to DFAs and NFAs.
Let me know if I missed anything.

Learning LLLPG, with numbers

So you still want to use LLLPG, even knowing about the competition? Phew, what a relief! Okay, let's parse some numbers. Remember that code from earlier?
public rule Tokens @[ Token* ];
public rule Token  @[ Float | Id | ' ' ];
token Float        @[ '0'..'9'* '.' '0'..'9'+ ];
token Int          @[ '0'..'9'+ ];
token Id           @[ IdStart IdCont* ];
rule  IdStart      @[ 'a'..'z' | 'A'..'Z' | '_' ];
rule  IdCont       @[ IdStart | '0'..'9' ];
If you were hoping that I'd point out the difference between token and rule at this point... nope, I'm saving that for the next article. It's not important yet; just use token for defining token rules (like Float, Int and Id) and you'll be fine.

Now, since Float and Int are ambiguous in LL(k), here's how I suggested combining them into a single rule:
token Number       @[ '.' '0'..'9'+
                    | '0'..'9'+ ('.' '0'..'9'+)? ];
That's perhaps the best solution; other solutions have pitfalls, but the alternatives are still really handy for learning about LLLPG! First of all, one solution that looks simpler is this one:
token Number      @[ '0'..'9'* ('.' '0'..'9'+)? ];
But now you'll have a problem, because this matches an empty input, or it matches "hello" without consuming any input. Therefore, LLLPG will complain that Token is "nullable" and therefore must not be used in a loop (see the Token* loop?). After all, if you call Number in a loop and it doesn't match anything, you'll have an infinite loop which is very bad.

Zero-width assertions: syntactic predicates

You can actually prevent it from matching an empty input as follows:
token Number @[ &('0'..'9'|'.')
                '0'..'9'* ('.' '0'..'9'+)? ];
Here I have introduced the zero-width assertion or ZWA, also called an and-predicate because it uses the "and" sign (&). There are two kinds of and-predicates, which are called the "syntactic predicate" and the "semantic predicate". This one is a syntactic predicate, which means that it tests for syntax, which means it tests a grammar fragment starting at the current position. And since it's a zero-width assertion, it does not consuming any input (more precisely, it consumes input and then backtracks to the starting position, regardless of success or failure). and-predicates have two forms, the positive form & which checks that a condition is true, and the negative form &! which checks that a condition is false.

So, &('0'..'9'|'.') means that the number must start with '0'..'9' or '.'. Now Number() cannot possibly match an empty input. Unfortunately, LLLPG is not smart enough to know that it cannot match an empty input; it does not currently analyze and-predicates at all (it merely runs them), so it doesn't understand the effect caused by &('0'..'9'|'.'). Consequently it will still complain that Token is nullable even though it isn't. Hopefully this will be fixed in a future version, when I or some other smart cookie has time to figure out how to perform the analysis.

A syntactic predicate causes two methods to be generated to handle it:
private bool Try_Number_Test0(int lookaheadAmt)
{
  using (new SavePosition(this, lookaheadAmt))
    return Number_Test0();
}
private bool Number_Test0()
{
  if (!TryMatchRange('.', '.', '0', '9'))
    return false;
  return true;
}
The first method assumes the existance of a data type called SavePosition (as usual, you can define it yourself or inherit it from a base class that I wrote.) SavePosition's job is to
  1. Save the current InputPosition, and then restore that position when Dispose() is called
  2. Increment InputPosition by lookaheadAmt (which is often zero, but not always), where 0 <= lookaheadAmt < k.
The Try_ method is used to start a syntactic predicate and then backtrack, regardless of whether the result was true or false. The second method decides whether the expected input is present at the current input position. Syntactic predicates also assume the existence of TryMatch and (for lexers) TryMatchRange methods, which must return true (and advance to the next the input position) if one of the expected characters (or tokens) is present, or false if not.

Zero-width assertions: semantic predicates

Moving on now, another approach is
token Number  @[ {dot::bool=false;}
                 ('.' {dot=true;})?
                 '0'..'9'+ (&{!dot} '.' '0'..'9'+)?
               ];
Here I use the other kind of ZWA, the semantic predicate, to test whether dot is false (&{!dot}, which can be written equivalently as &!{dot}). &{expr} simply specifies a condition that must be true in order to proceed; it is normally used to resolve ambiguity between two possible paths through the grammar. Semantic predicates are a distinctly simpler feature than syntactic predicates and were implemented first in LLLPG. They simply test the user-defined expression during prediction.

So, here I have created a "dot" flag which is set to "true" if the first character is a dot (you write {dot::bool=false;} in LES, but when LLLPG adds C# input support, you can write {bool dot=false;} instead). The sequence '.' '0'..'9'+ is only allowed if the "dot" flag has not been set. This approach works correctly; however, you must exercise caution when using &{...} because &{...} blocks may execute earlier than you might expect them to; this is explained below.

The expression inside &{...} can include the "substitution variables" $LI and $LA, which refer to the current lookahead index and the current lookahead value; these are useful if you want to run a test on the input character. For example, when if you want to detect letters, you might write
rule Letter @[ 'a'..'z' | 'A'..'Z' ];
token Word @[ Letter+ ];
but this doesn't detect all possible letters; there are áĉçèntéd letters to worry about, grΣεk letters, Russiaи letters and so on. I've been supporting these other letters with a semantic and-predicate:
rule Letter @[ 'a'..'z' | 'A'..'Z'| &{char.IsLetter($LA -> char)} 0x80..0xFFFC ];
[FullLLk] token Word @[ Letter+ ];
0x80..0xFFFC denotes all the non-ASCII characters supported by a .NET char, while the arrow operator -> is the LeMP/LES notation for a cast; the C# equivalent is char.IsLetter((char) $LA). $LA will be replaced with the appropriate lookahead token, which is most often LA0, but not always. Now, when you look at the generated code, you'll notice that the and-predicate has been copied into the Word rule:
void Word()
{
  int la0;
  Letter();
  for (;;) {
    la0 = LA0;
    if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
      Letter();
    else if (la0 >= 128 && la0 <= 65532) {
      la0 = LA0;
      if (char.IsLetter((char) la0))
        Letter();
      else
        break;
    } else
      break;
  }
}
Copying and-predicates across rules is normal behavior that occurs whenever one rule needs to use the and-predicate to decide whether to call another rule. ANTLR calls this "hoisting", so that's what I call it too: the predicate was hoisted from Letter to Word. (In this particular case I had to add FullLLk to make it work this way; more about that in a future article.)

Gates

Here's one final technique:
token Number  @[ '.'? '0'..'9'+ =>
                 '0'..'9'* ('.' '0'..'9'+)? ];
This example introduces a feature that I call "the gate". The grammar fragment ('.'? '0'..'9'+) before the gate operator => is not actually used by Number itself, but it can be used by the caller to decide whether to invoke the rule.

A gate is an advanced but simple mechanism to alter the way prediction works. Recall that parsing is a series of prediction and matching steps. First the parser decides what path to take next, which is called "prediction", then it matches based on that decision. Normally, prediction and matching are based on the same information. However, a gate => causes different information to be given to prediction and matching. The left-hand side of the gate is used for the purpose of prediction analysis; the right-hand side is used for matching.

The decision of whether or not Token will call the Number rule is a prediction decision, therefore it uses the left-hand side of the gate. This ensures that the caller will not believe that Number can match an empty input. When code is generated for Number itself, the left-hand side of the gate is ignored because it is not part of an "alts" (i.e. the gate expression is not embedded in a "*" or "+" loop or an optional element "?"). Instead, Number runs the matching code, which is the right-hand side, '0'..'9'* ('.' '0'..'9'+)?.

Gates are a way of lying to the prediction system. You are telling it to expect a certain input, then saying "no, no, match this other input instead." Gates are rarely needed, but they can provide simple solutions to certain tricky problems.

Please note that gates, unlike syntactic predicates, do not provide unlimited lookahead. For example, if k=2, the characters 'c' 'd' in ('a' 'b' 'c' 'd' => ...) will not have any effect.

The gate operator => has higher precedence than |, so a | b => c | d is parsed as a | (b => c) | d.

More about and-predicates

Let's see now, I was saying something about and-predicates running earlier than you might expect. Consider this example:
flag::bool = false;
public rule Paradox @[ {flag = true;} &{flag} 'x' / 'x' ];
Here I've introduced the "/" operator. It behaves identically to the "|" operator, but has the effect of suppressing warnings about ambiguity between the two branches (both branches match 'x' if flag == true, so they are certainly ambiguous).

What will the value of 'flag' be after you call Paradox()? Since both branches are the same ('x'), the only way LLLPG can make a decision is by testing the and-predicate &{flag}. But the actions {flag=false;} and {flag=true;} execute after prediction, so &{flag} actually runs first even though it appears to come after {flag=true;}. You can clearly see this when you look at the actual generated code:
bool flag = false;
public void Paradox()
{
  if (flag) {
    flag = true;
    Match('x');
  } else
    Match('x');
}
What happened? Well, LLLPG doesn't bother to read LA0 at all, because it won't help make a decision. So the usual prediction step is replaced with a test of the and-predicate &{flag}, and then the matching code runs ({flag = true;} 'x' for the left branch and 'x' for the right branch).

This example will give the following warning: "It's poor style to put a code block {} before an and-predicate &{} because the and-predicate normally runs first."

In a different sense, though, and-predicates might run after you might expect. Let's look at the code for this Number rule from earlier:
token Number  @[ {dot::bool=false;}
               ('.' {dot=true;})?
               '0'..'9'+ (&{!dot} '.' '0'..'9'+)?
             ];
The generated code for this rule is
void Number()
{
int la0, la1; bool dot = false; la0 = LA0; if (la0 == '.') { Skip(); dot = true; } MatchRange('0', '9'); for (;;) { la0 = LA0; if (la0 >= '0' && la0 <= '9') Skip(); else break; }
la0 = LA0; if (la0 == '.') { if (!dot) { la1 = LA(1); if (la1 >= '0' && la1 <= '9') { Skip(); Skip(); for (;;) { la0 = LA0; if (la0 >= '0' && la0 <= '9') Skip(); else break; } } } } }
Here I would draw your attention to the way that (&{!dot} '.' '0'..'9'+)? is handled: first LLLPG checks if (la0 == '.'), then if (!dot) afterward, even though &{!dot} is written first in the grammar. Another example shows more specifically how LLLPG behaves:
token Foo @[ (&{a()} 'A' &{b()} 'B')? ];

void Foo()
{
  int la0, la1;
  la0 = LA0;
  if (la0 == 'A') {
    if (a()) {
      la1 = LA(1);
      if (la1 == 'B') {
        if (b()) {
          Skip();
          Skip();
        }
      }
    }
  }
}
First LLLPG tests for 'A', then it checks &{a()}, then it tests for 'B', and finally it checks &{b()}; it is as if the and-predicates are being "bumped" one position to the right. Actually, I decided that all zero-width assersions should work this way for the sake of performance. To understand this, consider the Letter and Word rules from earlier:
  rule Letter @[ 'a'..'z' | 'A'..'Z'| &{char.IsLetter($LA -> char)} 0x80..0xFFFC ];
  [FullLLk] token Word @[ Letter+ ];
In the code for Word you can see that char.IsLetter is called after the tests on LA0:
    if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
      Letter();
    else if (la0 >= 128 && la0 <= 65532) {
      la0 = LA0;
      if (char.IsLetter((char) la0))
        Letter();
      else
        break;
    } else
      break;
And this makes sense; if char.IsLetter were called first, there would be little point in testing for 'a'..'z' | 'A'..'Z' at all. It makes even more sense in the larger context of a "Token" rule like this one:
[FullLLk] rule Token @[ Spaces / Word / Number / Punctuation / Comma / _ ];
The Token method will look something like this (some newlines removed for brevity):
void Token()
{
  int la0, la1;
  la0 = LA0;
  switch (la0) {
  case '\t':  case ' ':
    Spaces();
    break;
  case '.':
    {
      la1 = LA(1);
      if (la1 >= '0' && la1 <= '9')
        Number();
      else
        Punctuation();
    }
    break;
  case '0':  case '1':  case '2':  case '3':  case '4':
  case '5':  case '6':  case '7':  case '8':  case '9':
    Number();
    break;
  case '!':  case '#':  case '$':  case '%':  case '&':
  case '*':  case '+':  case ',':  case '-':  case '/':
  case '<':  case '=':  case '>':  case '?':  case '@':
  case '\\': case '^':  case '|':
    Punctuation();
    break;
  default:
    if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
      Word();
    else if (la0 >= 128 && la0 <= 65532) {
      la0 = LA0;
      if (char.IsLetter((char) la0))
        Word();
      else
        MatchExcept();
    } else
      MatchExcept();
    break;
  }
}
In this example, clearly it makes more sense to examine LA0 before checking &{char.IsLetter(...)}. If LLLPG invoked the and-predicate first, the code would have the form
void Token()
{
  int la0, la1;
  la0 = LA0;
  if (char.IsLetter((char) la0)) {
    switch (la0) {
      // All the same switch cases as before
    }
  } else {
    switch (la0) {
      // All the same switch cases again!
    }
  }
}
The code of Token would be much longer, and slower too, since we'd call char.IsLetter on every single input character, not just the ones in the unicode range 0x80..0xFFFC. Clearly, then, as a general rule it's good that LLLPG tests the character values before the ZWAs.

Underscore and tilde

The underscore _ means "match any terminal", while ~(X|Y) means "match any terminal except X or Y". The next section has an example that uses both. In ~(X|Y), X and Y must be terminals (if X and/or Y are non-terminals, consider using something like &!(X|Y) _ instead.)

Setting lookahead

Pure LL(k) parsers look up to k terminals ahead to make a branching decision, and once a decision is make they stick to it, they don't "backtrack" or try something else. So if k is too low, LLLPG will generate code that makes incorrect decisions.

LLLPG's default k value is 2, which is enough in the majority of situations, as long as your grammar is designed to be LL(k). To increase k to X, simply add a [DefaultK(X)] attribute to the grammar (i.e. the LLLPG statement), or add a [k(X)] attribute to a single rule. Here's an example that represents "double-quoted" and """triple-quoted""" strings, where k=2 two is not enough:
private token DQString @[
 '"' ('\\' _  | ~('"'|'\\'|'\r'|'\n'))* '"'? ];
];
[k(4)]
private token TQString @[
 '"' '"' '"' nongreedy(Newline / _)* '"' '"' '"' 
 "'''"       nongreedy(Newline / _)* "'''"
];
[k(4)]
private token Token @[
 ( {_type = TT.Spaces;}    Spaces
 ...
 | {_type = TT.String;}    TQString
 | {_type = TT.String;}    DQString
 ...
 )
];
Here I've used "_" inside both kinds of strings, meaning "match any character", but of course this implies that the string can go on and on forever. To fix that, I add nongreedy meaning "exit the loop when it makes sense to do so" (greedy and nongreedy are explained more in my blog.)

With only two characters of lookahead, LLLPG cannot tell whether """this""" is an empty DQString ("") or a triple-quoted TQString. Since TQString is listed first, LLLPG will always choose TQString when a Token starts with "", but of course this may be the wrong decision. You'll also get a warning like this one:
warning : Loyc.LLParserGenerator.Macros.run_LLLPG: 
Alternatives (4, 5) are ambiguous for input such as «""» (["], ["])
[k(3)] is sufficient in this case, but it's okay if you use a number that is a little higher than necessary, so I've used [k(4)] here. In "bad" grammars that are always ambiguous in LL(k) (e.g. LL(*) or left-recursive grammars), I suspect that a high k value can lead to very large generated code, which is why I don't recommend using a really high DefaultK value.

Wrapping Up

By now we've covered all the basic features, and after you read my blog post on greedy & nongreedy, you should be able to write some pretty fancy parsers!

At the top of the article I discussed the Custom Tool (single-file generator) for VS2008-VS2013, and the LES syntax highlighter. For now, the LES syntax highlighter only works in Visual Studio 2013 (Pro), and it doesn't have all the features I want yet. In particular, it does't highlight the important difference between "foo(x)" and "foo (x)" (see my Loyc Expression Syntax article for details.)

In future articles I'll talk more about the following topics. Let me know what you'd like me to cover first!
  • token versus rule.
  • The API that LLLPG uses (and typical base class implementations): Match(), Skip(), MatchRange(), LA(k), LT(k), etc.
  • Saving inputs: The value of a terminal can be assigned to a variable using x:='a'..'z' to create a variable 'x', x='a'..'z' to assign the value to an existing variable 'x', or xs+='a'..'z' to add the value to a list 'xs' by calling list.Add(MatchRange('a', 'z')).
  • Managing ambiguity: most computer languages are actually somewhat ambiguous. LLLPG will first detect ambiguity and warn you about it using an example of an ambiguous input ("Alternatives (2, 4) are ambiguous for input such as «0x_»"). Branches are specified in priority order, and you can suppress warnings by separating branches with / (which works the same way as |, but slanted). In case of ambiguity with the exit branch, you can use "greedy" or "nongreedy" to specify whether looping or exiting, respectively, should have higher priority.
  • Error handling: By default, LLLPG generates prediction code with the assumption the input is always well-formed. But there are a couple of ways to handle errors, including an "error branch" as in (a | b | error { HandleError(); }), and the [NoDefaultArm] grammar attribute. Alternatively, you can designate one branch as the default (the default default branch is the last branch or the exit branch.) 
  • [FullLLk] versus "approximate" LL(k): by default, LLLPG uses a prediction system slightly weaker than LL(k). It resembles Terrance Parr's "Linear Approximate Lookahead", except that my version is a bit more powerful.
  • My favorite feature that LLLPG doesn't have: length-based disambiguation.
  • A real-life example: parsing LES, or other stuff. Of course, the sample code included with this article is almost a real-life example already. Other parsing suggestions welcome.
  • How-to: "tree parsing": how to use LLLPG with three-stage parsing, where you group tokens by parenthesis and braces before the main parsing stage.
  • How-to: keyword parsing: LLLPG can handle keywords directly, although the generated code is rather large.
  • How LLLPG works: a brief tour of the source code.
  • All about Loyc and its libraries
  • Things you can do with LeMP: other source code manipulators besides LLLPG.
  • Do you want LLLPG to support a language other than C#? Let me know which language. Better yet, ask me how to write a new output module, and I'll write a whole article just about that.
I would also love to hear how you're planning to use LLLPG in your own projects. And don't forget to share this article with your compiler-writing friends and university professors! (Dude, your friends write compilers? I wish I had friends like that.)

Sunday, November 3, 2013

Grammars: theory vs practice

Theoretical literature is often written in a baffling mathematical form, so I decided to write this post to help people understand the theoretical representation of grammars. My parser generator, LLLPG, does not actually use the representation described here, though.

If you ever try to read theoretical parsing literature, you might be completely confused by the way grammars are defined, because the standard definition does not resemble the grammars of ANTLR, LLLPG or other tools. For example, Wikipedia gives this definition of a grammar:
A context-free grammar G is defined by the 4-tuple G = (V, Σ, R, S) where
  1. V is a finite set; each element v in V is called a non-terminal character or a variable. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories. Each variable defines a sub-language of the language defined by G.
  2. Σ is a finite set of terminals, disjoint from V, which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar G.
  3. R is a finite relation from V to (V ∪ Σ)*, where the asterisk represents the Kleene star operation. The members of R are called the (rewrite) rules or productions of the grammar. (also commonly symbolized by a P)
  4. S is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of V.
Decoding the math-speak: V is a list of possible nonterminals (i.e. rules), Σ is a list of possible terminals (input characters or tokens), and R contains the definitions of the nonterminals (i.e. V is only the rule names, R has the rule definitions). The symbol ∪ means "set union", and the expression "(V ∪ Σ)*" means "a sequence of zero or more elements from V or Σ", i.e. an ordered list of terminals and nonterminals.

I'll explain what this means by translating an LLLPG grammar to its theoretical 4-tuple representation, G = (V, Σ, R, S). So consider this grammar, which represents a list of numbers separated by spaces (technically it represents a single number or some spaces, but of course we can just invoke Token repeatedly to get a list):
  rule Token  @[ Spaces | Number ];
  rule Spaces @[ (' '|'\t')+ ];
  rule Number @[ '0'..'9'+ ('.' '0'..'9'+)? ];
We can immediately see what V, Σ, and S are. V is { Token, Spaces, Number }, and in university they would probably figure that Σ is the 13 possible input characters, { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '.', ' ', '\t' }. In real life, though, there's nothing to restrict the input to these 13 characters; your input could probably include any bytes, or any unicode characters (depending on how you get your input), and the definition of Σ is irrelevant. That is to say, all that matters is that Σ contains characters; who cares what the precise set is? I certainly don't, so let's move on. The start symbol S is the one that uses the others, so S must be Token in this example.

That just leaves R. What's R? "a finite relation from V to (V ∪ Σ)*". What Wikipedia means to say is that R is a set of rule → content pairs, where "rule" is a member of V and "content" is a sequence of items (terminals and nonterminals) that v can expand to. "rule" can appear more than once on the left side; for example, R will contain two entries for Token:
 Token → Spaces
 Token → Number
Now, the right-hand side must only be a simple list; for example, there is no way to express Spaces | Number or (' '|'\t')+ in R. Instead, a list of alternatives like Spaces | Number must be split into multiple pairs (as I've shown already), and loops like (' '|'\t')+ must be split into new rules. Basically, before you can figure out what R is, you have to eliminate all the loops and optional elements from your grammar, like this:
  rule Token        @[ Spaces | Number ];
  
  rule Spaces       @[ Space SpacesOpt ];
  rule SpacesOpt    @[ Spaces | () ];     // equivalent to @[ Spaces? ]
  rule Space        @[ ' ' | '\t' ];

  rule Number       @[ Digits DotDigitsOpt ];
  rule DotDigitsOps @[ '.' Digits | () ];
  rule Digits       @[ Digit DigitsOpt ];
  rule DigitsOpt    @[ Digits | () ];     // equivalent to @[ Digits? ]
  // Oh, and you have to eliminate ranges too. No '0'..'9' allowed.
  rule Digit        @[ '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' ];
In most ways, this grammar is the same as the original one: the original grammar was LL(1); this one is still LL(1), and it represents the same language (either a number, or a sequence of spaces).

The theoretical definition of R doesn't allow any loops because they are not strictly necessary. It turns out that it's always possible to eliminate loops (and optional elements) from a grammar by defining a bunch of new rules (and it's not difficult either, just tedius to do by hand). Wikipedia's definition of R is the simplest possible definition, and simple definitions tend to be more useful for mathematical analysis than complex definitions. So R doesn't support loops, nor optional elements, in order to make mathematical analysis of grammar theory easier (at the cost of making a grammar more verbose, and cumbersome to describe).

I lied before. We actually can't say what V is until we've eliminated all the loops and options. Now we can see that V is not { Token, Spaces, Number }, but rather { Token, Spaces, SpacesOpt, Space, Number, DotDigitsOpt, Digits, DigitsOpt, Digit }. Given the loop-free grammar above, we can finally state the complete contents of R:
{
 Token → Spaces
 Token → Number
 Spaces → Space, SpacesOpt
 SpacesOpt → Spaces
 SpacesOpt → ε
 Space → ' '
 Space → '\t'
 Number → Digits, DotDigitsOpt
 DotDigitsOpt → '.', Digits
 DotDigitsOpt → ε
 Digits → Digit, DigitsOpt
 DigitsOpt → Digits
 DigitsOpt → ε
 Digit → '0'
 Digit → '1'
 Digit → '2'
 Digit → '3'
 Digit → '4'
 Digit → '5'
 Digit → '6'
 Digit → '7'
 Digit → '8'
 Digit → '9'
}
One more thing, in grammar theory, ε represents nothing (an empty list, which is neither a terminal nor nonterminal).

That's it! Now you can see the relationship between "user-friendly" grammars like you use with LLLPG, and "theoretical" grammars used by university professors in comp-sci departments.

By the way, the theoretical definition of a grammar is based on Noam Chomsky's "generative grammar" concept which says "you expand nonterminals one or more times until all the nonterminals are gone". The "generative grammar" concept, which was originally conceived for the study of natural (human) languages, is oriented around speakers rather than listeners. So rather than parsing "13.5" into a token, the "generative grammar" point of view looks at it in reverse, saying "we expand Token to get Number, then we expand number to get the sequence of characters 13.5"; that is, it takes the perspective of a printer rather than a parser. This explains why theoretical documents will talk about rules "expanding", which is the opposite of a parser's goal (which is to "contract" all the terminals into a single nonterminal).

You might wonder where LLLPG features like zero-width assertions (&foo) or greedy/nongreedy fit into all this. Answer: they don't fit. The theory around generative grammars does not have any concept of a zero-with assertion, and if your grammar uses a zero-width assertion then it is not an LL(k) grammar. Meanwhile, greedy and nongreedy are used to resolve LL(k) ambiguities. Grammar theory does include the concept of ambiguity, but not so much mechanisms for dealing with ambiguity. I don't completely understand the theory, but I think if you have to use greedy or nongreedy, then perhaps your grammar isn't "truly" LL(k), or maybe it's more correct to say "the grammar is ambiguous in LL(k)". That is, an LL(k) parser cannot parse the grammar unambiguously (which necessitates a mechanism, such as greedy/nongreedy, to choose a single interpretation of the input, out of the multiple possible interpretations.)

< Published on >

Saturday, November 2, 2013

LLLPG: greedy and nongreedy

(Edit: This blog post was incorporated into the fourth article about LLLPG.)

LLLPG supports "greedy" and "nongreedy" loops and optional items. The "greedy" and "nongreedy" modes refer to the action you prefer to take in case of ambiguity between an exit branch and another branch. Greedy is the default: it means that if the input matches both a non-exit branch and an exit branch, the non-exit branch should be taken. A typical greedy example is this rule for an "if" statement:
  private token IfStmt @[ 
    // "if" "(" Expr ")" Stmt ("else" Stmt)?
    TT.If TT.LParen Expr TT.RParen Stmt (greedy(TT.Else Stmt))?
  ];
Note: currently you need extra parens around greedy(...) or nongreedy(...) due to a parser bug, sorry about that. Without the parens you'll get the error "Unrecognized expression. Treating it as a terminal."

In this case, it is possible that the "if" statement is nested inside another "if" statement. Given that the input could be something like
if (expr) if (expr)
  Stmt();
else
  Stmt();
It is, in general, ambiguous whether to consume TT.Else Stmt or to exit, because the else clause could be paired with the first "if" or the second one. The "greedy" modifier, which must always be paired with a loop or option operator (* + ?) means "in case of ambiguity with the exit branch, do not exit and do not print a warning." Since greedy behavior is the default, the greedy modifier's only purpose is to suppress the warning.

Now, you might logically think that changing "greedy" to "nongreedy" would cause the "else" to match with the outer "if" statement rather than the inner one. Unfortunately, that's not what happens! It does not work because the code generated for IfStmt is not aware of the run-time call stack leading up to it: it does not know whether it is nested inside another IfStmt or not. LLLPG only knows that it could be nested inside another "if" statement; the technical jargon for this is that the follow set of the IfStmt rule includes TT.Else Stmt.

What actually happens is that nongreedy(TT.Else Stmt)? will never match TT.Else, and LLLPG will give you a warning that "branch 1 is unreachable". Not knowing the actual context in which IfStmt was called, LLLPG is programmed to assume that all possible follow sets of IfStmt apply simultaneously, even though in reality IfStmt is called in one specific context. The statically computed follow set of IfStmt, which is based on all possible contexts where IfStmt might appear, includes TT.Else Stmt, and nongreedy uses this information to decide, unconditionally, to let the exit branch win. To put it another way, LLLPG behaves as if IfStmt is always called from inside another IfStmt, when in reality it merely might be. It would be fairly difficult for LLLPG to behave any other way; how is the IfStmt() method supposed to know call stack of other rules that called it?

By the way, I have the impression that the formal way of describing this limitation of LLLPG's behavior is to say that LLLPG supports only "strong" LL(k) grammars, not "general" LL(k) grammars (this is true even when you use FullLLk(true)).

So at the end of a rule, LLLPG makes decisions based on all possible contexts of that rule, rather than the actual context. Consequently, nongreedy is not as useful as it could be. However, nongreedy still has its uses. Good examples include strings and comments:
  token TQString @[ "'''" (nongreedy(_))* "'''" ];
  token MLComment @[ "/*" (nongreedy(MLComment / _))* "*/" ];
This introduces the single underscore _, which matches any single terminal (not including EOF).

The first example defines the syntax of triple-quoted strings '''like this one'''. The contents of the string are any sequence of characters except "'''", which ends the string. The nongreedy modifier is important; without it, the loop (_)* will simply consume all characters until end of file, and then produce errors because the expected "'''" was not found at EOF.

The second example for /* multi-line comments */ is similar, except that (just for fun) I decided to support nested multi-line comments by calling the MLComment rule recursively.

There's actually a bug in TQString, assuming that LLLPG is left in its default configuration. Moreover, LLLPG will not print a warning about it. Got any idea what the bug is? I'm about to spoil the answer, so if you want to give it some thought, do so now before you start glancing at the lower half of this paragraph. Well, if you actually tested this code you might notice that a string like '''one''two''' will be parsed incorrectly, because two quotes, not three, will cause the loop to exit. The reason is that the default maximum lookahead is 2, so two quotes are enough to make LLLPG decide to exit the loop (and then the third Match('\'') in the generated code will fail). To fix this, simply add a [k(3)] attribute to the rule. No warning was printed because half the purpose of nongreedy is to suppress warnings; after all, mixing (_)* with anything else is inherently ambiguous and will frequently cause a warning that you must suppress.

Today I ran into an unfortunate situation in which neither greedy nor nongreedy was appropriate. I was writing a Visual Studio "classifier" for syntax-highlighting of LES, and I decided to use a line-based design where lexing would always start at the beginning of a line. Therefore, I needed to keep track of which lines started inside multi-line comments and triple-quoted strings. Now, if a line starts inside a comment or string, I invoke a special rule that is designed to parse the rest of the comment or string, or stop at the end of the line. Since LES supports nested multi-line comments, I wrote the following rule:
  public token MLCommentLine(ref nested::int)::bool @[ 
    (nongreedy
      ( &{nested>0} "*/" {nested--;}
      / "/*" {nested++;}
      / ~('\r'|'\n')
      ))*
    (Newline {return false;} | "*/" {return true;})
  ];
This rule takes the current comment nesting level as an argument (0 = comment is not nested) and updates the nesting level if it changes during the current line of code. The loop has three arms:
  1. For input of "*/" when comments are nested, reduce the nesting level
  2. For input of "/*", increase the nesting level
  3. For input of anything else (not including a newline), consume one character.
I chose "nongreedy" because otherwise the third branch ~('\r'|'\n') will match the first character of "*/", so the loop would never exit. But this didn't work; LLLPG gave the warning "branch 1 is unreachable". Why is it unreachable? I have to admit, I couldn't figure it out at first. If you feel like you're stumped by LLLPG warnings sometimes, you're not alone, they sometimes confuse me too. In this case I was confused because I thought the predicate &{nested>0} would choose whether to stay in the loop or exit. But in fact nongreedy gives the exit branch a higher priority than the first branch, so regardless of whether &{nested>0}, LLLPG will always choose the exit branch when the input is "*/".

At that point I realized that what I wanted was a loop that was neither greedy nor nongreedy, in which the priority of the exit branch is somewhere in the middle. I wanted to be able to write something like this, where "exit" is higher priority than ~('\r'|'\n') but lower priority than &{nested>0} "*/":
  public token MLCommentLine(ref nested::int)::bool @[ 
    ( &{nested>0} "*/" {nested--;}
    / "/*" {nested++;}
    / exit
    / ~('\r'|'\n')
    )*
    (Newline {return false;} | "*/" {return true;})
  ];
Unfortunately, LLLPG does not support this notation. Maybe in a future version. Here's what I did instead:
  public token MLCommentLine(ref nested::int)::bool @[ 
    (greedy
      ( &{nested>0} "*/" {nested--;}
      / "/*" {nested++;}
      / ~('\r'|'\n'|'*')
      / '*' (&!'/')
      ))*
    (Newline {return false;} | "*/" {return true;})
   ];
Here, I switched back to a greedy loop and added '*' as its own branch with an extra check to make sure '*' is not followed by '/'. If the test &!'/' succeeds, the new fourth branch matches the '*' character (but not the character afterward); otherwise the loop exits. I could have also written it like this, with only three branches:
  public token MLCommentLine(ref nested::int)::bool @[ 
    (greedy
      ( &{nested>0} "*/" {nested--;}
      / "/*" {nested++;}
      / (&!"*/") ~('\r'|'\n')
      ))*
    (Newline {return false;} | "*/" {return true;})
  ];
However, this version is slower, because LLLPG will actually run the &!"*/" test on every character within the comment.

Here's one more example using nongreedy:
// Parsing a comma-separated value file (.csv)
public rule CSVFile @[ Line* ];
rule Line           @[ Field greedy(',' Field)* (Newline | EOF) ];
rule Newline        @[ ('\r' '\n'?) | '\n' ];
rule Field          @[ nongreedy(_)*
                     | '"' ('"' '"' | nongreedy(~('\n'|'\r'))* '"' ];
This grammar describes a file filled with fields separated by commas (plus I introduced the EOF symbol, so that no Newline is required at the end of the last line). Notice that 'Field' has the loop nongreedy(_)*. How does LLLPG know to when to break out of the loop? Because it computes the "follow set" or "return address" of each rule. In this case, 'Field' can be followed by ','|'\n'|'\r'|EOF, so the loop will break as soon as one of these characters is encountered. This is different than the IfStmt example above in an important respect: Field always has the same follow set. Even though Field is called from two different places, the follow set is the same in both locations: ','|'\n'|'\r'|EOF. So nongreedy works reliably in this example because it makes no difference what context Field was called from.

Friday, October 25, 2013

My #1 Tip for Technical Writers

There are so many docs out there that don't seem to get this. If I could offer one tip for anyone who writes technical or scientific documentation about anything, it's this: use examples. Lots of examples, the more the better. Not just examples--put explanatory prose beside, above or below your examples, but use examples liberally. Diagrams and pictures don't hurt either.

It's fundamentally hard to put abstract technical concepts into words, and I've noticed that most people aren't that good at it. Human language is quite ambiguous and besides, we often make mistakes in our writing that can make our prose confusing. I think I'm better than average, but often when I re-read something I wrote a long time ago, I realize that it's ambiguous or that it hasn't given all the information necessary for a novice to understand it, and I have to clarify. (TODO: give an example here.)

The only trouble with giving examples is that it can be hard to find good ones. For example, I'd like to give examples from my own life of the things I'm talking about in this post, but it's hard for me to remember my own life, so nothing is coming to mind.

Examples can often provide the clarity that the text is lacking, because
  1. they restate the information in a completely different form, giving the reader a different perspective on the topic.
  2. they make abstract ideas concrete, or visualizable in the mind's eye. Such concrete things are easier to understand and reason about than abstract generalizations.
  3. they can constrain the possible interpretations of the main text: they can resolve ambiguity by showing that certain interpretations are impossible.
A poor example is one that doesn't have the above advantages:
  1. If the example is too general and somehow matches the abstract description of a concept, it isn't very helpful. For example, if you're trying to explain what a list comprehension is, S = { f(x) | x ϵ T } is a terrible example; it demonstrates the syntax, but nothing else. A better example would be something like Doubled = { 2n | n ϵ Original }. Once you explain that {...} represents a set of things (in this case, numbers), that "ϵ" means "is a member of", and "|" means "where", the reader can start to understand the explanation: this takes all the numbers from a set called "Original", doubles them, and uses "Doubled" as the name for the result.
  2. If the example isn't something concrete, it may not be very good. For example if you want to explain how to use an API called, let's say, Scan(), and your example is simply Scan(foo, bar), it's worthless since "foo" and "bar" are not concrete things that people can imagine. That's not to say one should never use "foo" and "bar", but if your example will make more sense if it is based on a real-world scenario, then base your example on a real-world scenario.
  3. They can constrain the possible interpretations of the main text: they can resolve ambiguity by showing that certain interpretations are impossible. For example, suppose a small child just learned to add numbers, and now you want to teach the child about multiplication. Then "2 × 2 = 4" is a terrible example to introduce multiplication: the child already knows "2 + 2 = 4", and it looks like you just rotated the "+" sign. The example sucks because it fails to show how multiplication is different from related concepts. Likewise "3 × 1 = 3" is a bad example: it does not illustrate the purpose of multiplication (and obviously you shouldn't start with "3 × -4 = -12", it's too advanced). Eventually you can show these examples if you want, but first you must show what multiplication is, how it is used and why it is useful.
One more thing, when giving code examples, make sure they compile, and preferably, mention the environmental conditions required to compile them (#include files, assembly references, compiler options, etc.)

Tuesday, October 8, 2013

LLLPG: almost done

Yesterday I published a new article about the Loyc LL(k) Parser Generator.

Formalism tutorial

Have you ever seen notation like this in a programming language paper?
I have. And it's completely frustrating to encounter notation like this, because there is no obvious way to find out what it means. The paper itself will not give a name to the notation they are using, and there is no way to Google for the symbols that are used. Even if you can find one of these symbols in Character Map or whatever--and it probably won't be listed under standard fonts--Google responds with something like
Your search - ∏ - did not match any documents.
Luckily, I stumbled upon this page which briefly explains, and gives names for, these notations. For example: "|-" means "entails", "<:" is a subtype relation, ⊥ is "bottom", its inverse Τ is "top", Γ is "context", and the long horizontal line is sort of an if-then statement:
"if everything above the bar (the premises) is true, then the thing below the bar (the conclusion) is true"
Excellent, this will be very helpful when I encounter another one of these papers. I wish computer scientists understood that their notation is far from universally known and that, without a "legend" or informative URL, mere computer engineers like me not only cannot understand the notation, but cannot learn about it either.

Update: here's a more detailed introduction to Programming Language Theory.

Friday, September 13, 2013

Exploring the Nemerle macro system

To illustrate the properties of Nemerle's macro system, I will show a list of macros, followed by a program that calls the macros.
/* Macro.n */
using System;
using System.Linq;
using System.Console;

namespace Macro {
  using P;
  public macro Macro1(e:PExpr) {
    // This macro calls Console.WriteLine(...) even if the call site is
    // not "using System.Console".
    <[ WriteLine($e); ]>
  }
  public macro Macro2(e:PExpr) {
    // This macro creates its own "x" variable and prints it. Notice
    // that Main() calls Macro2() twice and there is no error on the
    // second call. So two separate versions of "x" are being defined.
    // (It's important that "x" is mutable. You can redefine an 
    // immutable variable, but not a mutable one.)
    <[ mutable x = $e; WriteLine(x); ]>
  }
  public macro Macro3(e:PExpr) {
    // This macro tries to assign an expression to an existing
    // "x" variable, but it always fails at expansion time; we
    // can't use an "x" defined by the caller.
    <[     x = $e; WriteLine(x); ]>
  }
  public macro Macro4(e:PExpr) {
    // This macro shows the effects of Nemerle's hygiene system.
    // The first statement creates an "x" variable and the second
    // statement (returned by Q.Abs()) takes the absolute value of
    // that "x" variable. The way this works is a little peculiar
    // because:
    // - Q.Abs() is not aware of x, but somehow can still use it
    // - Q.Abs() returns <[ Abs(x) ]>, but Macro4's context is not 
    //   aware of Abs(). P.Q is using System.Math, but the namespace 
    //   of Macro4 is NOT using System.Math... yet it still works.
    def stmt1 = <[ def x = $e; ]>;
    def stmt2 = Q.Abs();
    <[ $stmt1; $stmt2; ]>;
  }
  namespace P {
    using System.Math;
    using Nemerle.Compiler.Parsetree; // for PExpr
    internal partial module Q
    {
      public Abs() : PExpr { <[ Abs(x); ]> }
      public static Seven = 777;
    }
  }
  public macro Macro5(e:PExpr) { 
    // Although "WriteLine" is normally interpreted as Console.WriteLine, 
    // we can also define a "WriteLine" variable and it will not be 
    // misinterpreted as Console.WriteLine.
    <[
      {
        def WriteLine = $e;
        Console.WriteLine(WriteLine);
      }
      WriteLine($e);
    ]>
  }
  public macro Macro6(e:PExpr) { 
    // This example is tricky. We "call" some unknown expression and
    // then call writeline. The call site in Main() actually invokes
    // Macro6(MakeALambdaCalled), so the first line expands to 
    // MakeALambdaCalled(WriteLine). So the MakeALambdaCalled macro
    // is invoked, which creates a lambda in a variable called "WriteLine"
    // that calls Trace.WriteLine() instead of Console.WriteLine().
    // Consequently, the line WriteLine("Hello, Macro") calls 
    // Trace.WriteLine() instead of Console.WriteLine().
    // 
    // So what's the point of this example? It says something about how
    // Nemerle binds variables. It proves that the compiler does not
    // decide what "WriteLine" means inside any of the macros themselves, 
    // but rather the decision is made after the macros have been fully 
    // expanded.
    <[
      $e(WriteLine);
      WriteLine("Hello, Macro6!");
    ]> 
  }
  public macro MakeALambdaCalled(name:PExpr) { <[
    def $name = s => System.Diagnostics.Trace.WriteLine(s);
  ]> }
  public macro Macro7() {
    // A demonstration of how Nemerle does name lookup. In this example we
    // are apparently referring to P.Q.Seven which is 777. However, when
    // the macro is called from Main(), it actually uses System.Linq.Q.Seven
    // which is defined as "seven". Notice that Main() doesn't have a 
    // "using System.Linq" -- the compiler picked up the "using" from
    // this file. Also notice that System.Collections.Generic.Q.Seven exists
    // but the compiler ignores it.
    <[ Q.Seven ]>
  }
}
/* Main.n */
using System.Collections.Generic;
using System;
using Macro;

namespace System.Linq {
  module Q {
    public static Seven = "seven";
  }
}
namespace System.Collections.Generic {
  module Q {
    public static Seven = 7;
  }
}

module Program {
  Main() : void {
    Macro1("Hi!");          // Output: "Hi!"
    //WriteLine("Hi!");     // Error: unbound name 'WriteLine'
    Macro2(2.0);            // Output: 2
    Macro2("Too");          // Output: "Too"
    
    mutable x = "";
    //Macro3("Three");      // Error: unbound name 'x'
    def four = Macro4(-4);  // { def x = -4; System.Math.Abs(x) }
    Console.WriteLine(four);// Output: "4"
    mutable five = "five"; 
    Macro5(five);           // Output: two lines: "five", "five"
    Macro6(MakeALambdaCalled); // "Hello, Macro6!" as debug trace
    Console.WriteLine(Macro7()); // Output: "seven"
    _ = Console.ReadKey(false);
  }
}

Tuesday, September 10, 2013

Extra, extra, read all about Loyc

I recently set up a MediaWiki for Loyc's documentation. If you'd like to learn more about Loyc, go there and read

Saturday, August 24, 2013

Protobuf-net: the unofficial manual

Protobuf-net is a fast and versatile .NET library for serialization based on Google's Protocol Buffers. It's one of those libraries that is both widely used and poorly documented, so usage information is scattered across the internet. Writing a blog post with lots of random information might not solve the situation, but it's better than nothing, right? The majority of the web documentation is in GettingStarted, and small amounts of information exist in XML doc-comments in the source code (which will appear automatically in Visual Studio IntelliSense provided that your protobuf-net.dll file has a protobuf-net.xml file beside it.) The doc-comments are rarely more than a vague hint, though, about what something does or means.

So here are some things I've learned about this library that may not be obvious. I've mentioned a bunch of things that I don't know or that I merely inferred. If you know something I don't, leave a comment!

Overview

Protobuf-net is not compatible with the standard .NET serialization system. It can be configured to behave in a fairly similar way, but one should be aware that
  • protobuf-net ignores the [Serializable] attribute
  • protobuf-net does not support custom serialization using ISerializable and a private constructor, and AFAIK it does not offer something very similar.
However, it does offer several approaches to serialization on a type-by-type basis (see next section).

Protobuf is, of course, limited by the way protocol buffers work. Protocol buffers do not define a concept of "record type" (the deserializer is supposed to know the data type in advance, so you must specify a data type at the root level when deserializing), they have only a few wire formats for data, and all fields in a protocol buffer have only numeric identifiers (called "tags"), not strings. That's why you're supposed to put a [ProtoMember(N)] attribute on every field or property that you want to be serialized.

Standard Protocol Buffers offer no way of sharing objects (using the same object in two places) or supporting reference cycles, because they are simply not designed to be a serialization mechanism. That said, protobuf-net does support references and cycles (more information below). Supporting references and cycles is more expensive in terms of time and output size, so it's disabled by default. There are at least two ways to opt-in:
  1. If you know that instances of a particular class are often shared, use [ProtoContract(AsReferenceDefault=true)]. This will apply to all fields of that type and collections that contain that type.
  2. You can opt-in to the reference system on an individual field using [ProtoMember(N, AsReference=true)]. References will be shared with any other field that also uses AsReference=true. If a certain object is placed both in fields that have AsReference=true and fields that do not, the fields where AsReference=false will hold duplicate copies of the object, and these copies will be deserialized into separate objects. I can only assume that AsReference=false will override AsReferenceDefault=true.
When you use AsReference=true on a collection type, the instances inside the collection are tracked by reference. However, it looks like the collection itself is not tracked by reference. If the same collection object is used in multiple places, it will be serialized multiple times and these copies will not be consolidated during deserialization.

Protobuf is equally comfortable serializing fields and properties, and it can serialize both public and private fields and properties (well, maybe private things won't serialize in Silverlight or partial-trust? I suspect there's a security prohibition in Silverlight). Fields and properties that do not have a [ProtoMember(N)] attribute are normally left unserialized, unless the class has the ImplicitFields option, [ProtoContract(ImplicitFields = ImplicitFields.AllPublic)] or [ProtoContract(ImplicitFields = ImplicitFields.AllFields)], which causes numbers to be assigned automatically.

I can't find documentation for how the numbers are auto-assigned, but it appears to assign numbers to your fields/properties in alphabetical order starting at 1. It will start numbering at 1 even if your class uses ProtoMember(N) explicitly on some fields. The ProtoMember attribute is not ignored, it just doesn't affect the numbering of the fields that don't use ProtoMember, and tag numbers may end up in conflict (e.g. if you use ProtoMember(1), the first auto-assigned number will still be 1, causing a conflict.)

When using ImplicitFields, protobuf-net ignores fields/properties marked [ProtoIgnore]. Just to be clear, if you don't use ImplicitFields, fields and properties are ignored by default and must be explicitly serialized with [ProtoMember(N)].

Protocol buffers don't have an inheritance concept per se, but protobuf-net supports inheritance if you specify [ProtoInclude(N, typeof(DerivedClass))] on the base class. There are multiple ways that inheritance could work. Protobuf-net's approach is to define an optional field in the base class for each possible derived class; the field number N in ProtoInclude refers to one of these optional fields. For example, the following classes:

[ProtoContract(ImplicitFields = ImplicitFields.AllFields)]
[ProtoInclude(100, typeof(Derived))]
[ProtoInclude(101, typeof(Derive2))]
class Base { int Old; }

[ProtoContract(ImplicitFields = ImplicitFields.AllFields)]
class Derived : Base { int New; }
[ProtoContract(ImplicitFields = ImplicitFields.AllFields)]
class Derive2 : Base { int Eew; }

have the following schema:

message Base {
   optional int32 Old = 1 [default = 0];
   // the following represent sub-types; at most 1 should have a value
   optional Derived Derived = 100;
   optional Derived Derive2 = 101;
}
message Derived {
   optional int32 New = 1 [default = 0];
}
message Derive2 {
   optional int32 Eew = 1 [default = 0];
}

Forms of type serialization in protobuf-net

I would say there are five fundamental kinds of [de]serialization that protobuf-net supports on a type-by-type basis (not including primitive types):
  1. Normal serialization. In this mode, a standard protocol buffer is written, with one field in the protocol buffer for each field or property that you have marked with ProtoMember, or that has been auto-selected by ImplicitFields. During deserialization, the default constructor is called by default, but this can be disabled. I heard protobuf-net lets you deserialize into readonly fields (?!?), which should allow you to handle many cases of immutable objects.
  2. Collection serialization. If protobuf-net identifies a particular data type as a collection, it is serialized using this mode. Thankfully, collection types do not need any ProtoContract or ProtoMember attributes, which means you can serialize types such as List<T> and T[] easily. (I don't know how protobuf-net will react if any such attributes are present on your "collection" class). I heard that dictionaries are supported, too.
  3. Auto-tuple serialization. Under certain conditions, protobuf-net can deserialize an immutable type that has no ProtoContract attribute by calling its non-default constructor (constructor with more than zero arguments). Luckily this is fully documented at the link. This feature automatically applies to System.Tuple<...> and KeyValuePair.
  4. String ("parsable") serialization. Protobuf-net can serialize a class that has a static Parse() method. It calls ToString() to serialize, and then Parse(string) to deserialize the string. However, this feature is disabled by default. Enable it by setting RuntimeTypeModel.Default.AllowParseableTypes = true. I don't know whether [ProtoContract] disables the feature.
  5. "Surrogate" serialization, which is very useful for "closed" types that you are not allowed to modify, such as BCL (standard library) types. Instead of serializing a type directly, you can designate a user-defined type as a surrogate using RuntimeTypeModel.Default.Add(typeof(ClosedType), false) .SetSurrogate(typeof(SurrogateType)). To convert between the original type and the surrogate type, protobuf-net looks for conversion operators on the surrogate type (public static implicit operator ClosedType, public static explicit operator SurrogateType); implicit and explicit both work. The conversion operator is invoked even if the "object" to be converted is null.
Let's talk a little about deserialization. Because in order to do that, it needs an object. There are three ways it can get an object:
  1. "Like XML serialization". In this mode, which is the default, your class must have a default constructor. Before starting to deserialize, your default (meaning zero-argument) constructor is called. However, unlike XML serialization, protobuf-net can call a private constructor.
  2. "Like standard serialization". The option [ProtoContract(SkipConstructor = true)] will deserialize without calling the constructor, like in standard serialization. Magic! If you add ImplicitFields = ImplicitFields.AllFields, protobuf-net behaves even closer to standard serialization.
  3. Apparently you can deserialize into an object that you created yourself, at least at the root level. Call the (non-static) RuntimeTypeModel.Deserialize(Stream, object, Type) method (e.g. RuntimeTypeModel.Default.Deseralize()). I don't know why it needs both an object and a Type. And perhaps it can do the same trick for sub-objects, I'm not sure. Anyway this is handy if you want to deserialize, examine, and discard several objects in a row; you can avoid creating unnecessary work for the garbage collector.
It should be noted that some techniques that protobuf-net supports are only available when using the full .NET framework in full-trust mode. As mentioned here, for example, if you're using the precompiler then you won't be able to deserialize into private readonly fields. Personally I am using the full .NET framework, and I don't know what gotchas lie in wait in other environments.

Protobuf-net collection handling

  • Protobuf-net serializes a collection using a "repeated" field (in protocol buffer lingo). Therefore, you should be able to safely change collection types between versions. For example, you can serialize a Foo[] and later deserialize it into a List.
  • I don't know exactly how protobuf-net decides whether a given type is a collection. Wild guess: it might look for the IEnumerable<T> interface and an Add method?
  • If you serialize a class that contains an empty but non-null collection, protobuf-net does not seem to distinguish an empty collection from null. When you deserialize the object, protobuf-net will leave the collection equal to null (or if your constructor created a collection, it will remain created).
  • By default, protobuf-net "appends" to a collection rather than replacing one. So if you write a default constructor (without suppressing it) that initializes an int[] array to 10 items, and then deserialize a buffer with 10 items, you'll end up with an array of 20 items. Oops. Use the [ProtoMember(N, OverwriteList=true)] option to replace the existing list (if any) instead (or use SkipConstructor).

Random facts and gotchas

  • Protobuf.net's precompiler let's you use protobuf-net on platforms where run-time code generation or reflection are not available. It can also be used on the full .NET framework to avoid some run-time work.
  • By default, protobuf-net will accept [XmlType] and [XmlElement(Order = N)] in place of [ProtoContract] and [ProtoMember(N)], which is nice if you're already using XML serialization or if you want to avoid explicitly depending on protobuf-net. Similarly, it accepts the WCF attributes [DataContract] and [DataMember(Order = N)]. The Order option is required for protobuf support.
  • [XmlInclude] and [KnownType] cannot be used in place of [ProtoInclude] because they do not have an integer parameter to use as the tag number.
  • Tag values must be positive. [ProtoMember(0)] is illegal.
  • Very useful: print the result of RuntimeTypeModel.GetSchema(typeof(Root)) to find out what protocol buffers will be used to represent your data (e.g. RuntimeTypeModel.Default.GetSchema). Note: I assume this will be the actual schema used for serialization, but the documentation says merely that it will "Suggest a .proto definition".
  • A constructor with a default argument like Constr(int x = 0) is not recognized as a parameterless constructor.
  • The SkipConstructor and ImplicitFields options are not inherited, and probably other options are not inherited either. So, for example, if you use SkipConstructor on a base class, the constructor of the derived class is still called (and, by implication, the base class constructor).
  • You may have noticed the "Visual Studio 2008 / 2010 support" download package on the download page, but what the heck is it for? I haven't tried it myself, but based on this blog entry, I suspect it's a tool for generating C# code from a .proto file automatically.
  • Sometimes protobuf-net can serialize something but not deserialize it; be sure to test both directions.
  • RuntimeTypeModel.DeepClone() is a handy way to test whether serialization and deserialization both work. This method will typically serialize the object and immediately deserialize it again.
  • I suspect you can use the standard [NonSerialized] attribute on a field or property, as an alternative to [ProtoIgnore].
  • When serializing a sub-object, protobuf-net can either write a length-prefixed buffer, or if the field that contains the sub-object has the [ProtoMember(N, DataFormat = DataFormat.Group)] option, it can use what Marc Gravell calls a "group-delimited" record, which avoids the overhead of measuring the record size in advance. Google calls this feature "groups", but it has deprecated the feature and AFAICT, removed any documentation about it that might have existed in the past.
  • In [ProtoMember], the default value of IsRequired is false. I can only assume it means the same as required in a .proto file. I'm guessing that a required field is always written and that there will be some sort of exception if a required field is missing during deserialization.
  • Protobuf-net supports Nullable<T>. I heard that a value of type int? or any other nullable type will simply not be written to the stream if it is null (therefore, I have no idea what happens if you use IsRequired=true on a nullable field--and that includes a reference-typed field.)
  • Protobuf-net will (reasonably) refuse to serialize a property that does not have a setter, saying "Unable to apply changes to property". It will, however, serialize a property whose setter is private, and call that setter during deserialization.
  • If you download the source code of protobuf-net via subversion, you'll have access to the Examples project, which demonstrates various ways to use the library.

Serializing without attributes

If you can't change an existing class to add [ProtoContract] and [ProtoMember] attributes, you can still use protobuf-net with that class. But before I tell you how, it should be mentioned that protobuf-net's configuration is stored in a class called RuntimeTypeModel (in the Protobuf.Meta namespace). There is one global model, RuntimeTypeModel.Default, and you can create additional models with the static method TypeModel.Create(). This makes it possible to serialize the same class in different ways, using different protocol buffers in the same program.

Suppose you have a RuntimeTypeModel object called model. Then model.Add(typeof(C), true) creates a configuration for type C, represented by a MetaType object. You can also call model[typeof(C)] to get or create a MetaType, although I'm not sure what the relationship is between model[type] and model.Add(type, flag) even after decompiling both of them.
  • Call model.Add(typeof(C), false).SetSurrogate(typeof(S)) to establish S as a substitute for C during serialization. If a surrogate is used, all other options for C are ignored (the options for S are used instead). If not using a surrogate, you should probably use model.Add(typeof(C), true) instead although I'm uncertain what the true flag actually does, whether it's equivalent to [ProtoContract] or does something else.
  • model[type].Add(7, "Foo").Add(5, "Bar") is equivalent to the attributes [ProtoMember(7)] on the field/property Foo, and [ProtoMember(5)] on the field/property Bar.
  • model[type].Add("Fizz", "Buzz", ...) assigns tag numbers sequentially starting from 1 or, if some tag numbers exist already, from the highest existing tag number plus one. So if the type has no fields defined yet, Fizz will be #1 and Buzz will be #2.
  • model[type].AddSubType(100, typeof(Derived)) is equivalent to [ProtoInclude(100, typeof(Derived))]. Example here.
Finally, call model.Serialize(Stream, object) or model.Deserialize(Stream, object, Type) (or another overload).

Data types

The following types illustrate how protobuf-net maps primitive types to protocol buffers:

C#
[ProtoContract]
class DefaultRepresentations
{
  [ProtoMember(1)] int Int;
  [ProtoMember(2)] uint Uint;
  [ProtoMember(3)] byte Byte;
  [ProtoMember(4)] sbyte Sbyte;
  [ProtoMember(5)] ushort Ushort;
  [ProtoMember(6)] short Short;
  [ProtoMember(7)] long Long;
  [ProtoMember(8)] ulong Ulong;
  [ProtoMember(9)] float Float;
  [ProtoMember(10)] double Double;
  [ProtoMember(11)] decimal Decimal;
  [ProtoMember(12)] bool Bool;
  [ProtoMember(13)] string String;
  [ProtoMember(14)] DayOfWeek Enum;
  [ProtoMember(15)] byte[] Bytes;
  [ProtoMember(16)] string[] Strings;
  [ProtoMember(17)] char Char;
}
.proto
message DefaultRepresentations {
  optional int32 Int = 1 [default = 0];
  optional uint32 Uint = 2 [default = 0];
  optional uint32 Byte = 3 [default = 0];
  optional int32 Sbyte = 4 [default = 0];
  optional uint32 Ushort = 5 [default = 0];
  optional int32 Short = 6 [default = 0];
  optional int64 Long = 7 [default = 0];
  optional uint64 Ulong = 8 [default = 0];
  optional float Float = 9 [default = 0];
  optional double Double = 10 [default = 0];
  optional bcl.Decimal Decimal = 11 [default=0];
  optional bool Bool = 12 [default = false];
  optional string String = 13;
  optional DayOfWeek Enum = 14 [default=Sunday];
  optional bytes Bytes = 15;
  repeated string Strings = 16;
  optional uint32 Char = 17 [default = 

(there's a bug in GetSchema; the output is truncated after a field of type char.)

You can look at the protocol buffer documentation, particularly Encoding, for more information. All of the integer types use the varint wire format. Because protobuf-net uses int32/int64 rather than sint32/sint64, negative numbers are stored inefficiently, but you can use the DataFormat option to choose a different representation, as shown below. sint32/sint64 (called ZigZag in protobuf-net) are better for fields that are often negative; sfixed32/sfixed64 (FixedSize in protobuf-net) are better for numbers have large magnitude most of the time (or if you just prefer a simpler storage representation).

By the way, string and byte[] ("bytes" in the .proto) both use length-prefixed notation. Sub-objects (aka "messages") also use length-prefixed notation (is this documented anywhere?) unless you use the "group" format.

C#
[ProtoContract]
class ExplicitRepresentations
{
  [ProtoMember(1, DataFormat = DataFormat.TwosComplement)] int defaultInt;
  [ProtoMember(2, DataFormat = DataFormat.TwosComplement)] int defaultLong;
  [ProtoMember(3, DataFormat = DataFormat.FixedSize)] int fixedSizeInt;
  [ProtoMember(4, DataFormat = DataFormat.FixedSize)] long fixedSizeLong;
  [ProtoMember(5, DataFormat = DataFormat.ZigZag)] int zigZagInt;
  [ProtoMember(6, DataFormat = DataFormat.ZigZag)] long zigZagLong;
  [ProtoMember(7, DataFormat = DataFormat.Default)] SubObject lengthPrefixedObject;
  [ProtoMember(8, DataFormat = DataFormat.Group)] SubObject groupObject;
}
[ProtoContract(ImplicitFields=ImplicitFields.AllFields)]
class SubObject { string x; }

.proto
message ExplicitRepresentations {
   optional int32 defaultInt = 1 [default = 0];
   optional int32 defaultLong = 2 [default = 0];
   optional sfixed32 fixedSizeInt = 3 [default = 0];
   optional sfixed64 fixedSizeLong = 4 [default = 0];
   optional sint32 zigZagInt = 5 [default = 0];
   optional sint64 zigZagLong = 6 [default = 0];
   optional SubObject lengthPrefixedObject = 7;
   optional group SubObject groupObject = 8;
}
message SubObject {
   optional string x = 1 [default = 0];
}

By the way, since Google doesn't seem to document the "group" format, I'll show you how the two sub-object formats look in binary:

[ProtoContract]
class SubMessageRepresentations
{
  [ProtoMember(5, DataFormat = DataFormat.Default)] 
  public SubObject lengthPrefixedObject;
  [ProtoMember(6, DataFormat = DataFormat.Group)]
  public SubObject groupObject;
}
[ProtoContract(ImplicitFields=ImplicitFields.AllFields)]
class SubObject { public int x; }
/*
message SubMessageRepresentations {
   optional SubObject lengthPrefixedObject = 5;
   optional group SubObject groupObject = 6;
}
message SubObject {
   optional int32 x = 1 [default = 0];
}
*/

using (var stream = new MemoryStream()) {
  _pbModel.Serialize(
    stream, new SubMessageRepresentations {
      lengthPrefixedObject = new SubObject { x = 0x22 },
      groupObject = new SubObject { x = 0x44 }
    });
  byte[] buf = stream.GetBuffer();
  for (int i = 0; i < stream.Length; i++)
    Console.Write("{0:X2} ", buf[i]);
}
// Output: 2A 02 08 22 33 08 44 34 

Vesioning

  • You can safely remove a serialized field (or a derived class) between versions of your software. Protobuf-net simply silently drops fields that are in the data stream but not in the class. Just be careful not to use the removed tag number again.
  • You must not change the value of AsReference (or AsReferenceDefault) between versions.
  • It's handy to save the results of GetSchema (mentioned above) to keep track of your old versions. You can then "diff" two schema versions to spot potential incompatibilities.
  • Don't change the storage representation of an integer between versions; your numbers will be silently messed up if you migrate between TwosComplement and ZigZag. In theory I think protobuf-net could handle a change from FixedSize to TwosComplement or vice-versa, but I don't know if it actually can. Similarly, a change from FixedSize int to FixedSize long could work in theory, but I don't know about practice.
  • On the other hand, you can increase the size of a non-FixedSize integer, e.g. byte to short or int to long, since the wire format is unchanged.
  • The Google docs have more information relevant to versioning.

How references work

When serializing a class Foo to which AsReference or AsReferenceDefault applies, the type of the field in the protocol buffer changes from Foo to bcl.NetObjectProxy, which is defined as follows in the source code of protobuf-net (Tools/bcl.proto):
message NetObjectProxy {
  // for a tracked object, the key of the **first** 
  // time this object was seen
  optional int32 existingObjectKey = 1; 
  // for a tracked object, a **new** key, the first 
  // time this object is seen
  optional int32 newObjectKey = 2;
  // for dynamic typing, the key of the **first** time 
  // this type was seen
  optional int32 existingTypeKey = 3; 
  // for dynamic typing, a **new** key, the first time 
  // this type is seen
  optional int32 newTypeKey = 4;
  // for dynamic typing, the name of the type (only 
  // present along with newTypeKey) 
  optional string typeName = 8;
  // the new string/value (only present along with 
  // newObjectKey)
  optional bytes payload = 10;
}
So it appears that

  • The first time an object is encountered, the newObjectKey and a payload fields are written; presumably, the payload is stored as if its type is Foo.
  • When the object is encountered again, just the existingObjectKey is written.
I don't know what that "dynamic typing" business is about.

Strings, of course, are a reference type. Read here about how protobuf-net handles strings.

More stuff I don't know yet

  • I don't know if protobuf-net is capable of deserializing a field of type object. By default, it can't.
  • I don't know how protobuf-net deserializes fields of an interface type, e.g. IEnumerable<T> or IFoo. (Serializing is easy, of course, it just uses GetType() to learn the type.)
  • I don't know whether the root object is allowed to be a collection.
  • I don't know how to run "prep" code before serialization of a particular type, or validation/cleanup code after an object is deserialized, but I do know that some sort of "callback" mechanism exists for this purpose.

Other sources of information: