Saturday, February 22, 2014

# in operator names to be removed

Originally I decided that in Loyc trees, '#' would be a prefix that meant "special" or "keyword". So reserved words or language constructs would be represented as identifiers that start with '#', e.g. #class, #enum, #for, #if, #while, etc.

Operators seemed special, so I also decided that operators would have the same prefix: #+, #==, #>>, etc. I've changed my mind, though, for two reasons:
  1. I no longer think operators are so special. LES and some other languages have an infinite number of potential operators and they're treated identically to normal functions, apart from their syntax.
  2. No language I know of uses the specific prefix '#'
  3. Most importantly, the syntax for naming operators turned out to be cumbersome: because an operator name like #+ generally cannot be used directly as an identifier in most languages, some sort of escaping mechanism is needed. In LES and EC# you have to write, in general, @`special identifier` (not just `special identifier` because the latter is treated as an operator, not an identifier). Therefore the identifier #+ is written @`#+` (@#+ also works in LES), which is a bit cumbersome and (more importantly IMO) creates a teaching challenge: I have to explain both what # is for and also what @ and the backticks are for, and the student must remember not to mix up @ and #. Plus, when someone encounters LES or EC# code for the first time it'll be harder to guess what @`#+` means compared to @`+`, and Google won't help ("Your search - @`#+` - did not match any documents.")
I've seen a few languages that allow you to use operators like identifiers. The conventions I've seen are: C#/C++ operator+, Haskell (+), and Nemerle @+. In all of these cases the additional characters near + can be understood as an escaping mechanism, so that the actual identifier name is simply +.

Therefore, I will soon dedicate a day or two to removing the hash characters from the beginning of all operators in both LES and EC#. Possibly, I'll remember to update the documentation too. We'll see.

Monday, February 3, 2014

Using Loyc.Essentials: collection interfaces

Loyc.Essentials is a library of "stuff that should be in the .NET BCL, but isn't." Today I will present one of the most important parts of Loyc.Essentials: its collection interfaces.

Programmers with less experience may underestimate the importance of data structures and collection interfaces in software engineering. I have come to believe that collection interfaces are very important, and that the built-in collection interfaces of the .NET framework are somewhat impoverished and inflexible. I like to work with a variety of data structures and I like to have a variety of collection interfaces available to reflect that variety. There's so much more out there than just List<T>: linked lists, B+ trees, VLists, ALists, Hash Array Mapped Tries, Bloom filters, other search trees and tries... the possibilities are endless.

First, let me talk a little about the limitations that the current BCL interfaces suffer from.

The .NET framework does not support much diversity. Traditionally it defined just four collection interfaces: IEnumerable<T>, ICollection<T>, IList<T>, and IDictionary<T>, and all but one were mutable.

Flaw #1: IList<T>/ICollection<T> contain some methods that are almost never used, like CopyTo(), but lack methods that are often desired, like AddRange() and RemoveRange(), and they lack methods that would be often used if they existed, like Slice().

Flaw #2: Lack of support for read-only interfaces. In .NET 4.5, they finally introduced IReadOnlyCollection<T>, IReadOnlyList<T> and IReadOnlyDictionary<T>, but they didn't derive IList<T> from IReadOnlyList<T> and they didn't declare that the core collection classes actually implement these interfaces. Plus, "IReadOnlyCollection" is an terribly long name for simply adding a Count property to IEnumerable (Loyc.Essentials used to define exactly the same interface, but it was called IEnumerableCount<T>).

Ideologically, I'm opposed to the idea that a mutable interface such as IList<T> should not implement IReadOnlyList<T>. If a method takes an argument of type IReadOnlyList<T>, it should mean that the method will not modify the list; it should not mean that the caller has to create a special read-only wrapper object on the heap, and it certainly should not mean that the caller has to guarantee that the list will never change. IReadOnlyList<T> should not be a burden or difficulty that the caller must overcome, rather it should simply be a promise that the called method will not modify the list. That leads me to flaw #3...

Flaw #3: because IReadOnlyList<T> and IList<T> are officially unrelated, and IReadOnlyList<T> is not supported very well, if a data structure implements both interfaces, C# "ambiguity" errors are inevitable. Here's how it typically happens: you want to declare a method that takes a list but doesn't modify it:
  void Foo(IReadOnlyList<T> list) {...}
But you notice that most collection classes don't implement IReadOnlyList<T>, so you add an overload that automatically converts IList<T> to IReadOnlyList<T> with an extension method (e.g. AsListSource() in the Loyc.Collections namespace):
  // Alias for Foo(IReadOnlyList<T>)
  void Foo(IList<T> list) { Foo(list.AsListSource()); }
So far, so good. But as new data structures are implemented, they will naturally implement IReadOnlyList<T>. If they are mutable, they will want to implement IReadOnlyList<T> so that it will be possible to pass them to methods that accept IReadOnlyList; meanwhile, immutable data structures might choose to explicitly implement IList<T> in order to stay compatible with the large amount of older code based on IList<T> for read-only access. So what's the problem? Suppose I have a data structure MyList<T> that implements both interfaces. Look what happens when I call Foo:
Foo(new MyList<T>()); // ERROR: The call is ambiguous 
  // between the following methods or properties: 
  // '...Foo(System.Collections.Generic.IList<T>)' and 
  // '...Foo(System.Collections.Generic.IReadOnlyList<T>)'
Flaw #4: A lack of variety and flexibility. If you actually survey the various data structures that exist, you'll find a wide variety of capabilities that they may or may not offer.
  • A data structure might support an indexing operator but not a Count property (consider the Sieve of Eratosthenes of infinite size). Other data structures may support a Count property in theory, but calling it should be avoided because it may be expensive (in Loyc.Essentials, BufferedSequence<T> has this property).
  • A data structures may be queue- or stack-like, offering access to the beginning/end of a data structure but not the middle.
  • A data structure might not be zero-based, e.g. the range of valid indexes could be 1..100 or -100..100 or 35..39.
  • A data structure may be sparse, in which case scanning it for items with a for-loop from 0..Count is inefficient.
  • A data structure may be a sink, allowing insertion of data but not reading (write-only files and pipes may be viewed this way).
  • Most list data structures could allow enumeration to start at any point in the middle, but .NET ignores this possibility; in addition, indexing is less efficient than enumeration for some lists (e.g. AList), so this should be supported.
  • A data structures may not support access by integer index, but could still allow bi-directional enumeration moving both forward and backward), e.g. a doubly-linked list, with insertion locations denoted by enumerator positions.
  • Many data structures can easily tell you if they are empty, but require O(N) time to report their Count. .NET almost forces you to use Count == 0 to detect emptiness; you can use LINQ's Any() function instead, but this requires a heap allocation and two interface calls.
This list is certainly incomplete; the possible combinations of operations that data structures may or may not support is just about endless.

Other collection libraries outside .NET have recognized this diversity, and provide more concepts or interfaces to address it, such as the C++ STL's various iterator and container concepts, or D's ranges.

Suppose I write a method that reads a list starting at the outside and moving inward (e.g. first element, last element, second element, second-last element, etc.). A linked list can certainly support this order of traversal; unfortunately, in .NET, the only standard collection interface that can be used in this situation is IList<T>, which a linked list data structure cannot support efficiently. Thanks to .NET's simplistic interfaces, my method can either support standard linked lists, or normal lists, but not both at once.

This is the central problem that the .NET interfaces fail to solve. When you write a method or a class that operates on a collection, you cannot declare precisely what kind of collection you accept, so most of the time you have to mandate functionality you don't need. Some collection types will not be able to provide that functionality that you mandated but didn't need, so they won't implement the interface and will be incompatible, unnecessarily. Meanwhile, other collection types might implement the interface even though they are not compatible, and then throw an exception when you try to use unsupported functionality.

How Loyc.Essentials improves the situation

Loyc.Essentials improves the situation mainly by providing a much wider variety of interfaces. The new interfaces mostly address flaw #4, but can only partly alleviate the pain of flaws 1, 2 and 3.

Interfaces are cheap to define: they are only a few lines of code plus some documentation, so it's well worth the effort of defining a variety of them, even if no implementations are provided (too bad Microsoft didn't see it that way). Before I get to the actual interfaces, here's a list of the key concepts that are supported:
  • Sources: objects that provide data. Source interfaces are read-only.
  • Sinks: objects that accept data. Sink interfaces are write-only.
  • Slices: subsections of an indexed list, e.g. elements 5..10 of a larger list. The broad class of Divide-and-conquer algorithms benefit from these.
  • Ranges: Similar to slices, ranges represent a subsection of a collection, but they may only allow access to the first, or first and last, elements of that subsection. For performance reasons, ranges are mutable in the sense that you can shrink the range one element at a time. This does not modify the underlying collection, it just causes the range to point to a smaller area in the same collection. Ranges were popularized by D, and while they don't work as well in .NET, I thought that the interfaces were still worth providing. Here is a long-ish article about my design of Ranges in .NET.
  • Fancy enumerators: after designing the range interfaces, I realized that often all you need is an enumerator that can go backward. Binumerators can travel both backward and forward through a collection, and mutable variants allow the current item to be changed or deleted.
  • Specific categories of data structures: queues, arrays, and sparse lists.
  • "Neg" non-zero-based lists: indexed lists for which the minimum index is not necessarily zero.

The Loyc.Essentials collection interfaces

The full documentation of these interfaces can be seen in the source code and will be provided automatically by Visual Studio Intellisense, as long as you have the Loyc.Essentials.xml file alongside your copy of Loyc.Essentials.dll.

First, an interface you won't use: ICount:
  public interface ICount : IIsEmpty
  {
    int Count { get; }
  }
I thought the concept of "having a count" should be separated because some collections may have a count but do not allow enumeration or vice versa. For example, I defined IStack<T> as a collection that had a Count but did not allow enumeration. However, Microsoft made ICount unusable in .NET 4.5 by defining IReadOnlyCollection:
  public interface IReadOnlyCollection : IEnumerable
  {
    int Count { get; }
  }
How does this new interface make it impossible to use ICount? Well, if I define any interface that inherits from both IReadOnlyCollection and ICount, it becomes impossible (well, quite difficult) to call Count on that interface because the C# compiler says that the reference to Count is "ambiguous". The end result is that people won't bother using ICount at all (admittedly, the number of cases where somebody needs the Count of a collection, and nothing else, is very small). For completeness, I recently added IIsEmpty, because as I mentioned, IsEmpty can run much faster than Count in some data structures:
  public interface IIsEmpty
  {
    bool IsEmpty { get; }
  }
However, I decided not to mandate this property as part of the common Loyc.Essentials interface IListSource<T>, although it is a part of IFRange<T> (listed near the bottom of this article).

My single most favorite interface in Loyc.Essentials is IListSource<T>. As its name implies, it is a kind of source; it's basically IReadOnlyCollection<T> with extra functionality:
  public interface IListSource<out T> : IReadOnlyList<T>
  {
    // If index is invalid, sets fail=true and returns default(T)
    T TryGet(int index, out bool fail); 
    IRange<T> Slice(int start, int count = int.MaxValue);
  }
Both of these new functions are favorites of mine, although it's usually easier to call one of the standard TryGet() extension methods (especially the second one):
  public static partial class LCInterfaces
  {
    public static bool TryGet<T>(this IListSource<T> list, 
                                 int index, ref T value);
    public static T TryGet<T>(this IListSource<T> list,
                              int index, T defaultValue);
  }
Plus, a couple of standard methods of IList<T> are added as extension methods:
  public static partial class LCInterfaces
  {
    public static int IndexOf<(this IReadOnlyList<T> list, T item)
    public static void CopyTo<T>(this IReadOnlyList<T> c, T[] array, int arrayIndex)
  }
And there are other extension methods that will be covered in a future article.

I don't know about you, but in my code I often have to check whether the index is in range before I call the indexer:
  if (index < list.Count && list[index] >= 0) {...}
With IListSource this is easier:
  if (list.TryGet(index, -1) >= 0) {...}
In theory, this version should be faster too, because it needs only one interface dispatch, not two. That's not the only reason for TryGet() to exist, though; there are a few collection types for which it is expensive to call Count. Why is it "ref bool fail" rather than "out bool fail"? Well, I'm considering whether to change it to "out". My thinking was that on the usual case of success, the collection can save a nanosecond by not setting fail=false, requiring the caller to set fail=false instead (this only makes a difference in a loop). When performance doesn't matter, you will typically call one of the extension methods instead. The first extension method uses "ref T" for a completely different reason; this allows the caller to set a default T value before calling the method, since you will not always want to use default(T) as the default.

Meanwhile, the Slice() method returns a sub-section of the list without allocating a new list. For example, list.Slice(10, 10) gets the collection of 10 items starting at index 10. IRange<T> is a superset of IListSource<T>; see definition below. Don't worry, if you're creating a collection class, you never actually have to write a class that implements IRange<T>. Just use this implementation:
  public IRange<T> Slice(int start, int count) {
    return new Slice_<T>(this, start, count);
  }
or the potentially more efficient version,
  IRange<T> IListSource<T>.Slice(int start, int count)
    { return Slice(start, count); }
  public Slice_<T> Slice(int start, int count)
    { return new Slice_<T>(this, start, count); }
Slice_ is a struct in Loyc.Essentials that provides a "view" on part of a read-only list. Why the underscore? I wanted to call it simply "Slice", but that is illegal in C# because Slice_ itself implements IListSource<T>, so it contains a method named "Slice", and a method cannot have the same name as its containing class. By the way, there's also a ListSlice<T> class and Slice() extension method for slicing IList<T>.

Next up, here are the "neg lists". These are list interfaces that do not (necessarily) use zero as the minimum index. So far, I haven't used these interfaces in practice.
  public interface INegListSource<T> : IReadOnlyCollection<T>
  {
    int Min { get; }
    int Max { get; }
    T this[int index] { get; }
    T TryGet(int index, out bool fail);
    IRange<T> Slice(int start, int count = int.MaxValue);
  }
  public interface INegArray : INegListSource
  {
    new T this[int index] { set; get; }
    bool TrySet(int index, T value);
  }
  public interface INegAutoSizeArray<T> : INegArray<T>
  {
    void Optimize();
  }
  public interface INegDeque<T> : INegArray<T>, IDeque<T> 
  {
  }
Here's documentation, if you're interested. INegAutoSizeArray is interesting: it automatically enlarges itself when you write to an index below Min or above Max.

Next there's a higher-performance variation on INotifyCollectionChanged, which can only be implemented by collections that implement IListSource<T>:
  public interface INotifyListChanging<T>
  {
    event ListChangingHandler<T> ListChanging;
  }
  public delegate void ListChangingHandler<T>(
    IListSource<T> sender, ListChangeInfo<T> args);
  [Serializable]
  public class ListChangeInfo<T> : EventArgs
  {
    public ListChangeInfo(NotifyCollectionChangedAction action,
      int index, int sizeChange, IListSource<T> newItems);

    public readonly NotifyCollectionChangedAction Action;
    public readonly int Index;
    public readonly int SizeChange;
    public readonly IListSource<T> NewItems;
  }
Why is this faster? Because unlike INotifyCollectionChanged, this interface requires as little as one memory allocation each time the list changes (for the ListChangeInfo object). INotifyCollectionChanged requires the collection to allocate a list of OldItems and NewItems; but because ListChanging fires before the list changes rather than afterward, there is no need to allocate a list of old items--the event handler can simply look at the current state of the list. Meanwhile, it is possible to optimize NewItems not to require an allocation.

Admittedly, I have found that sometimes an event handler would really prefer to see the list after it changes rather than before. But if performance matters, this interface may be better. You be the judge.

Next I present this peculiar bunch of interfaces, which reflects the difficulty I had designing an interface for immutable sets:
  public interface ISetTests<SetT>
  {
    bool IsProperSubsetOf(SetT other);
    bool IsProperSupersetOf(SetT other);
    bool IsSubsetOf(SetT other);
    bool IsSupersetOf(SetT other);
    bool Overlaps(SetT other);
    bool SetEquals(SetT other);
  }
  public interface ISetOperations<in T, SetT> 
    : ISetOperations<T, SetT, SetT> { }
  public interface ISetOperations<in T, in InSetT, out OutSetT>
  {
    OutSetT With(T item);
    OutSetT Without(T item);
    OutSetT Union(InSetT other);
    OutSetT Intersect(InSetT other);
    OutSetT Except(InSetT other);
    OutSetT Xor(InSetT other);
  }
  public interface ISetImm<T, SetT> : ISetOperations<T, SetT>, 
    ISetTests<SetT>, IReadOnlyCollection<T> //, ICount
  {
    bool IsInverted { get; }
  }
  public interface ISetImm<T> : ISetOperations<T, IEnumerable<T>>, 
    ISetTests<IEnumerable<T>>, IReadOnlyCollection<T>
  {
  }
The most important interface is at the end: ISetImm<T>. If you work through it logically, you can see that ISetImm<T> contains the following members:
public interface ISetImm<T> : ...
{
  bool IsProperSubsetOf(IEnumerable<T> other);
  bool IsProperSupersetOf(IEnumerable<T> other);
  bool IsSubsetOf(IEnumerable<T> other);
  bool IsSupersetOf(IEnumerable<T> other);
  bool Overlaps(IEnumerable<T> other);
  bool SetEquals(IEnumerable<T> other);
  ISetImm<T> With(T item);
  ISetImm<T> Without(T item);
  ISetImm<T> Union(IEnumerable<T> other);
  ISetImm<T> Intersect(IEnumerable<T> other);
  ISetImm<T> Except(IEnumerable<T> other);
  ISetImm<T> Xor(IEnumerable<T> other);
  bool IsInverted { get; }
}
.NET already defines a set interface for mutable sets, ISet<T>, but immutable sets are more convenient to work with so I added this interface. In Loyc.Collections.dll, I implemented a very nice immutable set class called Set<T>, which is a kind of Hash Tree. Originally my Set<T> structure implemented only ISetImm<T, Set<T>>, because it was easier, but eventually I also implemented the more generic ISetImm<T> interface. I also made a mutable variant of Set<T>, the MSet<T> class, which implements both ISet<T> and ISetImm<T>.

Finally, I created an InvertibleSet<T> class, which can represent everything that is not in some other set; to account for this possibility, I added the IsInverted property to ISetImm.

Sparse lists are indexed lists in which regions of indexes may be unused. Here are the sparse list interfaces I defined:
  public interface ISparseListSource<T> : IListSource<T>
  {
    int? NextHigher(int index);
    int? NextLower(int index);
    bool IsSet(int index);
  }
  public partial class LCInterfaces
  {
    public static IEnumerable<KeyValuePair<int, T>> Items<T>(
      this ISparseListSource<T> list);
  }
  public interface ISparseList<T>
    : ISparseListSource<T>, IListAndListSource<T>
  {
    void Clear(int index, int count = 1);
    void InsertSpace(int index, int count = 1);
  }
  public interface ISparseListEx<T> 
    : ISparseList<T>, IListEx<T>, IAutoSizeArray<T>
  {
    void InsertRange(ISparseListSource<T> list);
  }
ISparseListSource is a read-only sparse list, while ISparseList is mutable and ISparseListEx additionally allows you to insert the contents of one entire sparse list into another. According to the definitions shown here, an ISparseList<T> is fundamentally different from an IDictionary<int, T> because when you insert or remove an item from a sparse list at index i, the index associated with each and every item above index i also changes. A Dictionary can't do that, while a good sparse list implementation can do it efficiently.

The motivating example that led me to create these interfaces is the concept of anchor points in a text editor, such as the locations of error messages, breakpoints, and syntax highlighting markers. Clearly, this data is sparse: the majority of characters do not have an associated error message or color change, so ideally we should store this information in some sort of "compressed" data structure. As the user types new text in the middle of the file, all of these associated locations that appear in the second half should be shifted; that's what a sparse list does. Loyc.Collections offers one implementation if ISparseList<T>, called SparseAList<T>. It is not always a fast data structure, but it scales up effectively to very large files.

Next up, here are the IQueue, IStack and IDeque interfaces, with associated extension methods for added convenience, which are mostly self-explanatory if you know what queues and stacks are:
  public interface IPush<in T>
  {
    void Push(T item);
  }
  
  public interface IPop<out T>
  {
    T TryPop(out bool isEmpty);
    T TryPeek(out bool isEmpty);
    bool IsEmpty { get; }
  }
  public static partial class LCInterfaces
  {
    public static T Pop<T>(this IPop<T> c);
    public static T Peek<T>(this IPop<T> c);
    public static bool TryPop<T>(this IPop<T> c, out T value);
    public static bool TryPeek<T>(this IPop<T> c, out T value);
    public static T TryPop<T>(this IPop<T> c);
    public static T TryPop<T>(this IPop<T> c, T defaultValue);
    public static T TryPeek<T>(this IPop<T> c);
    public static T TryPeek<T>(this IPop<T> c, T defaultValue);
  }
  
  public interface IQueue<T> : IPush<T>, IPop<T>, ICount
  {
  }
  public interface IStack<T> : IPush<T>, IPop<T>, ICount
  {
  }
  
  public interface IDeque<T>: ICount, IIsEmpty
  {
    void PushFirst(T item);
    void PushLast(T item);
    T TryPopFirst(out bool isEmpty);
    T TryPeekFirst(out bool isEmpty);
    T TryPopLast(out bool isEmpty);
    T TryPeekLast(out bool isEmpty);

    T First { get; set; }
    T Last { get; set; }
    bool IsEmpty { get; }
  }

  public static partial class LCInterfaces
  {
    public static T PopFirst<T>(this IDeque<T> c)
    public static T PopLast<T>(this IDeque<T> c)
    public static T PeekFirst<T>(this IDeque<T> c)
    public static T PeekLast<T>(this IDeque<T> c)
    public static bool TryPopFirst<T>(this IDeque<T> c, out T value)
    public static bool TryPopLast<T>(this IDeque<T> c, out T value)
    public static bool TryPeekFirst<T>(this IDeque<T> c, out T value)
    public static bool TryPeekLast<T>(this IDeque<T> c, out T value)
    public static T TryPopFirst<T>(this IDeque<T> c)
    public static T TryPopFirst<T>(this IDeque<T> c, T defaultValue)
    public static T TryPopLast<T>(this IDeque<T> c)
    public static T TryPopLast<T>(this IDeque<T> c, T defaultValue)
    public static T TryPeekFirst<T>(this IDeque<T> c)
    public static T TryPeekFirst<T>(this IDeque<T> c, T defaultValue)
    public static T TryPeekLast<T>(this IDeque<T> c)
    public static T TryPeekLast<T>(this IDeque<T> c, T defaultValue)
  }
Currently IQueue and IStack are not being used, but IDeque<T> is implemented by DList<T>. Next up, here are the sink interfaces, which let you put data in but not take it out.
  /// <summary>An interface for depositing items. Includes only an Add(T) method.</summary>
  public interface IAdd<in T>
  {
    void Add(T item);
  }
  /// <summary>Represents a write-only collection: you can modify it, but you
  /// cannot learn what it contains.</summary>
  public interface ISinkCollection<in T> : IAdd<T>
  {
    void Clear();
    bool Remove(T item);
  }
  /// <summary>Represents a write-only array.</summary>
  public interface ISinkArray<in T>
  {
    T this[int index] { set; }
  }
  /// <summary>Represents a write-only indexable list class.</summary>
  public interface ISinkList<in T> : ISinkCollection<T>, ISinkArray<T>
  {
  }
These interfaces are not often useful, but sometimes people actually do write code that only modifies a collection, without reading it. One useful property of these interfaces is contravariance: if a method needs an ISinkCollection<Foo> for example, you can pass it a ISinkCollection<object> instead.

Here's an interface that models built-in arrays:
  public interface IArray<T> : IListSource<T>, ISinkArray<T>
  {
    new T this[int index] { get; set; }
    bool TrySet(int index, T value);
  }
  public interface IAutoSizeArray<T> : IArray<T>
  {
    void Optimize();
  }
But of course, since T[] does not implement IArray, it's not very useful and is only included for completeness. The derived interface is a little more interesting: it represents a list whose size increases automatically when you write to an index beyond the end of the array. If supported, the Optimize() method optimizes the data structure to consume less storage space (if not supported, it has no effect.)

Next, here are some interfaces that add or remove items "in bulk":
  public interface IAddRange<T> : ICount
  {
    void AddRange(IEnumerable<T> e);
    void AddRange(IReadOnlyCollection<T> s);
  }
  public interface IListRangeMethods<T> : IAddRange<T>
  {
    void InsertRange(int index, IEnumerable<T> e);
    void InsertRange(int index, IReadOnlyCollection<T> s);
    void RemoveRange(int index, int amount);
  }
  public static partial class LCInterfaces
  {
    public static void Resize<T>(
      this IListRangeMethods<T> list, int newSize);
  }
It's often more efficient to use a bulk insertion or deletion method rather than calling Insert or RemoveAt in a loop.

Next up, here are some interfaces that are used to work around the dreaded ambiguity errors that are caused by the fact that ICollection<T> is officially unrelated to IReadOnlyCollection<T>, and the fact that IList<T> is officially unrelated to IReadOnlyList<T>:
  public interface ICollectionAndReadOnly<T> 
    : ICollection<T>, IReadOnlyCollection<T> { }
  public interface IListAndListSource<T> : IList<T>, 
    IListSource<T>, ICollectionAndReadOnly<T>  { }
Any collection that implements both ICollection<T> and IReadOnlyCollection<T> should implement ICollectionAndReadOnly<T>, and any collection that implements both IList<T> and IReadOnlyList<T> should go the extra mile and implement IListAndListSource<T>.

By itself, these interfaces don't solve any problems. To recap, if you write these two methods:
  void Foo(IReadOnlyList<T> list) {...}
  void Foo(IList<T> list) { Foo(list.AsListSource()); }
The C# compiler will give an "ambiguity" error when you try to call Foo(list), if the list implements both IList<T> and IReadOnlyList<T>. To work around this problem, it is necessary to define a third method that takes IListAndListSource<T>:
  void Foo(IReadOnlyList<T> list) {...}
  void Foo(IList<T> list) { Foo(list.AsListSource()); }
  void Foo(IListAndListSource<T> list) { Foo((IReadOnlyList<T>) list); }
This workaround only works if the list class implements IListAndListSource<T>, so all mutable collections in Loyc.Essentials and Loyc.Collections do so.

Next, here are some "enhanced" collection interfaces, which just add extra functionality to the standard interfaces. ICollectionEx combines ICollection with AddRange(list) and RemoveAll(lambda), while IListEx combines IList with ICollectionEx, InsertRange(index, list) and RemoveRange(index, count):
  public interface ICollectionEx<T> : ICollectionAndReadOnly<T>, 
      ISinkCollection<T>, IAddRange<T>, IIsEmpty { }
  {
    int RemoveAll(Predicate<T> match);
  }
  public interface IListEx<T> : IListAndListSource<T>, 
    ICollectionEx<T>, IArray<T>, IListRangeMethods<T> { }
Binumerators are enumerators that can move backward through a collection, not just forward. The IBinumerable interface can be implemented by a collection that supports binumerators. Mutable enumerators additionally allow elements of the collection to be modified during enumeration. Mutable enumerators optionally support removing the current element.
public interface IBinumerator<T> : IEnumerator<T>
{
  bool MovePrev();
}

interface IMEnumerator<T> : IEnumerator<T>
{
  /// <summary>Gets or sets the value of the current item.</summary>
  new T Current { get; set; }
  /// <summary>Removes the current item and moves to the next one. Remember
  /// NOT to call MoveNext() immediately after Remove().</summary>
  /// <returns>True if there is a next item after this one, 
  /// false if the removed item was the last one.</returns>
  /// <exception cref="NotSupportedException">The collection does not permit
  /// this operation.</exception>
  bool Remove();
}

interface IMBinumerator<T> : IBinumerator<T>, IMEnumerator<T>
{
}

interface IBinumerable<T>
{
  /// <summary>Returns a binumerator that points before the beginning of 
  /// the current collection.</summary>
  IBinumerator<T> Begin();
  /// <summary>Returns a binumerator that points after the end of the 
  /// current collection.</summary>
  IBinumerator<T> End();
}
Finally, here are the range interfaces, which were inspired by the D programming language. These were described in a much earlier article but I tweaked the interfaces after that article was written. You can see the current documentation in the source code; I won't say anything more about these interfaces here, since this article is getting pretty long already.
  public interface IFRange<out T> : IEnumerable<T>,
    ICloneable<IFRange<T>>, IIsEmpty
  {
    T Front { get; }
    T PopFront(out bool fail);
  }
  public interface IMFRange<T> : IFRange<T>
  {
    new T Front { get; set; }
  }
  public interface IBRange<out T> : IFRange<T>,
    ICloneable<IBRange<T>>
  {
    T Back { get; }
    T PopBack(out bool fail);
  }
  public interface IMBRange<T> : IBRange<T>, IMFRange<T>
  {
    new T Back { get; set; }
  }
  public interface IRange<out T> : IBRange<T>,
    IListSource<T>, ICloneable<IRange<T>>
  {
  }
  public interface IBRangeEx<R, T> : IBRange<T> 
    where R : IBRangeEx<R, T>, ICloneable<R>
  {
    IEnumerable<T> InnerList { get; }
    int SliceStart { get; }
    R Intersect(R other);
    R Union(R other);
  }
  public interface IBRangeEx<T> : IBRangeEx<IBRangeEx<T>, T>, 
    ICloneable<IBRangeEx<T>>
  {
  }
  public interface IRangeEx<R, T> : IRange<T>, 
    IBRangeEx<R, T> where R : IRangeEx<R, T>, ICloneable<R>
  {
  }
  public interface IRangeEx<T> : IRangeEx<IRangeEx<T>, T>, 
    ICloneable<IRangeEx<T>>
  {
  }
  public interface IMRange<T> : IMBRange<T>, IRange<T>
  {
    new T this[int index] { get; set; }
  }
That's it! That's all the general-purpose collection interfaces.

Performance: one flaw I didn't fix

It bothers me that the IEnumerator<T> interface requires two interface calls to retrieve each element of a collection: MoveNext() and Current. That's just a pointless waste of clock cycles. In the past, I defined an interface that fixes that problem by moving to the next item and returning it with a single method call. It was called Iterator<T> and it was a delegate (alternate possible design: a derived interface of IEnumerator<T>):
  public delegate T Iterator<out T>(ref bool fail);
Note that I really would have preferred to use the signature bool Iterator<T>(out T value) instead, but .NET this second version of the method is illegal. Why? Unfortunately, the CLR actually doesn't have the concept of an "out" parameter; to the CLR, "out" and "ref" are the same thing. Therefore, type T would have to be invariant, <out T> would not be allowed. The latter signature can still be provided as an extension method, at the cost of some performance.

I actually wrote a whole implementation of LINQ for Iterator and IIterable<T> (the counterpart to IEnumerable<T>), but then I decided to abandon the concept before I got around to benchmarking the LINQ implementation (early microbenchmarks showed that IIterable was modestly faster when called directly, but not when using the extension method).

Ultimately I decided that I didn't want to burden potential implementors of collection classes by including IIterable<T> in the interface hierarchy. that is, it seemed natural to derive other interfaces such as IListSource<T> from IIterable<, but this placed a substantial burden on my collection class implementations, and might have confused end-users. Obviously, if you're writing a collection class, you want to support things like foreach loops and traditional LINQ, so you will definitely implement IEnumerable<T>; IIterable<T> can only be a bonus, not the central mechanism of enumeration. Then you have the dillema of whether to optimize your code for IEnumerable<T> or IIterable<T>, plus there's the fact that C#'s iterator feature supports only IEnumerator, not Iterator. So I found that the implementation costs for Iterator tended to be high, while few people would appreciate or take advantage of the performance enhancement.

For certain applications that need both high performance and flexibility, another technique for avoiding interface calls makes sense: returning data in groups, such as arrays. Consider this method of ICollection<T> that is almost never used:
  void CopyTo(T[] array, int arrayIndex);
Imagine that method was gone and replaced with this method in IList:
  // copy this[startIndex..startIndex+count] to array[arrayIndex..arrayIndex+count]
  void CopySlice(T[] array, int arrayIndex, int startIndex, int count);
Now you would have a technique for optimizing access when you need it. Rather than requesting the entire list as an array, which requires unbounded memory, you request a section. How can this be used to optimize a program? Well, you can read the list in small blocks, e.g., 50 elements at a time. Then you can use a for-loop like this one:
  for (int i = 0; i < array.Length; i++) {
    /* do something with array[i] */
  }
which avoids the cost of (1) interface calls and (2) bounds checking (the JIT eliminates the bounds check in a loop that is structured this way), in exchange for the need to allocate a small temporary array (which is usually cheap). Obviously, this is not a convenient way to access a list; again, this would only be for the (rare?) situation that calls for high performance, but still requires the flexibility of interfaces (so you can swap in multiple implementations).

For Loyc, I decided to define a specialized interface for lexers:
public interface ICharSource : IListSource<char>
{
  /// <summary>
  /// Returns a substring from the character source. If some of the
  /// requested characters are past the end of the stream, the string
  /// is truncated to the available number of characters.
  /// </summary>
  /// <param name="startIndex">Index of first character to return.
  /// If startIndex >= Count, an empty string is returned.</param>
  /// <param name="length">Number of characters desired.</param>
  new StringSlice Slice(int startIndex, int length);
}
This interface was created to read characters efficiently. Although a lexer could read characters one-at-a-time from IReadOnlyList<char> or IListSource<char>, it requires at least one dynamic interface dispatch for every character (multiple, if the lexer reads the same character more than once). On the other hand, if lexers avoid this overhead by requiring the entire file in the form of a string, it becomes necessary to hold the entire file in memory at once, in a very specific format (one huge string), which may not be ideal.

StringSlice is a structure in Loyc.Essentials that represents a slice of a string (a tuple of (string str, int startIndex, int length)). Slice() is a compromise that allows the lexer to request small pieces of the file that it can read without dynamic dispatch. If the file happens to be stored as a string, Slice() can simply return a slice of that string (no memory allocation required), otherwise it will build a string and return that.

It's unfortunate that .NET treats strings as something completely different than arrays. In a perfect world (or in D), an interface like this one could support not just substrings, but read-only subarrays of any element type, which would be useful any time you want to optimize your code by reducing dynamic dispatch.

By the way: we're doing it wrong.

Defining zillions of interfaces isn't the best solution.

As I said, the possible combinations of operations that data structures may or may not support is just about endless. Some may support adding items but not removing; others may support removing but not adding; some lists may be read-only but have a specialized, fast IndexOf() method (recall that IListSource<T> does not include IndexOf()). It would be crazy to define a separate interface, and invent a separate name, for every possible combination of operations. Inevitably, there are some data structures for which Loyc.Essentials will offer no appropriate interface. But in .NET, that's inevitable.

My favorite thing about the Go language (which, admittedly, I haven't actually used) is the way they do interfaces. In Go, a data structure does not have to explicitly declare in advance every interface it satisfies. Instead, a data structure can satisfy any interface defined by a third party if it provides the necessary operations. If .NET worked like this, then it wouldn't matter that List<T> does not say it implements IReadOnlyList<T>; you would still be allowed to pass it to a method that takes IReadOnlyList<T>.

So, suppose I define an indexable data structure that supports add-at-the-end, but no other modifications. Then I could define an interface that exactly matches my data structure:
interface IAppendableList<T> : IReadOnlyList<T> { void Add(T item); }

class GuestBook : IAppendableList<GuestEntry> { ... }
Later, someone else who is using MyDataStructure<T> in their code could decide to declare a method that accepts IAppendableList<T>:
void Foo(IAppendableList<T> x) { ... }
We can tell that Foo() will probably add items to x, but won't remove items. If .NET supported Go's ability to adapt to new interfaces with almost no extra run-time cost, it would not only accept an argument of type GuestBook, but it would also accept any of the standard collection classes like List<T>. Nice. This feature could neatly fix flaws 2 and 3 above, even if Microsoft itself never defined any read-only or specialized interfaces.

IMO, this is how it should be, and I would point out that by not supporting this feature, it is not practical for .NET as a platform to support languages like Go that do support it (so much for being a "common language runtime"). Now, I assume we can't add this feature directly to C# because it would break existing code, but the feature could still be supported in .NET itself, if Microsoft had the willpower.

Some people argue against this feature, saying that an "interface" is more than just a set of methods but a contract for how those methods behave. And in Loyc.Essentials itself you'll see two interfaces that reflect that idea:
public interface IQueue<T> : IPush<T>, IPop<T>, ICount
{
}
public interface IStack<T> : IPush<T>, IPop<T>, ICount
{
}
The way I defined queues and stacks is that both of them have the same methods--pushing, popping, and Count--but because they behave quite differently, I defined two separate interfaces. So the argument against "Go interfaces" makes sense, but in my opinion their advantages clearly outweigh the disadvantages, and anyway the feature already exists in some languages, so if Microsoft truly wants a many-language platform, they must make features like this possible, and efficient.

Another argument in favor of Go interfaces is that they make it easier to satisfy dependencies. For example, I defined ICharSource inside Loyc.Essentials even though the interface was designed for lexers (which suggests Loyc.Syntax.dll would be a better place for it), because Loyc.Essentials contains StringSlice and I wanted StringSlice to implement ICharSource. Since Loyc.Syntax depends on Loyc.Essentials, the only way this can work is if ICharSource is part of Loyc.Essentials. Luckily I control both DLLs, otherwise StringSlice could not implement ICharSource.

In general, I find it to be a big nuisance that when two DLLs define identical interfaces, with the same name in the same namespace, the CLR considers them completely unrelated and incompatible, and there is no workaround. If two DLLs want to implement the same interface, they must always have a reference to a third DLL that contains the interface, and both DLLs must rely on the same version of that third DLL even if the interface in question is identical across different versions. The Go approach gets you out of versioning hell (at least it would in the .NET context. Whether Go itself has a similar issue, I have no idea). But I digress...

Conclusion.

Loyc.Essentials is a library of "stuff that should be in the BCL, but isn't." Although it was created for the Loyc project, it is meant as a general-purpose library for any developer's .NET toolbox. If there's a general-purpose interface that you think should be part of the BCL, write it in a comment and we'll discuss whether it should be added to Loyc.Essentials.dll.

Some of these interfaces are used heavily in Loyc, while others are not used at all. In future posts I'll introduce the other classes, structures, and handy methods of Loyc.Essentials.dll, and then I'll move on to Loyc.Collections.dll.

Edit: I missed an interesting interface: IEnumeratorFrame<Frame, T>, which goes with the NestedEnumerator<Frame, T> structure, and is used for enumerating tree-like data structures or manually-implemented coroutines. See the code if you're curious, although this interface would work best as part of a built-in compiler feature for stream flattening like Cω had.

Friday, January 31, 2014

Using Loyc.Essentials: introduction

Loyc.Essentials is a library (DLL) of general-purpose code that supplements the .NET BCL (standard libraries). I hope that one day it will serve as a starting point or inspiration for a multi-language standard library, which will be a standard library that you can use in a wide variety of programming languages (including outside the .NET ecosystem). Currently, however, Loyc.Essentials is designed specifically for .NET, and it is designed to supplement, not replace, the core BCL classes. The Loyc libraries contain only "safe", verifiable code.

Loyc.Essentials contains the following categories of stuff:
  • Collection stuff: interfaces, adapters, helper classes, base classes, extension methods, and implementations for simple "core" collections.
  • Geometry: simple generic geometric interfaces and classes: e.g. IPoint<T>, IPoint3<T>, Vector<T>, Vector3<T>, IRectangle<T>, LineSegment<T>, BoundingBox<T>
  • Math: generic math interfaces that allow arithmetic to be performed in generic code, based on math interfaces originally designed by Rüdiger Klaehn. Also includes fixed-point types, 128-bit integer math, and handy extra math functions in MathEx.
  • Other utilities: message sinks (IMessageSink), object tags (ITags, HashTags), ICloneable<T>, interface adapter generation, Localization (Localize class), Symbol, threading stuff, a miniture clone of NUnit (MiniTest, RunTests), a pluggable internationalization mechanism (Localize class), EzStopwatch, WeakReference<T>, extra general-purpose exception types derived from InvalidOperationException (e.g. InvalidStateException, ReadOnlyException) and miscellaneous low-level functionality including more extension methods.
  • Compatibility: a very small amount of .NET 4.5 stuff, backported to .NET 4.0 (I'm thinking about using a separate DLL for this, maybe using Theraot's libraries)
In the coming days I plan to write a series of blog posts about this and other Loyc libraries.

Tuesday, December 17, 2013

Bogus ambiguity warnings in LLLPG

Sometimes LLLPG gives me a warning that I just can't fathom. I'm currently writing the parser for Enhanced C#, and I got a very puzzling warning:
using TT = TokenType;

partial class EcsParser {
[DefaultK(3)]
LLLPG parser(laType(TokenType), matchType(int), setType(HashSet!int))
{
   alias("(" = TT.LParen);
   alias(")" = TT.RParen);
   alias("[" = TT.LBrack);
   alias("]" = TT.RBrack);
   alias("." = TT.Dot);
   alias("!" = TT.Not);
   alias("<" = TT.LT);
   alias(">" = TT.GT);
   alias("," = TT.Comma);

   // Complex identifier, e.g. accepts x.y or x but rejects x
   token ComplexId() @[
      IdAtom 
      RestOfId()
   ];
   rule IdAtom @[
      (TT.Id|TT.TypeKeyword)
   ];
   rule RestOfId() @[
      // Warning: Optional branch is ambiguous for input such as «TT.Dot TT.Id _»
      // ((TT.Dot|TT.Not|TT.LT), (TT.Id|TT.TypeKeyword|TT.LParen|TT.LBrack|TT.GT), ~()) 
      TParams()?
      DotRestOfId()?
   ];
   rule DotRestOfId() @[
      "." IdAtom
      RestOfId()
   ];
   rule TParams() @[ // type parameters
      ( "<" (ComplexId ("," ComplexId)*)? ">"
      // Also allow Nemerle-style type parameters (e.g. Dictionary.[int,string])
      | "." "[" "]"
      // D-style type parameters (Dictionary!(int, string))
      //| "!" "(" ")"
      )
   ];
}};
I actually started with a 600-line parser and slowly reduced it to this minimal test case. In fact the grammar is unambiguous, so why does this warning occur? (and by the way, LLLPG's example case varies depending on small details of the grammar; at first it always said «TT.LT TT.GT _» but would give some other example if the aliases were removed.) If you can't figure it out, I don't blame you; I actually attached a debugger to see LLLPG's internal state at the time of the warning, and I couldn't figure it out either.

The generated code is wrong for RestOfId(), but correct for the other rules. How is it wrong?

Notice that the code for RestOfId(), as written, accepts ".>" or "<[" as a way of starting TParams(); this isn't really why the rule is wrong though, because ".>" or "<[" are invalid inputs, so it isn't super important which path is taken for those cases. This illustrates what LLLPG is "thinking" though: it has "mixed together" the two arms of TParams, "<" (ComplexId ("," ComplexId)*)? ">" and "." "[" "]". If it sees "<" (from arm 1 of TParams) and then "[" (from arm 2 of TParams), it will call TParams.
void RestOfId()
{
   TokenType la0, la1;
   la0 = LA0;
   if (la0 == TT.Dot || la0 == TT.LT) {
      switch (LA(1)) {
      case TT.LBrack:
      case TT.GT:
      case TT.Id:
      case TT.TypeKeyword:
         TParams();
         break;
      }
   }
   la0 = LA0;
   if (la0 == TT.Dot) {
      la1 = LA(1);
      if (la1 == TT.Id || la1 == TT.TypeKeyword)
         DotRestOfId();
   }
}
The real problem happens when it sees "." (from arm 2 of TParams) followed by TT.Id (from arm 1 of TParams). In that case RestOfId() should skip TParams() and call DotRestOfId instead. But no, it calls TParams.

This explains why the warning occurs. If, inside RestOfId, LLLPG does not separate the logic for the two arms of TParams, LLLPG gets confused and thinks that "." followed by an identifier is a valid input for TParams as well as DotRestOfId. Therefore it can't decide which rule to call, prints a warning, and (because it is greedy by default) calls TParams.

This is exactly the same sort of problem that you used to get in ANTLR 2 with ANTLR's linear approximate lookahead. Although Terrance Parr (author of ANTLR) clearly documented how it works and what it means, users (even me!) find its behavior unexpected and confusing.

By default, LLLPG's lookahead is more powerful than linear approximate lookahead; the example given by Terrance, (A B | C D) | A D, is no problem for LLLPG. I could go into some detail about how my lookahead differs from ANTLR 2, but I think most readers just want to know "how do I fix this" and the answer is simple: add the [FullLLk] attribute, either just to the rule that is giving you trouble, or the entire grammar (the "LLLPG" statement):
partial class EcsParser {
[DefaultK(3)]
LLLPG parser(laType(TokenType), matchType(int), setType(HashSet!int))
{
   ...
   [FullLLk] // eliminates the bogus warning, and fixes the generated code
   rule RestOfId() @[
      TParams()?
      DotRestOfId()?
   ];
   ...
}};
The FullLLk option enables "true" LL(k) prediction, and forces LLLPG to analyze sub-decisions independently. So the decision of whether to call TParams requires not one, but two separate sequences of tests, for the two branches inside TParams:
void RestOfId()
{
   TokenType la0, la1, la2;
   la0 = LA0;
   if (la0 == TT.LT) { // branch 1 of TParams
      la1 = LA(1);
      if (la1 == TT.Id || la1 == TT.TypeKeyword) {
         switch (LA(2)) {
         case TT.GT:
         case TT.Comma:
         case TT.Dot:
         case TT.LT:
            TParams();
            break;
         }
      } else if (la1 == TT.GT)
         TParams();
   } else if (la0 == TT.Dot) { // branch 2 of TParams
      la1 = LA(1);
      if (la1 == TT.LBrack) {
         la2 = LA(2);
         if (la2 == TT.RBrack)
            TParams();
      }
   }
   la0 = LA0;
   if (la0 == TT.Dot) {
      la1 = LA(1);
      if (la1 == TT.Id || la1 == TT.TypeKeyword)
         DotRestOfId();
   }
}
Full LL(k) mode doesn't always work perfectly, and may make LLLPG run slower, which is why it is not enabled by default. But usually, it works fine. If you still get the same ambiguity warning after enabling Full LL(k), check over your grammar carefully, because it is probably genuinely ambiguous.

Update: By the way, I later noticed that ComplexId, RestOfId and DotRestOfId can be combined into a single rule:
[FullLLk]
token ComplexId() @[
    IdAtom
    ( "::" IdAtom )? // omitted from simplified grammar
    TParams()?
    (   "." IdAtom
        TParams()?
    )*
];

Saturday, November 23, 2013

The Loyc LL(k) Parser Generator: part 2

This article is Published on CodeProject.

This article assumes you've read part 1. LLLPG still requires you to write your code in LES rather than C#, but LES is a little friendlier now that I've written a basic syntax-highlighting extension for Visual Studio 2013. Plus, I wrote a single-file generator, so it's more convenient to use LLLPG in your Visual Studio projects. More on that later.

LLLPG-0.91.zip contains LLLPG 0.91, the single-file generator (run LLLPG\LllpgForVisualStudio.exe to install) and the "Calculator" sample project from the first article. The sample project is unchanged, except that it uses the single-file generator to generate code. The sample project can be opened in VS2010 (any edition), and the single-file generator supports any Visual Studio from 2008 to 2013, including Express editions. I just noticed that some machines (and not others), require LllpgForVisualStudio.exe to be run in Administrator mode (in Windows 7, right-click LllpgForVisualStudio.exe and click "Run as administrator"). Otherwise the "Register" button gives an error about permissions.

LesSyntaxHighlightingForVS2013.vsix is the syntax highlighter. Note that although it offers to install in VS 2013 Express, Microsoft won't allow it to work in Express. It also offers to intall in VS 2012 just in case it might work, but I don't think it will.

Table of contents for today:
  1. Do you really need a parser generator?
  2. Parsing terminology: if you already know about stuff like terminals and nullability, skip this.
  3. LL(k) versus the competition: in part one you got a sense of what LL(k) means (choosing a path through a grammar based only on k tokens of lookahead, k >= 1). This time I will compare LL(k) with alternatives such as LALR(1) and PEGs. If you already know or don't care about alternative parsing models, skip this.
  4. Learning LLLPG, with numbers: this section introduces syntactic predicates, semantic predicates, gates, underscore and tilde, and the [k] attribute.
  5. Wrapping Up.

Do you really need a parser generator?

One of the most common introductory examples for any parser generator is an expression parser or calculator, like this calculator bundled with the previous article:
LLLPG parser(laType(int))
{
 rule Atom()::double @[
  { result::double; }
  ( t:=id           { result = Vars[t.Value -> string]; }
  | t:=num          { result = t.Value -> double; } 
  | '-' result=Atom { result = -result; }
  | '(' result=Expr ')'
  | error           { result = 0; Error(InputPosition, "Expected identifer, number, or (stuff)"); }
  )
  { return result; }
 ];
 rule MulExpr()::double @[ 
  result:=Atom
  (op:=(mul|div) rhs:=Atom { result = Do(result, op, rhs); })*
  { return result; }
 ];
 rule AddExpr()::double @[
  result:=MulExpr
  (op:=(add|sub) rhs:=MulExpr { result = Do(result, op, rhs); })*
  { return result; }
 ];
 rule Expr()::double @[
  { result::double; }
  ( t:=id set result=Expr { Vars[t.Value.ToString()] = result; }
  | result=AddExpr )
  { return result; }
 ];
};

def Do(left::double, op::Token, right::double)::double
{
 switch op.Type {
  case add; return left + right;
  case sub; return left - right;
  case mul; return left * right;
  case div; return left / right;
 };
 return double.NaN; // unreachable
};
But if expression parsing is all you need, you don't really need a parser generator; there are simpler options for parsing, such as using a Pratt Parser like this one. If you only need to parse simple text fields like phone numbers, you can use regular expressions. And even if you need an entire programming language, you don't necessarily need to create your own; for example you could re-use the LES parser included in Loyc.Syntax.dll, which comes with LLLPG (it's not 100% complete yet, but it's usable enough for LLLPG), and if you don't re-use the parser you might still want to re-use the lexer.

So before you go writing a parser, especially if it's for something important rather than "for fun", seriously consider whether an existing parser would be good enough. Let me know if you have any questions about LES, Loyc trees, or using the Loyc libraries (Loyc.Essentials.dll, Loyc.Collections.dll, and so on).

Parsing terminology

I'll start with a short glossary of standard parsing terminology.

First of all, grammars consist of terminals and nonterminals.

A terminal is an item from the input; when you are defining a lexer, a terminal is a single character, and when you are defining a parser, a terminal is a token from the lexer. More specifically, the grammar is concerned only with the type of the token, not its value. For example one of your token types might be Number, and a parser cannot treat a particular number specially; so if you ever need to treat the number "0" differently than other numbers, then your lexer would have to create a special token type for that number (e.g. Zero), because the grammar cannot make decisions based on a token's value.

A nonterminal is a rule in the grammar. So in this grammar:
 token Spaces @[ (' '|'\t')+ ];
 token Id @[
  ('a'..'z'|'A'..'Z'|'_')
  ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*
 ];
 token Int @[ '0'..'9'+ ];
 token Token @[ Spaces | Id | Int ];
The nonterminals are Spaces, Id, Int and Token, while the terminals are inputs like 's', '\t', '9', and so forth.

Traditional literature about parsing assumes that there is a single "Start Rule" that reprsents the entire language. For example, if you want to parse C#, then the start rule would say something like "zero or more using statements, followed by zero or more namespaces and/or classes":
  rule Start @[ UsingStmt* TopLevelDecl* ];
  rule TopLevelDecl @[ NamespaceStmt | TypeDecl ];
  ...
However, LLLPG is more flexible than that. It doesn't limit you to one start rule; instead, LLLPG assumes you might start with any rule that isn't marked "private". After all, you might want to parse just a subset of the language, such as a method declaration, a single statement, or a single expression.

LLLPG rules express grammars in a somewhat non-standard way. Firstly, LLLPG notation is based on ANTLR notation, which itself is mildly different from EBNF (Extended Backus–Naur Form) notation which, in some circles, is considered more standard. Also, because LLLPG is embedded inside another programming language (LES or EC#), it uses the @[...] notation which means "token literal". A token literal is a sequence of tokens that is not interpreted by the host language; the host language merely figures out the boundaries of the "token literal" and saves the tokens so that LLLPG can use them later. So while a rule in EBNF might look like this:
 Atom = ['-'], (num | id);
the same rule in LLLPG-in-LES looks like this:
 rule Atom @[ ('-')? (num | id) ];
In EBNF, [foo] means foo is optional, while LLLPG uses foo? to express the same idea. Some people use {foo} to represent a list of zero or more foos, while LLLPG uses foo* for the same thing. On the plus side, alternatives and grouping work the same way in LLLPG and EBNF (e.g. (a | b)), although LLLPG also has the notation (a / b) which I'll explain later.

In addition, some grammar representations do not allow loops, or even optional items. For example, formal "four-tuple" grammars are defined this way. I discuss this further in my blog post, Grammars: theory vs practice.

A language is a different concept than a grammar. A grammar represents some kind of language, but generally there are many possible grammars that could represent the same language. The word "language" refers to the set of sentences that are considered valid; two different grammars represent the same language if they accept or reject the same input (or, looking at the matter in reverse, if you can generate the same set of "sentences" from both grammars). For example, the following four rules all represent a list of digits:
 rule Digits1 @[ '0'..'9'+ ];
 rule Digits2 @[ '0'..'9'* '0'..'9' ];
 rule Digits3 @[ '0'..'9' | Digits3 '0'..'9' ];
 rule Digits4 @[ ('0'..'9'+)* ];
If we consider each rule to be a separate grammar, the four grammars all represent the same language (a list of digits). But the grammars are of different types: Digits1 is an LL(1) grammar, Digits2 is LL(2), Digits3 is LALR(1), and I don't know what the heck to call Digits4 (it's highly ambiguous, and weird). Since LLLPG is an LL(k) parser generator, it supports the first two grammars, but can't handle the other two; it will print warnings about "ambiguity" for Digits3 and Digits4, then generate code that doesn't work properly. Actually, while Digits4 is truly ambiguous, Digits3 is actually unambiguous. However, Digits3 is ambiguous "in LL(k)", meaning that it is ambiguous from the LL(k) perspective, i.e. from LLLPG's perspective.

The word "nullable" means "can match nothing". For example, @[ '0'..'9'* ] is nullable because it successfully "matches" an input like "hello, world" by doing nothing; but @[ '0'..'9'+ ] is not nullable, and will only match something that starts with at least one digit.

LL(k) versus the competition

LLLPG is in the LL(k) family of parser generators. It is suitable for writing both lexers (also known as tokenizers or scanners) and parsers, but not for writing one-stage parsers that combine lexing and parsing into one step. It is more powerful than LL(1) parser generators such as Coco/R.

LL(k) parsers, both generated and hand-written, are very popular. Personally, I like the LL(k) class of parsers because I find them intuitive, and they are intuitive because when you write a parser by hand, it is natural to end up with a mostly-LL(k) parser. But there are two other popular families of parser generators for computer languages, based on LALR(1) and PEGs:
  • LALR(1) parsers are simplified LR(1) parsers which--well, it would take a lot of space to explain what LALR(1) is. Suffice it to say that LALR(1) parsers support left-recursion and use table-based lookups that are impractical to write by hand, i.e. a parser generator or similar infrastructure is always required, unlike LL(k) grammars which are straightforward to write by hand and therefore straightforward to understand. LALR(1) supports neither a superset nor a subset of LL(k) grammars; While I'm familiar with LALR(1), I admittedly don't know anything about its more general cousin LR(k), and I am not aware of any parser generators for LR(k) or projects using LR(k). Regardless of merit, LR(k) for k>1 just isn't very popular. I believe the main advantage of the LR family of parsers is that LR parsers support left- recursion while LL parsers do not (more on that later).
  • PEGs are recursive-descent parsers with syntactic predicates, so PEG grammars are potentially very similar to grammars that you would use with LLLPG. However, PEG parsers traditionally do not use prediction (although prediction could probably be added as an optimization), which means that they don't try to figure out in advance what the input means; for example, if the input is "42", a PEG parser does not look at the '4' and decide "oh, that looks like a number, so I'll call the number sub-parser". Instead, a PEG parser simply "tries out" each option starting with the first (is it spaces? is it an identifier?), until one of them successfully parses the input. Without a prediction step, PEG parsers apparently require memoization for efficiency (that is, a memory of failed and successful matches at each input character, to avoid repeating the same work over and over in different contexts). PEGs typically combine lexing (tokenization) and parsing into a single grammar (although it is not required), while other kinds of parsers separate lexing and parsing into independent stages.
Of course, I should also mention regular expressions, which probably the most popular parsing tool of all. However, while you can use regular expressions for simple parsing tasks, they are worthless for "full-scale" parsing, such as parsing an entire source file. The reason for this is that regular expressions do not support recursion; for example, the following rule is impossible to represent with a regular expression:
 rule PairsOfParens @[ '(' PairsOfParens? ')' ]; 
Because of this limitation, I don't think of regexes as a "serious" parsing tool.

Other kinds of parser generators also exist, but are less popular. Note that I'm only talking about parsers for computer languages; Natural Language Processing (e.g. to parse English) typically relies on different kinds of parsers that can handle ambiguity in more flexible ways. LLLPG is not really suitable for NLP.

As I was saying, the main difference between LL(k) and its closest cousin, the PEG, is that LL(k) parsers use prediction and LL(k) grammars usually suffer from ambiguity, while PEGs do not use prediction and the definition of PEGs pretends that ambiguity does not exist because it has a clearly-defined system of prioritization.

"Prediction" means figuring out which branch to take before it is taken. In a "plain" LL(k) parser (without and-predicates), the parser makes a decision and "never looks back". For example, when parsing the following LL(1) grammar:
public rule Tokens @[ Token* ];
public rule Token  @[ Float | Id | ' ' ];
token Float        @[ '0'..'9'* '.' '0'..'9'+ ];
token Id           @[ IdStart IdCont* ];
rule  IdStart      @[ 'a'..'z' | 'A'..'Z' | '_' ];
rule  IdCont       @[ IdStart | '0'..'9' ];
The Token method will get the next input character (known as LA0 or lookahead zero), check if it is a digit or '.', and call Float if so or Id (or consume a space) otherwise. If the input is something like "42", which does not match the definition of Float, the problem will be detected by the Float method, not by Token, and the parser cannot back up and try something else. If you add a new Int rule:
...
public rule Token @[ Float | Int | Id ];
token Float       @[ '0'..'9'* '.' '0'..'9'+ ];
token Int         @[ '0'..'9'+ ];
token Id          @[ IdStart IdCont* ];
...
Now you have a problem, because the parser potentially requires infinite lookahead to distinguish between Float and Int. By default, LLLPG uses LL(2), meaning it allows at most two characters of lookahead. With two characters of lookahead, it is possible to tell that input like "1.5" is Float, but it is not possible to tell whether "42" is a Float or an Int without looking at the third character. Thus, this grammar is ambiguous in LL(2), even though it is unambiguous when you have infinite lookahead. The parser will handle single-digit integers fine, but given a two-digit integer it will call Float and then produce an error because the expected '.' was missing.

A PEG parser does not have this problem; it will "try out" Float first and if that fails, the parser backs up and tries Int next. There's a performance tradeoff, though, as the input will be scanned twice for the two rules.

Although LLLPG is designed to parse LL(k) grammars, it handles ambiguity similarly to a PEG: if A|B is ambiguous, the parser will choose A by default because it came first, but it will also warn you about the ambiguity.

Since the number of leading digits is unlimited, LLLPG will consider this grammar ambiguous no matter how high your maximum lookahead k (as in LL(k)) is. You can resolve the conflict by combining Float and Int into a single rule:
public rule Tokens @[ Token* ];
public rule Token  @[ Number | Id ];
token Number       @[ '.' '0'..'9'+
                    | '0'..'9'+ ('.' '0'..'9'+)? ];
token Id           @[ IdStart IdCont* ];
...
Unfortunately, it's a little tricky sometimes to merge rules correctly. In this case, the problem is that Int always starts with a digit but Float does not. My solution here was to separate out the case of "no leading digits" into a separate "alternative" from the "has leading digits" case. There are a few other solutions you could use, which I'll discuss later in this article.

I mentioned that PEGs can combine lexing and parsing in a single grammar because they effectively support unlimited lookahead. To demonstrate why LL(k) parsers usually can't combine lexing and parsing, imagine that you want to parse a program that supports variable assignments like "x = 0" and function calls like x(0), something like this:
// "Id" means identifier, LParen means left parenthesis '(', etc.
rule Expr    @[ Assign | Call | ... ];
rule Assign  @[ Id Equals Expr ];
rule Call    @[ Id LParen ArgList ];
rule ArgList ...
...
If the input is received in the form of tokens, then this grammar only requires LL(2): the Expr parser just has to look at the second token to find out whether it is Equals ('=') or LParen ('(') to decide whether to call Assign or Call. However, if the input is received in the form of characters, no amount of lookahead is enough! The input could be something like
this_name_is_31_characters_long = 42;
To parse this directly from characters, 33 characters of lookahead would be required (LL(33)), and of course, in principle, there is no limit to the amount of lookahead. Besides, LLLPG is designed for small amounts of lookahead like LL(2) or maybe LL(4); a double-digit value is almost always a mistake. LL(33) could produce a ridiculously large and inefficient parser (I'm too afraid to even try it.)

In summary, LL(k) parsers are not as flexible as PEG parsers, because they are normally limited to k characters or tokens of lookahead, and k is usually small. PEGs, in contrast, can always "back up" and try another alternative when parsing fails. LLLPG makes up for this problem with "syntactic predicates", which allow unlimited lookahead, but you must insert them yourself, so there is slightly more work involved and you have to pay some attention to the lookahead issue. In exchange for this extra effort, though, your parsers are likely to have better performance, because you are explicitly aware of when you are doing something expensive (large lookahead). I say "likely" because I haven't been able to find any benchmarks comparing LL(k) parsers to PEG parsers, but I've heard rumors that PEGs are slower, and intuitively it seems to me that the memoization and retrying required by PEGs must have some cost, it can't be free. Prediction is not free either, but since lookahead has a strict limit, the costs usually don't get very high. Please note, however, that using syntactic predicates excessively could certainly cause an LLLPG parser to be slower than a PEG parser, especially given that LLLPG does not use memoization.

Let's talk briefly about LL(k) versus LALR(1). Sadly, my memory of LALR(1) has faded since university, and while Googling around, I didn't find any explanations of LALR(1) that I found really satisfying. Basically, LALR(1) is neither "better" nor "worse" than LL(k), it's just different. As wikipedia says:
The LALR(k) parsers are incomparable with LL(k) parsers – for any j and k both greater than 0, there are LALR(j) grammars that are not LL(k) grammars and conversely. In fact, it is undecidable whether a given LL(1) grammar is LALR(k) for any k >= 0.
But I have a sense that most LALR parser generators in widespread use support only LALR(1), not LALR(k); thus, roughly speaking, LLLPG is more powerful than an LALR(1) parser generator because it can use multiple lookahead tokens. However, since LALR(k) and LL(k) are incompatible, you would have to alter an LALR(1) grammar to work in an LL(k) parser generator and vice versa.

Here are some key points about the three classes of parser generators.

LL(k):
  • Straightforward to learn: just look at the generated code, you can see how it works directly.
  • Straightforward to add custom behavior with action code ({...}).
  • Straightforward default error handling. When unexpected input arrives, an LL(k) parser can always report what input it was expecting at the point of failure.
  • Predictably good performance. Performance depends on how many lookahead characters are actually needed, not by the maximum k specified in the grammar definition. Performance of LL(k) parsers can be harmed by nesting rules too deeply, since each rule requires a method call (and often a prediction step); specifically, expression grammars often have this problem. But in LLLPG you can use a trick to handle large expression grammars without deep nesting (I suppose I'll talk about this in a future article; my technique is demonstrated in my LES grammar.)
  • LL(k) parser generators, including LLLPG, may support valuable extra features outside the realm of LL(k) grammars, such as zero-width assertions (a.k.a. and-predicates), and multiple techniques for dealing with ambiguity.
  • No left recursion allowed (direct or indirect)
  • Strictly limited lookahead. It's easy to write grammars that require unlimited lookahead, so such grammars will have to be modified to work in LL(k). By the way, ANTLR overcomes this problem with techniques such as LL(*).
  • Low memory requirements.
  • Usually requires a separate parser and lexer.
LALR(1):
  • Challenging to learn if you can't find a good tutorial; the generated code may be impossible to follow
  • Supports left recursion (direct and indirect)
  • Some people feel that LALR(1) grammars are more elegant. For example, consider a rule "List" that represents a comma-separated list of numbers. The LALR(1) form of this rule would be num | List ',' num, while the LL(1) form of the rule is num (',' num)*. Which is better? You decide.
  • Low memory requirements.
  • Usually requires a separate parser and lexer.
  • LALR parser generators often have explicit mechanisms for specifying operator precedence.
PEGs:
  • New! (formalized in 2004)
  • Unlimited lookahead
  • "No ambiguity" (wink, wink) - or rather, any ambiguity is simply ignored, "first rule wins"
  • Supports zero-width assertions as a standard feature
  • Grammars are composable. It is easier to merge different PEG parsers than different LL/LR parsers.
  • Performance characteristics are not well-documented (as far as I've seen), but my intuition tells me to expect a naive memoization-based PEG parser generator to produce generally slow parsers. That said, I think all three parser types offer roughly O(N) performance to parse a file of size N.
  • High memory requirements (due to memoization).
  • It's non-obvious how to support "custom actions". Although a rule may parse successfully, its caller can fail (and often does), so I guess any work that a rule performs must be transactional, i.e. it must be possible to undo the action.
  • Supports unified grammars: a parser and lexer in a single grammar.
Regexes:
  • Very short, but often very cryptic, syntax.
  • Matches characters directly; not usable for token-based parsing.
  • Incapable of parsing languages of any significant size, because they do not support recursion or multiple "rules".
  • Most regex libraries have special shorthand directives like \b that require significantly more code to express when using a parser generator.
  • Regexes are traditionally interpreted, but may be compiled. Although regexes do a kind of parsing, regex engines are not called "parser generators" even if they generate code.
  • Regexes are closely related to DFAs and NFAs.
Let me know if I missed anything.

Learning LLLPG, with numbers

So you still want to use LLLPG, even knowing about the competition? Phew, what a relief! Okay, let's parse some numbers. Remember that code from earlier?
public rule Tokens @[ Token* ];
public rule Token  @[ Float | Id | ' ' ];
token Float        @[ '0'..'9'* '.' '0'..'9'+ ];
token Int          @[ '0'..'9'+ ];
token Id           @[ IdStart IdCont* ];
rule  IdStart      @[ 'a'..'z' | 'A'..'Z' | '_' ];
rule  IdCont       @[ IdStart | '0'..'9' ];
If you were hoping that I'd point out the difference between token and rule at this point... nope, I'm saving that for the next article. It's not important yet; just use token for defining token rules (like Float, Int and Id) and you'll be fine.

Now, since Float and Int are ambiguous in LL(k), here's how I suggested combining them into a single rule:
token Number       @[ '.' '0'..'9'+
                    | '0'..'9'+ ('.' '0'..'9'+)? ];
That's perhaps the best solution; other solutions have pitfalls, but the alternatives are still really handy for learning about LLLPG! First of all, one solution that looks simpler is this one:
token Number      @[ '0'..'9'* ('.' '0'..'9'+)? ];
But now you'll have a problem, because this matches an empty input, or it matches "hello" without consuming any input. Therefore, LLLPG will complain that Token is "nullable" and therefore must not be used in a loop (see the Token* loop?). After all, if you call Number in a loop and it doesn't match anything, you'll have an infinite loop which is very bad.

Zero-width assertions: syntactic predicates

You can actually prevent it from matching an empty input as follows:
token Number @[ &('0'..'9'|'.')
                '0'..'9'* ('.' '0'..'9'+)? ];
Here I have introduced the zero-width assertion or ZWA, also called an and-predicate because it uses the "and" sign (&). There are two kinds of and-predicates, which are called the "syntactic predicate" and the "semantic predicate". This one is a syntactic predicate, which means that it tests for syntax, which means it tests a grammar fragment starting at the current position. And since it's a zero-width assertion, it does not consuming any input (more precisely, it consumes input and then backtracks to the starting position, regardless of success or failure). and-predicates have two forms, the positive form & which checks that a condition is true, and the negative form &! which checks that a condition is false.

So, &('0'..'9'|'.') means that the number must start with '0'..'9' or '.'. Now Number() cannot possibly match an empty input. Unfortunately, LLLPG is not smart enough to know that it cannot match an empty input; it does not currently analyze and-predicates at all (it merely runs them), so it doesn't understand the effect caused by &('0'..'9'|'.'). Consequently it will still complain that Token is nullable even though it isn't. Hopefully this will be fixed in a future version, when I or some other smart cookie has time to figure out how to perform the analysis.

A syntactic predicate causes two methods to be generated to handle it:
private bool Try_Number_Test0(int lookaheadAmt)
{
  using (new SavePosition(this, lookaheadAmt))
    return Number_Test0();
}
private bool Number_Test0()
{
  if (!TryMatchRange('.', '.', '0', '9'))
    return false;
  return true;
}
The first method assumes the existance of a data type called SavePosition (as usual, you can define it yourself or inherit it from a base class that I wrote.) SavePosition's job is to
  1. Save the current InputPosition, and then restore that position when Dispose() is called
  2. Increment InputPosition by lookaheadAmt (which is often zero, but not always), where 0 <= lookaheadAmt < k.
The Try_ method is used to start a syntactic predicate and then backtrack, regardless of whether the result was true or false. The second method decides whether the expected input is present at the current input position. Syntactic predicates also assume the existence of TryMatch and (for lexers) TryMatchRange methods, which must return true (and advance to the next the input position) if one of the expected characters (or tokens) is present, or false if not.

Zero-width assertions: semantic predicates

Moving on now, another approach is
token Number  @[ {dot::bool=false;}
                 ('.' {dot=true;})?
                 '0'..'9'+ (&{!dot} '.' '0'..'9'+)?
               ];
Here I use the other kind of ZWA, the semantic predicate, to test whether dot is false (&{!dot}, which can be written equivalently as &!{dot}). &{expr} simply specifies a condition that must be true in order to proceed; it is normally used to resolve ambiguity between two possible paths through the grammar. Semantic predicates are a distinctly simpler feature than syntactic predicates and were implemented first in LLLPG. They simply test the user-defined expression during prediction.

So, here I have created a "dot" flag which is set to "true" if the first character is a dot (you write {dot::bool=false;} in LES, but when LLLPG adds C# input support, you can write {bool dot=false;} instead). The sequence '.' '0'..'9'+ is only allowed if the "dot" flag has not been set. This approach works correctly; however, you must exercise caution when using &{...} because &{...} blocks may execute earlier than you might expect them to; this is explained below.

The expression inside &{...} can include the "substitution variables" $LI and $LA, which refer to the current lookahead index and the current lookahead value; these are useful if you want to run a test on the input character. For example, when if you want to detect letters, you might write
rule Letter @[ 'a'..'z' | 'A'..'Z' ];
token Word @[ Letter+ ];
but this doesn't detect all possible letters; there are áĉçèntéd letters to worry about, grΣεk letters, Russiaи letters and so on. I've been supporting these other letters with a semantic and-predicate:
rule Letter @[ 'a'..'z' | 'A'..'Z'| &{char.IsLetter($LA -> char)} 0x80..0xFFFC ];
[FullLLk] token Word @[ Letter+ ];
0x80..0xFFFC denotes all the non-ASCII characters supported by a .NET char, while the arrow operator -> is the LeMP/LES notation for a cast; the C# equivalent is char.IsLetter((char) $LA). $LA will be replaced with the appropriate lookahead token, which is most often LA0, but not always. Now, when you look at the generated code, you'll notice that the and-predicate has been copied into the Word rule:
void Word()
{
  int la0;
  Letter();
  for (;;) {
    la0 = LA0;
    if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
      Letter();
    else if (la0 >= 128 && la0 <= 65532) {
      la0 = LA0;
      if (char.IsLetter((char) la0))
        Letter();
      else
        break;
    } else
      break;
  }
}
Copying and-predicates across rules is normal behavior that occurs whenever one rule needs to use the and-predicate to decide whether to call another rule. ANTLR calls this "hoisting", so that's what I call it too: the predicate was hoisted from Letter to Word. (In this particular case I had to add FullLLk to make it work this way; more about that in a future article.)

Gates

Here's one final technique:
token Number  @[ '.'? '0'..'9'+ =>
                 '0'..'9'* ('.' '0'..'9'+)? ];
This example introduces a feature that I call "the gate". The grammar fragment ('.'? '0'..'9'+) before the gate operator => is not actually used by Number itself, but it can be used by the caller to decide whether to invoke the rule.

A gate is an advanced but simple mechanism to alter the way prediction works. Recall that parsing is a series of prediction and matching steps. First the parser decides what path to take next, which is called "prediction", then it matches based on that decision. Normally, prediction and matching are based on the same information. However, a gate => causes different information to be given to prediction and matching. The left-hand side of the gate is used for the purpose of prediction analysis; the right-hand side is used for matching.

The decision of whether or not Token will call the Number rule is a prediction decision, therefore it uses the left-hand side of the gate. This ensures that the caller will not believe that Number can match an empty input. When code is generated for Number itself, the left-hand side of the gate is ignored because it is not part of an "alts" (i.e. the gate expression is not embedded in a "*" or "+" loop or an optional element "?"). Instead, Number runs the matching code, which is the right-hand side, '0'..'9'* ('.' '0'..'9'+)?.

Gates are a way of lying to the prediction system. You are telling it to expect a certain input, then saying "no, no, match this other input instead." Gates are rarely needed, but they can provide simple solutions to certain tricky problems.

Please note that gates, unlike syntactic predicates, do not provide unlimited lookahead. For example, if k=2, the characters 'c' 'd' in ('a' 'b' 'c' 'd' => ...) will not have any effect.

The gate operator => has higher precedence than |, so a | b => c | d is parsed as a | (b => c) | d.

More about and-predicates

Let's see now, I was saying something about and-predicates running earlier than you might expect. Consider this example:
flag::bool = false;
public rule Paradox @[ {flag = true;} &{flag} 'x' / 'x' ];
Here I've introduced the "/" operator. It behaves identically to the "|" operator, but has the effect of suppressing warnings about ambiguity between the two branches (both branches match 'x' if flag == true, so they are certainly ambiguous).

What will the value of 'flag' be after you call Paradox()? Since both branches are the same ('x'), the only way LLLPG can make a decision is by testing the and-predicate &{flag}. But the actions {flag=false;} and {flag=true;} execute after prediction, so &{flag} actually runs first even though it appears to come after {flag=true;}. You can clearly see this when you look at the actual generated code:
bool flag = false;
public void Paradox()
{
  if (flag) {
    flag = true;
    Match('x');
  } else
    Match('x');
}
What happened? Well, LLLPG doesn't bother to read LA0 at all, because it won't help make a decision. So the usual prediction step is replaced with a test of the and-predicate &{flag}, and then the matching code runs ({flag = true;} 'x' for the left branch and 'x' for the right branch).

This example will give the following warning: "It's poor style to put a code block {} before an and-predicate &{} because the and-predicate normally runs first."

In a different sense, though, and-predicates might run after you might expect. Let's look at the code for this Number rule from earlier:
token Number  @[ {dot::bool=false;}
               ('.' {dot=true;})?
               '0'..'9'+ (&{!dot} '.' '0'..'9'+)?
             ];
The generated code for this rule is
void Number()
{
int la0, la1; bool dot = false; la0 = LA0; if (la0 == '.') { Skip(); dot = true; } MatchRange('0', '9'); for (;;) { la0 = LA0; if (la0 >= '0' && la0 <= '9') Skip(); else break; }
la0 = LA0; if (la0 == '.') { if (!dot) { la1 = LA(1); if (la1 >= '0' && la1 <= '9') { Skip(); Skip(); for (;;) { la0 = LA0; if (la0 >= '0' && la0 <= '9') Skip(); else break; } } } } }
Here I would draw your attention to the way that (&{!dot} '.' '0'..'9'+)? is handled: first LLLPG checks if (la0 == '.'), then if (!dot) afterward, even though &{!dot} is written first in the grammar. Another example shows more specifically how LLLPG behaves:
token Foo @[ (&{a()} 'A' &{b()} 'B')? ];

void Foo()
{
  int la0, la1;
  la0 = LA0;
  if (la0 == 'A') {
    if (a()) {
      la1 = LA(1);
      if (la1 == 'B') {
        if (b()) {
          Skip();
          Skip();
        }
      }
    }
  }
}
First LLLPG tests for 'A', then it checks &{a()}, then it tests for 'B', and finally it checks &{b()}; it is as if the and-predicates are being "bumped" one position to the right. Actually, I decided that all zero-width assersions should work this way for the sake of performance. To understand this, consider the Letter and Word rules from earlier:
  rule Letter @[ 'a'..'z' | 'A'..'Z'| &{char.IsLetter($LA -> char)} 0x80..0xFFFC ];
  [FullLLk] token Word @[ Letter+ ];
In the code for Word you can see that char.IsLetter is called after the tests on LA0:
    if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
      Letter();
    else if (la0 >= 128 && la0 <= 65532) {
      la0 = LA0;
      if (char.IsLetter((char) la0))
        Letter();
      else
        break;
    } else
      break;
And this makes sense; if char.IsLetter were called first, there would be little point in testing for 'a'..'z' | 'A'..'Z' at all. It makes even more sense in the larger context of a "Token" rule like this one:
[FullLLk] rule Token @[ Spaces / Word / Number / Punctuation / Comma / _ ];
The Token method will look something like this (some newlines removed for brevity):
void Token()
{
  int la0, la1;
  la0 = LA0;
  switch (la0) {
  case '\t':  case ' ':
    Spaces();
    break;
  case '.':
    {
      la1 = LA(1);
      if (la1 >= '0' && la1 <= '9')
        Number();
      else
        Punctuation();
    }
    break;
  case '0':  case '1':  case '2':  case '3':  case '4':
  case '5':  case '6':  case '7':  case '8':  case '9':
    Number();
    break;
  case '!':  case '#':  case '$':  case '%':  case '&':
  case '*':  case '+':  case ',':  case '-':  case '/':
  case '<':  case '=':  case '>':  case '?':  case '@':
  case '\\': case '^':  case '|':
    Punctuation();
    break;
  default:
    if (la0 >= 'A' && la0 <= 'Z' || la0 >= 'a' && la0 <= 'z')
      Word();
    else if (la0 >= 128 && la0 <= 65532) {
      la0 = LA0;
      if (char.IsLetter((char) la0))
        Word();
      else
        MatchExcept();
    } else
      MatchExcept();
    break;
  }
}
In this example, clearly it makes more sense to examine LA0 before checking &{char.IsLetter(...)}. If LLLPG invoked the and-predicate first, the code would have the form
void Token()
{
  int la0, la1;
  la0 = LA0;
  if (char.IsLetter((char) la0)) {
    switch (la0) {
      // All the same switch cases as before
    }
  } else {
    switch (la0) {
      // All the same switch cases again!
    }
  }
}
The code of Token would be much longer, and slower too, since we'd call char.IsLetter on every single input character, not just the ones in the unicode range 0x80..0xFFFC. Clearly, then, as a general rule it's good that LLLPG tests the character values before the ZWAs.

Underscore and tilde

The underscore _ means "match any terminal", while ~(X|Y) means "match any terminal except X or Y". The next section has an example that uses both. In ~(X|Y), X and Y must be terminals (if X and/or Y are non-terminals, consider using something like &!(X|Y) _ instead.)

Setting lookahead

Pure LL(k) parsers look up to k terminals ahead to make a branching decision, and once a decision is make they stick to it, they don't "backtrack" or try something else. So if k is too low, LLLPG will generate code that makes incorrect decisions.

LLLPG's default k value is 2, which is enough in the majority of situations, as long as your grammar is designed to be LL(k). To increase k to X, simply add a [DefaultK(X)] attribute to the grammar (i.e. the LLLPG statement), or add a [k(X)] attribute to a single rule. Here's an example that represents "double-quoted" and """triple-quoted""" strings, where k=2 two is not enough:
private token DQString @[
 '"' ('\\' _  | ~('"'|'\\'|'\r'|'\n'))* '"'? ];
];
[k(4)]
private token TQString @[
 '"' '"' '"' nongreedy(Newline / _)* '"' '"' '"' 
 "'''"       nongreedy(Newline / _)* "'''"
];
[k(4)]
private token Token @[
 ( {_type = TT.Spaces;}    Spaces
 ...
 | {_type = TT.String;}    TQString
 | {_type = TT.String;}    DQString
 ...
 )
];
Here I've used "_" inside both kinds of strings, meaning "match any character", but of course this implies that the string can go on and on forever. To fix that, I add nongreedy meaning "exit the loop when it makes sense to do so" (greedy and nongreedy are explained more in my blog.)

With only two characters of lookahead, LLLPG cannot tell whether """this""" is an empty DQString ("") or a triple-quoted TQString. Since TQString is listed first, LLLPG will always choose TQString when a Token starts with "", but of course this may be the wrong decision. You'll also get a warning like this one:
warning : Loyc.LLParserGenerator.Macros.run_LLLPG: 
Alternatives (4, 5) are ambiguous for input such as «""» (["], ["])
[k(3)] is sufficient in this case, but it's okay if you use a number that is a little higher than necessary, so I've used [k(4)] here. In "bad" grammars that are always ambiguous in LL(k) (e.g. LL(*) or left-recursive grammars), I suspect that a high k value can lead to very large generated code, which is why I don't recommend using a really high DefaultK value.

Wrapping Up

By now we've covered all the basic features, and after you read my blog post on greedy & nongreedy, you should be able to write some pretty fancy parsers!

At the top of the article I discussed the Custom Tool (single-file generator) for VS2008-VS2013, and the LES syntax highlighter. For now, the LES syntax highlighter only works in Visual Studio 2013 (Pro), and it doesn't have all the features I want yet. In particular, it does't highlight the important difference between "foo(x)" and "foo (x)" (see my Loyc Expression Syntax article for details.)

In future articles I'll talk more about the following topics. Let me know what you'd like me to cover first!
  • token versus rule.
  • The API that LLLPG uses (and typical base class implementations): Match(), Skip(), MatchRange(), LA(k), LT(k), etc.
  • Saving inputs: The value of a terminal can be assigned to a variable using x:='a'..'z' to create a variable 'x', x='a'..'z' to assign the value to an existing variable 'x', or xs+='a'..'z' to add the value to a list 'xs' by calling list.Add(MatchRange('a', 'z')).
  • Managing ambiguity: most computer languages are actually somewhat ambiguous. LLLPG will first detect ambiguity and warn you about it using an example of an ambiguous input ("Alternatives (2, 4) are ambiguous for input such as «0x_»"). Branches are specified in priority order, and you can suppress warnings by separating branches with / (which works the same way as |, but slanted). In case of ambiguity with the exit branch, you can use "greedy" or "nongreedy" to specify whether looping or exiting, respectively, should have higher priority.
  • Error handling: By default, LLLPG generates prediction code with the assumption the input is always well-formed. But there are a couple of ways to handle errors, including an "error branch" as in (a | b | error { HandleError(); }), and the [NoDefaultArm] grammar attribute. Alternatively, you can designate one branch as the default (the default default branch is the last branch or the exit branch.) 
  • [FullLLk] versus "approximate" LL(k): by default, LLLPG uses a prediction system slightly weaker than LL(k). It resembles Terrance Parr's "Linear Approximate Lookahead", except that my version is a bit more powerful.
  • My favorite feature that LLLPG doesn't have: length-based disambiguation.
  • A real-life example: parsing LES, or other stuff. Of course, the sample code included with this article is almost a real-life example already. Other parsing suggestions welcome.
  • How-to: "tree parsing": how to use LLLPG with three-stage parsing, where you group tokens by parenthesis and braces before the main parsing stage.
  • How-to: keyword parsing: LLLPG can handle keywords directly, although the generated code is rather large.
  • How LLLPG works: a brief tour of the source code.
  • All about Loyc and its libraries
  • Things you can do with LeMP: other source code manipulators besides LLLPG.
  • Do you want LLLPG to support a language other than C#? Let me know which language. Better yet, ask me how to write a new output module, and I'll write a whole article just about that.
I would also love to hear how you're planning to use LLLPG in your own projects. And don't forget to share this article with your compiler-writing friends and university professors! (Dude, your friends write compilers? I wish I had friends like that.)