Friday, June 14, 2013

The root of all evil is optimization? Or apathy?

You've probably heard the refrain "Premature optimization is the root of all evil".

Well, how did that turn out? Every Windows computer is filled with little gadgets in the system tray. Users may not know what those little icons are, but each one seems to make the hard drive thrash for 10 additional seconds on startup, and uses 10 additional megabytes of RAM. Plus there's all these hidden services that the user cannot see or measure, but slow down the PC just as much as those system tray icons. Of course, some of these apps are small, optimized, and necessary, but a few bad apples (which users have no way to locate) ruin the startup experience.

I have a somewhat different opinion: I think that optimization is the main job of many programmers—not all programmers by any means, but many. Look, here's an algorithm that searches a list of points for the closest one:

  using System.Windows;
  ...
  public int ClosestPointTo(Point near, IList<Point> list)
  {
    return list.IndexOfMin(p => (p - near).Length);
  }

Well, that was easy! So how come I spent days researching and writing an R*-tree implementation? Because the easy solution is just too damn slow! Anybody can find "naive" solutions to problems, and if that was all we needed, it would be easy to find programmers that are "good enough". But inevitably, as the problem size grows larger, the naïve solution isn't good enough. And when the problem size gets really huge (as it inevitably will for somebody), even the solutions everyone thought were good become useless.

I admit, I have a bad habit of premature optimization, and it is a vice, sometimes. For example, for my company I wrote turn-by-turn navigation software called FastNav, which requires files in a proprietary format called a "NaviMap" file, which are converted from Shapefiles. I thought it would be neat to "script" the conversion process with text files, using ANTLR to parse expressions that would then be interpreted, but I was concerned that interpreting dynamically-typed expressions would be slow. So I spent a lot of time writing a little compiler that converted these textual expressions to statically-typed, compiled MSIL, typed according to the schema of the Shapefile.

So now I have this ultra-fast expression evaluator. But guess what? My MapConverter is still slow to this day, because it turned out that the bottleneck was elsewhere (I improved the worst bottleneck, which left other bottlenecks that were difficult to fix. Users weren't complaining, so I let it be).

But FastNav itself was also too slow, running on a 400MHz WinCE device, and I spent six months just optimizing it until my boss was satisfied (and that was after writing all the drawing primitives from scratch because WinCE drawing code is abysmal and AGG, while faster than WinCE itself, was still too slow). There are no good profilers for WinCE, and over time I developed an intuition that the bottlenecks of the code on WinCE were dramatically different than the bottlenecks on the desktop (actually, on the desktop version, there weren't really any bottlenecks to speak of); so much so that desktop profiling results were completely useless. So I painstakingly benchmarked flash I/O performance (horribly slow), floating point (horribly slow), font drawing (horribly slow if the text is rotated, fast otherwise), and all the various modules of FastNav, optimizing each one until the product was finally usable. I even optimized std::vector (writing a replacement called inline_vector and then, finding that this didn't really help, a simpler replacement called mini_vector), implemented my own hashtables, and replaced the memory manager (new and delete) to optimize small allocations.

Optimization has always been an important and necessary part of my work. Around 1998 I wrote the (now-dead) Super Nintendo emulator SNEqr in C++, but I was told that C optimizers are "really good now" and there's no reason to use assembly language anymore. Well, silly me, I believed them and wrote the CPU emulator in C--it was horrifically slow, and became about 10 times faster when I rewrote it in assembly language. And every program I write seems to end up with something that is too slow, something that either gets optimized--or that users just have to live with.

After all this experience, I have a tendency to optimize sooner rather than later, and it can be a bad habit because I may choose to optimize the wrong things--the non-bottlenecks.

But I'm convinced that premature optimization is nowhere near as bad as not giving a f*ck, which is a much more common practice. For example, how as it that every other program needs a splash screen and takes a few seconds to start on an idle system? I have never written a program that takes more than a second or two to start--except thanks to big and slow third-party components.

I recently made an app for Taxi dispatchers called IntelliMap. On my fast developer machine it takes at least 6 seconds to restart after having started once. But that's the WPF version. I originally wrote it with WinForms and that version restarts instantaneously, as if it was Notepad or something. But I was told that the user interface looked "too 90s" and should be modernized using WPF and Infragistics controls. Luckily I had used the MVVM pattern, and I was able to switch to new view code while using the same models and viewmodels. In fact, I kept the WinForms version operational in the same executable file. To this day you can start it with the "--winforms" switch and it starts instantly, albeit with a "90's" interface, and fewer features since I didn't bother maintaining the WinForms version.

The problem is worse than it sounds. Because while it might take 6 or 7 seconds to restart on my fast developer machine, it takes over 45 seconds to restart (not cold-start) on one of our client's machines. And it's not just startup time; the WPF UI is more sluggish, and uses a lot more memory.

This really ticks me off. I wrote a program that starts instantly. But then I had to use second- and third-party libraries that are hella slow. The "Premature optimization" argument says you should wait to optimize; wait until your application is slow, then profile the code to find and remove the bottlenecks. But there are three problems:

  1. If you're waiting for it to be slow on your fast developer machine, you're waiting too long. Some of your end users will have much older, slower hardware.
  2. If you don't have good habits, then your code will be slow throughout. So there won't just be one or two bottlenecks you have to optimize, but many; each bottleneck you fix will just cause the next one to become more apparent. If you have a habit of writing fast code, the bottlenecks will be fewer and you'll expend less effort optimizing (unless of course, the OS itself is slow, WinCE I'm talking to you).
  3. This argument is hard to apply to libraries. Apparently Microsoft and Infragistics felt that their WPF controls were fast and lean enough for them. But it's not fast enough for me! When writing a low-level library that other people will rely on, it's no good for a developer to wait until it's too slow for them. Libraries are used for many reasons by many people. Library code that is not a bottleneck in one application will surely become a bottleneck in some other application. Every application stresses low-level libraries somehow, but each app causes stress in a different place. This implies that core, oft-used libraries should be optimized uniformly. You might say "well, even so, it's only inner loops that need optimization". But lots of non-loops need optimization too, just in case the client application calls those non-loops inside an important loop.

In my opinion, the lower level the code is, the more important its speed is. I write a lot of low-level code, so speed is almost always important to me. And it's irritating to have to rely on slow libraries written by others, especially closed-source commercial stuff that I cannot even understand, let alone do anything about.

And when it comes to code that is used by lots of different people, premature optimization of the public interface or the system architecture may be warranted even when optimization of the implementation is not. I don't have any great examples handy, but consider the IEnumerator interface: you have to call MoveNext() and then Current--two interface calls per iteration. Since this is a fundamental, ubiquitous interface, used constantly by everyone and often used in tight loops, it would have been good if it could iterate and return the next item with a single interface call: bool MoveNext(out T current). Since it's a public API though, it cannot be changed; that's why public interfaces need careful design up-front. (Of course, performance isn't the only factor in API design; things like flexibility and ease-of-use are equally important.)

You don't have to optimize alone

Some people seem to believe that there are two kinds of languages: fast and efficient languages that make the developer work harder, like C/C++, and "RAD" languages that are easy to use and productive, like Ruby or C#, but have speed limits at runtime. Some people have assumed you can't have runtime performance and productivity all in one language, and I reject that assumption in the strongest terms. That language can exist, should exist, and even does exist to some extent (e.g. D2).

The arguments against premature optimization are that it's a waste of time if you're optimizing the wrong thing, or that it makes code harder to understand, or that you might make a mistake and turn correct code into faulty code. But what if it was easy, and what if it didn't harm readability at all? Would there be a reason not to optimize then?

An obvious example of this kind of optimization--easy optimization that doesn't harm readability--is when you go to your compiler settings and enable optimizations. Ahh, all in a day's work! But the optimizer can't do everything, because you have knowledge that it does not, and it will probably never be smart enough to convert your linear search of a sorted list into a binary search.

In the long run, a big part of the solution is to give the compiler more knowledge, e.g. by using programming languages with features like "effects" and "global optimizations". Another big part of the solution is to have standard libraries that not only have lots of fast algorithms to call upon, but also make those algorithms easy to find and use. But there are also simple and easy optimizations that the programmer could use himself in his own code, that just require some tweaks to our programming languages. For instance, let's say you want to run some sort of search through some data structure, and scan the results only if there is more than one of them. It's so easy to write this:

  if (DoSearch().Results.Count > 1)
     foreach(var r in DoSearch().Results) {
        // do something with r
     }
Now, if we refactor it like this instead:
  var rs = DoSearch().Results;
  if (rs.Count > 1)
     foreach(var r in rs) {
        // do something with r
     }

That's 20 seconds we'll never get back. Or perhaps more than 20 seconds if this is an "else if" clause in a chain, because you'll have to spend some time thinking about how to refactor it:

  if (...) {
     ...
  } else if (DoSearch().Results.Count > 1) {
     foreach(var r in DoSearch().Results) {
        // do something with r
     }
  } else if (...) {
     ...
  }

Plus, refactoring makes the code longer now. So should we even bother? Isn't this the premature optimization that we were warned about? But, what if the data structure is large? We shouldn't do the search twice, should we?

I say this dilemma shouldn't exist; it is entirely a flaw in the programming language. That's why I defined the "quick binding" operator for EC#:

  if (DoSearch().Results=:rs.Count > 1)
     foreach(var r in rs) {
        // do something with r
     }

It may look weird at first (it's reverse of the := short variable declaration operator in Go, and I am considering whether to use "::" for this operator instead) but now there's no dillema. It's easy to create a new variable to hold the value of DoSearch().Results, so you may as well just do it and move on. No need to weigh the pros and cons of "premature optimization".

Another good example is LINQ. Suppose we have a long list of numbers and we'd like to derive another list of numbers. Doesn't matter exactly what the query says, here's one:

List<int> numbers = ... // get a list somehow
var numbers2 = (from n in numbers where n < 100000 select n + 1).ToList();
But LINQ involves a bunch of delegate methods which, given the way .NET is designed, cannot be inlined. You can get more speed using a loop like this:

for (int trial = 0; trial < 200; trial++)
{
  numbers2 = new List<int>();
  for (int i = 0; i < numbers.Count; i++) {
    int n = numbers[i];
    if (n < 100000)
      numbers2.Add(n + 1);
  }
}

I just benchmarked this on my PC (200 trials, 1000000 random numbers of at most 6 digits) and got:

LINQ:    2500ms (100579 results)
for:     859ms (100579 results)
foreach: 1703ms (100579 results)

So the plain for-loop is almost 3 times faster (surprisingly the foreach version, not shown, is only slightly faster.)

So, should you write a plain for-loop instead to get that extra speed? Usually, the answer is "of course not". But my answer is "of course not--your computer should write the plain for-loop instead".

This is one of many reasons why EC# will have a procedural macro system. So that end-users can optimize code themselves, using macros to detect certain code patterns and rewrite them into more efficient forms. Of course, the most common optimizations can be bundled into a DLL and shared, so most people will not write these transformations themselves. Typically, a user will simply have to write a global attribute like [assembly: LinqToForLoop] to install the macro in their program, or they could attach an attribute to a class or method for more conservative optimization.

I haven't actually figured out how the LinqToForLoop macro code would look. My thinking right now is that this type of macro, that looks for a code pattern and rewrites it, should "register" the kinds of nodes it wants to look at. The compiler will look for these nodes on the macro's behalf and give them to the macro when found. This will be more efficient than the obvious solution of simply passing "the whole program" to the macro and letting the macro find things itself. Since programmers will inevitably use lots of macros, it would be terribly inefficient for each one to scan the program separately.

< Published on >

Thursday, June 13, 2013

LLLPG, Loyc trees, and LES, oh my!

Background: LLLPG —

I normally don't let these internal "planning" posts go to CodeProject, but my planning has reached an advanced stage and I think it would be fun to hear some feedback on my new parser generator and my plan for a new programming language called LES or LEL (depending on what you're doing with it). The parser generator is called LLLPG (Loyc LL(k) Parser Generator). "Loyc" means "Language of Your Choice" and it is my project for creating a set of generic tools for creating programming languages and for translating code between different languages* ; meanwhile, LL(k) is one of the three most popular types of parsers for computer languages (the others being PEG and LALR(1)... perhaps ANTLR's LL(*) is more popular than LL(k) nowadays, but I'm counting LL(*) as a mere variation on LL(k).)

LLLPG is a recursive-decent parser generator, with a feature set (and syntax) comparable to ANTLR version 2. LLLPG 1.0 is about 80% complete. All that's missing is a parser that can accept grammars in the form of text, and a bit of polishing, but the last 15% is the most important--no one will want to use LLLPG without a nice friendly input language, and right now LLLPG cannot parse its own input.

And I won't be using a dedicated input language like ANTLR does; no, LLLPG is designed to be embedded inside another programming language. Initially you will use LLLPG the same way you would use ANTLR, by invoking an exe as a pre-build step at compile-time, to produce C# code which is then compiled. But eventually, LLLPG will merely be a syntax extension to another language, one of a hundred extensions that you might have installed. You will add LLLPG as a compile-time reference, the same way you might add "System.Core" as a run-time reference today. This means that using a parser generator will be no more difficult than using any other library. When I'm done, you'll be able to easily embed parsers anywhere you want them; creating a parser that is generated and optimized at compile-time will be no more difficult than making a "compiled Regex" is today. But of course, a "compiled Regex" is actually compiled at runtime, while LLLPG will work entirely at compile-time*.

LLLPG has a very different design philosophy than ANTLR. ANTLR is a "kitchen sink" parser generator; it tries to do everything for you, it requires its own runtime library, and if you really want to be good at ANTLR, you'll buy the book that explains how to use it. I can't compete with ANTLR on features; Terrance Parr has been writing ANTLR and its predecessor for almost 20 years now, I think. Instead, LLLPG is designed for simplicity and speed, and its features tend to be orthogonal rather than intertwined; it is fairly easy to learn, it tends to generate fast parsers, the generated code is simple, easy-to-follow, and C#-specific, and it does not need a special runtime library, just a single base class that you can write yourself or copy into your project. Here is its feature set:
  • It generates a prediction-based LL(k) parser with no backtracking. The value of k is the maximum lookahead; it can be customized separately for each rule.
  • It warns you about ambiguity, and you can easily suppress irrelevant warnings (earlier alternatives take priority). Like ANTLR, LLLPG warns by example ("Alternatives (2, 4) are ambiguous for input such as «0x_»"). In case of ambiguity with an exit branch, you can mark loops as greedy or nongreedy to suppress the warning.
  • It supports semantic predicates, denoted &{expr}, which are used for disambiguation. If two branches are ambiguous, you can use a semantic predicate to put a condition on one or both branches that will be used to disambiguate; expr is simply a C# expression.
  • Syntactic predicates, denoted &(foo), are very similar, except that instead of running a C# expression, LLLPG tests whether the input matches the syntax "foo". Syntactic predicates, also known as zero-width assersions, can perform unlimited lookahead, so you can use them to escape the limitations of LL(k) grammars.
  • A gate, denoted p => m, is an advanced (but very simple) mechanism that separates "prediction" from "matching".
  • Performance optimizations include "prematch analysis", switch() statement generation, saving the lookahead terminal(s) in variables, and not using exceptions anywhere.
  • Alts (alternatives, i.e. branch points) have two error-handling modes for unexpected input. LLLPG can either (1) simply run the "default branch", or (2) produce an error-handling branch (automatically or manually). The default branch is normally the last branch, but this can be overridden (either to assist error handling or as a performance optimization.)
  • Actions, denoted {code;}, are code snippets inserted into the parser.
  • The value of a terminal can be assigned to a variable using x:='a'..'z' to create a variable 'x', x='a'..'z' to assign the value to an existing variable 'x', or xs+='a'..'z' to add the value to a list 'xs'.
When LLLPG is formally released, I will repeat all of this in a CodeProject article, or a whole series of them, with more detail. For now, the above information is just background for the real subject of this post.

Bootstrapping LLLPG

When you're making a parser generator that accepts text input, you naturally want to express the parser generator's own syntax in its own language, and if possible, you'd like to use your own parser generator to parse that language. But without any parser, how do you begin? The process of making a parser that will be able to "parse itself" is called bootstrapping.

I started out by taking advantage of C# operator overloading. I am currently writing parsers using the following C# code:
Final syntax C# syntax
'x' C('x') -- or Sym("foo") for the symbol \foo
'a'..'z' Set("[a-z]")
'[' OtherRule ']' C('[') + OtherRule + ']'
a | b | c a | b | c
a / b / c a / b / c -- but a b / c d needs parens: (a + b) / (c + d)
Foo* Star(Foo)
Foo+ Plus(Foo)
Foo? Opt(Foo)
&{i==0} And("i==0") -- using a special #rawText node to hold "i==0"
{Foo(bar);} Stmt("Foo(bar)")
a => b Gate(a, b)
Note that my unit tests have to use the C# syntax, since the tests have to be written before the big grammar that describes LLLPG's final input language.

My original plan for bootstrapping was to make a parser for EC# in C# with LLLPG (EC# is Enhanced C#, and one of its features is a "token literal" that allows embedded DSLs), or at least a parser for just the parts of EC# I needed for bootstrapping. Once this partial parser was written using the above mappings, I could then write a second parser in "partial EC#", and this second parser would be able to parse full EC# as well as itself.

However, as I planned out the parser, I was rather annoyed at how complex the EC# language is (because of its progenitor C#, of course) and I am now planning to create a new syntax that is easier to parse, tentatively called LES, "Loyc Expression Syntax"*.

The syntax tree interchange format

A major part of my plans for Loyc is the concept of an "interchange format" for code. In three words, the concept is "XML for code"--a general representation of syntax trees for any language. The interchange format will
  • Assist tools that convert code between different languages, by providing a way to change a tree incrementally from one language until it conforms to the rules of another language.
  • If a compiler uses Loyc trees internally, it can assist people writing compilers and compiler extensions by providing a simple textual representation of the compiler's intermediate output at various stages.
  • Promote compile-time metaprogramming to the entire software industry.
Let me be clear, I do not advocate XML as an interchange format. In fact, I hate XML! If I have to type &lt; one more time I could scream. Rather, I want to create a standard language for describing syntax trees, just as XML or JSON is used to describe data. (In fact, my language should be useful for describing data too, just like s-expressions can describe data, although that is not its main purpose).

The interchange format will serialize and deserialize text into an in-memory form called a Loyc tree. Every node in a Loyc tree is one of three things:
  • An identifier, such as the name of a variable, type, method or keyword.
  • A literal, such as an integer, double, string or character.
  • A "call", which represents either a method call, or a construct like a "class" or a "for loop".
Unlike in most programming languages, Loyc identifiers can be any string--any string at all. Even identifiers like \n\0 (a linefeed and a null character) are supported. This design guarantees that a Loyc tree can represent an identifier from any programming language on Earth. Literals, similarly, can be any object whatsoever, but when I say that I am referring to a Loyc tree that exists in memory. When a Loyc tree is serialized to text, obviously it will be limited to certain kinds of literals (depending on the language used for serialization).

Each Loyc node also has a list of "attributes" (usually empty), and each attribute is itself a Loyc tree.

Loyc trees are closely related to, and inspired by, LISP trees. If you've heard of LISP, well, the Loyc tree is basically a 21st century version of the S-expression. The main differences between s-expressions and Loyc trees are
  1. Each "call" has a "target". Whereas LISP represents a method call with (method arg1 arg2), Loyc represents a method call with method(arg1, arg2). In LISP, the method name is simply the first item in a list; but most other programming languages separate the "target" from the argument list, hence target(arg1, arg2). If you want to create a simple list rather than a method call, the identifier "#" (tentatively) means "list" by convention, as in #(item1, item2, item3).
  2. Each node has a list of attributes. The concept of attributes was inspired by .NET attributes, so it will not surprise you to learn that a .NET attribute will be represented in a Loyc tree by a Loyc attribute. Also, modifiers like "public", "private", "override" and "static" will be represented by attaching attributes to a definition.
  3. A "call", which represents either a method call, or a construct like a "class" or a "for loop". By convention, constructs that are built into a language use a special identifier that starts with "#", such as #class or #for or #public.
  4. Each node has an associated source file and two integers that identify the range of characters that the original construct occupied in source code. If a Loyc tree is created programmatically, a dummy source file and a zero-length range will be used.
Obviously, a text format is needed for Loyc trees. However, I think I can do better than just an interchange format, so I have a plan to make LES into both an interchange format and a general-purpose programming language in its own right.

Since LES can represent syntax from any language, I thought it was sensible to design it with no keywords. So tentatively, LES has no reserved words whatsoever, and is made almost entirely of "expressions". But "expressions" support a type of syntactic sugar called "superexpressions", which resemble keyword-based statements in several other languages.

I've made a somewhat radical decision to make LES partially whitespace-sensitive. I do not make this decision lightly, because I generally prefer whitespace to be ignored inside expressions*, but the whitespace sensitivity is the key to making it keyword-free. More on that later.

My original plan was to use a subset of Enhanced C# as my "XML for code". However, since EC# is based on C# it inherits some very strange syntax elements. Consider the fact that (a<b>.c)(x) is classified a "cast" while (a<b>+c)(x) is classified as a method call (As I am writing this in HTML, the < made me scream just now. Sorry for the noise.) Features like that create unnecessary complication that should not exist in an AST interchange format.

Meet LES: tentative design

As I describe the design for Loyc Expression Syntax, keep in mind that this is all tentative and I could be persuaded to design it differently. I have the following goals:
  • Concision: LES is not XML! It should not be weighed down by its own syntax.
  • Familiarity: LES should resemble popular languages.
  • Utility: although LES can represent syntax trees from any language, its syntax should be powerful enough to be the basis for a general-purpose language.
  • Light on parenthesis: I'm sick of typing lots of parenthesis in C-family languages. LES is my chance to decrease the amount of Shift-9ing and Shift-0ing that programmers do every day.
  • Simplicity: the parser should be as simple as possible while still meeting the above goals, because (1) this will be the default parser and printer for Loyc trees and will be bundled with the Loyc tree code, so it should not be very large, and (2) this is the bootstrapping language for LLLPG so (for my own sake) it can't be too complex.
First let me describe a simple subset of LES, the "prefix notation". Prefix notation is nothing more than expressing everything as if it was made out of method calls. Basically it's LISP, but with Loyc trees instead of S-expressions.

For example, C# code like
   if (x > 0) {
      Console.WriteLine("{0} widgets were completed.", x);
      return x;
   } else
      throw new NoWidgetsException();
might be represented in prefix notation as
   #if(@#>(x, 0), @`#{}`(
      @#.(Console, WriteLine)("{0} widgets were completed.", x),
      #return(x)
   ),
      #throw(#new(NoWidgetsException()))
   );
The prefix "@" introduces an identifier that contains special characters, and then the backquotes are added for identifiers with very special characters. So in @`#{}`, the actual identifier name is #{}, which is the built-in "function" that represents a braced block. Similarly @#> and @#. represent the identifiers "#>" and "#.". Please note that '#' is not considered a special character, nor is it an operator; in LES, # is just an identifier character, just as an underscore _ is an identifier character in the C family. However, # is special by convention, because it is used to mark identifiers that represent keywords and operators in a programming language.

Pure prefix notation is a bit clunky, so it will not be used very much. LES will have built-in operators so that the above code can be represented in a more friendly way:
   #if x > 0 {
      Console.WriteLine("{0} widgets were completed.", x);
      #return x;
   }
      #throw #new(NoWidgetsException());
However, this notation might be confusing when you first see it. "Why is there no #else?" you might wonder. "How do I even begin to parse that," you might say.

Well, here's my plan:
  • Braces {...} are simply subexpressions that will be treated as calls to the special identifier #{}. Inside braces, each argument to the #{} pseudo-function will end with a semicolon (the semicolon on the final statement is optional).
  • To make an identifier that contains punctuation, use the @ prefix. Such identifiers can contain both punctuation and identifier characters, e.g. @you+me=win!. Note that when I talk about "punctuation", the characters , ; { } ( ) [ ] ` " # do not count as punctuation (# counts as an identifier character). If you need extra-special characters such as these, enclose the identifier in backquotes: @`{\n}`. Escape sequences are allowed within the backquotes, including \` for the backquote itself. Identifiers can contain apostrophes without using @, e.g. That'sJustGreat (there are several languages that use apostrophes in identifiers). However, an apostrophe cannot be the first character or it will be interpreted as a character literal (e.g. 'A' is the character 'A').
  • There will be an unlimited number of binary operators such as +, -, !@$*, >>, and so on. An operator such as &|^ will correspond to a call to #&|^, e.g. x != 0 is simply a friendly representation of @#!=(x, 0). Simplicity being an important design element of LES, the precedence of each operator will be determined automatically based on the name of the operator and a fixed set of rules. For example, the operator ~= will have a very low precedence based on a rule that says "an operator whose name ends in '=' will have the same precedence as the assignment operator '='". This design allows LES to be "context-free"; an LES parser is virtually stateless and does not need a "symbol table" of any kind. A person reading LES only needs to know the precedence rules in order to figure out the precedence of a new operator that he or she has never seen before.
  • Operators whose names do not start with # can be formed with the \ prefix. Such operators can contain alphanumerics and punctuation, e.g. \div and \++ are binary or prefix operators; a \div b is equivalent to div(a, b) and a \++ b is equivalent to @++(a, b). Again, if you need extra-special characters, enclose the operator in backquotes: \`/*\n*/`.
  • LES will have "super-expressions", which are multiple expressions juxtaposed together with no separator except a space. If a "super-expression" contains more than one expression, the first expression contains a call target and the other expressions are arguments. For example, a.b c + d e; parses as three separate expressions: a.b, c + d, and e. After gathering this expression list, the parser treats the final "tight" expression in the first expression as "calling" the others (explanation below). Thus, together, the three expressions will be shorthand for a.b(c + d, e);.*
I realize that "superexpressions" might be confusing at first, but I think this design allows a rather elegant way to write "expressions" as if they were "statements". For example, the following is a valid expression:
if a == null {
  b();
} else {
  c();
};
and it actually means
if(a == null, { b(); }, else, { c(); });
which, in turn, is shorthand for
if(@#==(a, null), @`#{}`( b(); ), else, @`#{}`( c(); ));
All three inputs are parsed into the same Loyc tree.

Please note that none of this has any "built-in" meaning to LES. LES doesn't care if there is an "else" or not, it doesn't care if you use "#if" or "if", LES doesn't give any built-in semantics to anything. LES is merely a syntax--a text representation of a Loyc tree.

So the answer to questions like "why is there no #else" or "why did you use '#if' in one place, but 'if' in another place" is, first of all, that LES doesn't define any rules. Programming languages define these kinds of rules, but LES is not a programming language. It is just a textual representation of a data structure.

It is easy enough for the parser to parse a series of expressions. Basically, whenever you see two identifiers side-by-side (or other "atoms" such as numbers or braced blocks), like foo bar, that's the boundary between two expressions in a superexpression. When the parser sees a superexpression, it must decide how to join the "tail" expressions to the "head" expression. For example, this superexpression:
a = b.c {x = y} z;
Consists of the expressions "a = b.c", "{x = y}" and "z". How will these expressions be combined? My plan is basically that the other expressions are added to the first expression as though you had written this instead:
a = b.c({x = y}, z);
So it's as if the first space character that separates the first and second expressions is replaced with '(', a ')' is added at the end, and commas are added between the other expressions. "b.c" is what I am calling a "tight expression" because the operators in the expression (e.g. ".") are tightly bound (high precedence), and the expression is normally written without space characters. Confused yet?

The point of a "superexpression" is that you can write statements that look as if they were built into the language, even though they are interpreted by a fixed-function parser that has no idea what they mean. For example, in LES you can write
try {
   file.Write("something");
} catch(e::Exception) {
   MessageBox.Show("EPIC I/O FAIL");
} finally {
   file.Dispose();
};
And the parser handles this without any clue what "try", "catch" and "finally" mean.

Superexpressions also allow you to embed statement-like constructs inside expressions, like this:
x = (y * if z > 0 z else 0) + 1;
and the LES parser will understand that "z > 0", "z", "else" and "0" are arguments in a call to "if".

A syntax highlighting engine for LES should highlight the "tight expression" to indicate how the superexpression is interpreted. This syntax highlighting rule would highlight words like "if" and "try" when they are used this way (but not "else" or "finally", which are merely arguments to "if" and "try"; the parser cannot know that they have any special meaning.)

It's not quite as simple as I implied, though. If the first expression already ends with a method call:
a = b.c(foo) {x = y} z;
I think it would be better to interpret that as
a = b.c(foo, {x = y}, z);
rather than
a = b.c(foo)({x = y}, z);
A motivation for this exception is explained later.

Only one superexpression can exist per subexpression. For example, an expression with two sets of parenthesis, ... (...) ... (...) ... can contain at most three superexpressions: one at the outer level and one in each set of parenthesis.

You cannot "nest superexpressions" without parenthesis. for example, an expression of the form if x if y a else b else c can be parsed as if(x, if, y, a, else, b, else, c), but it won't mean anything because the second "if" is merely one of 8 arguments to the first "if"; there is nothing to indicate that it is special.

Did you notice how this plan is space-sensitive?

In LES you normally separate expressions with a space. Operators--that is, sequences of punctuation--are assumed to be binary if possible and unary if not. So "x + y" and "- z" are understood to be single expressions with binary and unary operators, respectively. When you put two of these expressions side-by-side, such as "- z x + y", it is seen as two separate expressions. If you write "x + y - z", on the other hand, this is understood as a single expression because "-" is assumed to be a binary operator in the second case. "x + y (- z)" is parsed as two expressions again. The whitespace between operators and arguments doesn't matter much; "x+y (-z)" is also acceptable and means the same thing. But the space between the expressions is important; "x+y (-z)" is two separate expressions, while "x+y(-z)" is a single expression meaning "x + (y(-z))"; here, y is the target of a method call.

I know it's not ideal. But in my experience, most programmers write
 if (...) {...}
 for (...) {...}
 lock (...) {...}
 while (...) {...}
 switch (...) {...}
with a space between the keyword and the opening paren, while they write
 Console.WriteLine(...);
 var re = new Regex(...);
 F(x);
with no space between the identifier and the method being called. So in a language that has no real keywords, it seems reasonable, although not ideal, to use a spacing rule to identify whether the first word of an expression is to be treated as a "keyword" or not.

So the spacing rule exists to identify "keyword statements" in a language that has no keywords. LEL can parse anything that looks like a keyword statement:
 if ... {...};
 for ... {...};
 lock ... {...};
 while ... {...};
 switch ... {...};
 unless ... {...};
 match ... {...};
But notice that a semicolon is needed at the end of each statement to separate it from the following statement. No parenthesis are required around the "subject" (e.g. the "c" in "if c"), but if parenthesis are present, the space character will ensure that the statement is still understood correctly. Statements like the above are automatically translated into an ordinary call,
 if(..., {...});
 for(..., {...});
 lock(..., {...});
 while(..., {...});
 switch(..., {...});
 unless(..., {...});
 match(..., {...});
There is another way that LES is whitespace-sensitive. Since LES contains an infinite number of operators, two adjacent operators must have spaces between them. For example, you cannot write x * -1 as x*-1 because *- will be seen as a single operator named #*-.

Attributes are expressed in LES with the square brackets:
[Flags] enum Emotion { 
   Happy = 1; Sad = 2; Angry = 4; Horndog = 65536;
};
There are some more details I could mention about LES, but you get the basic idea.

Meet LEL: the exact same language

LEL ("Loyc Expression Language") will be a programming language built on top of LES ("Loyc Expression Syntax"). It will use exactly the same parser as LES; in fact we can say that it "uses" LES, because LES consists of a parser and nothing else. LEL will then give meaning to the syntax. For example, in LES the word "if" is just an identifier, it has no built-in meaning. But in LEL, "if" will be defined as a standard macro that creates a standard "#if" expression, which will be part of some sort of common language that EC# and LEL will both use internally.

So in LEL,
if a > 0 {
  b();
} else {
  c();
};
means just what you would expect: call b() if a>0, otherwise call b(). However, LEL will not be a typical programming language. "if" will not be a built-in statement like it is in C#, for example. Instead, LEL will be a macro-heavy language, and "if" will be a standard macro. A macro, if you didn't know, is a special kind of method that runs at compile-time. Of course, LEL macros will be unrelated to C/C++ macros, which give the word "macro" a bad name. Like C/C++ macros, it is quite possible to abuse LEL macros; but unlike C/C++ macros, LEL macros will be fantastically powerful.

I don't have the details worked out about the macro system yet; I'll be sure to read more about Nemerle for inspiration.

The semantics of LEL will be based on EC#, and at first it will compile down to plain C# so it will not be able to surpass the runtime capabilities of C#, although it will still allow crazy-named identifiers by mapping punctuation to letters; for example @Save&Close will be represented in C# as something like Save_and_Close.

Things like classes and using statements are not hard to support with LES superexpressions, and I'm thinking about using :: for variable declarations:
using System.Collections.Generic;
[pub] class Point(ICloneable.[Point])
{
   [pub] X::int; // pub for public, see?
   [pub] Y::int;
};
I'm not an abbreviation freak, but [public] requires brackets around it, which is unfortunate, so I'll compensate for this extra syntactic noise by letting you abbreviate the attributes: public to pub, private to priv, static to stat, virtual to virt, and so on. Also, as in C#, you can separate attributes with a comma or use multiple bracket groups: [pub,stat] and [pub][stat] are equivalent.

To allow :: for variable declarations, I'm giving :: a precedence below dot (.) even though :: has the same precedence as dot in C# and C++. This allows s::System.String to parse as s::(System.String). (You might wonder why I don't just use a single colon : for this; well, for now I'll keep it as my little secret.)

".[...]" is the syntax for generic type arguments; ICloneable.[Point] has the interpretation #of(ICloneable, Point) which means ICloneable<Point> in C#. I could consider using the D syntax instead (ICloneable!Point), it's not set in stone. However, the C++/Java/C# syntax ICloneable<Point> is out of the question. It would be hard to parse and it makes the "superexpression" concept highly ambiguous.

It may seem possible to support C-style variable declarations like
String x;
But it really isn't; String x just means String(x)--it's an ordinary function call.

The method declaration syntax is a bit of a head-scratcher. I'm thinking about using this syntax:
[pub] Square(x::int)::int { x * x };
which is parsed as
[pub] (Square(x::int)::int)({ x * x });
[pub] @#::(Square(x::int), int)({ x * x }); // equivalent
[pub] @#::(Square(@#::(x, int)), int)(@`#{}`(x * x)); // equivalent
I'm just not sure how to write the specification for LEL in such a way that the compiler can make sense out of this peculiar representation. I'd also, just this once, like to lift the restriction against spaces between the method name and its arguments, so that you can write
@pub Square (x::int)::int { x * x };
but this form has the rather different syntax tree
@pub Square((x::int)::int, { x * x });
And this would be essentially indistinguishable from a call to a macro called "Square", so I'm a bit stumped. Hmm...

Let's talk about this weird spacing rule again, because you might wonder what happens if the spacing is incorrect. Obviously, the mistakes will be (1) to forget a space before '(', and (2) to use a space before '(' where there should not be one.

First consider the following example:
if(c) { a(); } else { b(); };
This will actually still work most of the time, because if there is already an argument list on the end of the first expression, the other expressions are added to that existing argument list (as I mentioned earlier). So this example is interpreted the same way as the correct version,
if (c) { a(); } else { b(); };
which means
if((c), { a(); }, else, { b(); });
But there is another version of this mistake that looks like this:
if(x+y) > z { a(); } else { b(); };
And this can be parsed as
if(x+y) > z({ a(); }, else, { b(); });
The 'if' macro itself could produce a somewhat understandable error message like "'if' expects 2 or 3 arguments but received only one. It looks like you forgot a space after 'if'." This would be followed by a complaint about there not being any method 'z' that takes 3 arguments.

The second mistake, adding a space when there should not be one, has different effects depending on the situation. In very simple cases like this one,
foo (x);
The interpretation is foo((x)); so it works. But in other cases, such as
x = foo (x) + 1;
a = foo (b, c);
The interpretations are different:
x = foo((x) + 1);
a = foo((b, c));
In the second case, (b, c) is parsed as a tuple (I haven't decided on the representation for tuples in LES yet), that is passed as a single parameter to foo().

Unfortunately, the code can still be parsed, but changes meaning. The primary defense against both of these mistakes is syntax highlighting: foo will show up in a different color when it is followed by a space than when it is not. I only hope that this distinction will be enough.

Assuming that foo() is a method, I suppose that LEL could prefer the "superexpression" syntax for macros, and issue a warning if a superexpression is used to call an ordinary method (the parser records the fact that "superexpression" syntax was used in the NodeStyle).

My current thinking is that the standard macros of LEL will essentially convert LEL into "built-in" constructs which are then compiled, and these constructs will be identical or very similar to EC# constructs, so that LEL and EC# can share the same back-end. With this design, LEL can easily be converted to C#, since I am planning to write an EC#-to-C# compiler. So the "if" macro, for example, will take input such as
if (a) b else c;
and produce output such as
#if (a) b c;
or in other words,
#if((a), b, c);

And now back to LLLPG

Remember, besides designing two new programming languages (LEL and EC#) I'm also writing a parser generator that will be used to describe both of these languages. After writing an LES parser, I will use it to describe the full grammar of EC#.

Please note that I already wrote a complete node printer for the EC# language, and the parser generator uses this to generate plain C# code (EC# is a superset of C#, so the parser generator simply generates a Loyc tree representing C# code, and prints it out using the EC# node printer.)

Right now, I can define a grammar in plain C# like this:
Rule Int = Rule("Int", Plus(Set("[0-9]")), Token);
// represents the grammar '0'..'9'+, which in LES might be written
//    Int @[ '0'..'9'+ ];
// Here, @[...] is a "token literal". It is tokenized by LES, but
// otherwise uninterpreted. This allows the LLLPG macro that is 
// parsing the grammar to interpret it instead.
and then I can generate C# code like this:
LLLPG.AddRule(Int);
LNode result = LLLPG.GenerateCode();
string CSharpCode = result.Print();
Where "LNode" is a "Loyc node", the data type of a Loyc tree. Here is the generated code:
{
  private void Int()
  {
    int la0;
    MatchRange('0', '9');
    for (;;) {
      la0 = LA0;
      if (la0 >= '0' && la0 <= '9')
        Skip();
      else
        break;
    }
  }
}
My end goal is to let you write a grammar in EC# syntax, which will look something like this:
using System;
using System.Collections.Generic;

[[LLLPG(IntStream)]]
class MyLexer {
  private Number ==> @[ 
    &('0'..'9'|'.')
    '0'..'9'* ('.' '0'..'9'+)?
  ];
  private Id      ==> @[ IdStart IdCont* ];
  private IdStart ==> @[ 'a'..'z'|'A'..'Z'|'_' ];
  private IdCont  ==> @[ IdStart|'0'..'9' ];
  private Spaces  ==> @[ ' '|'\t'|'\n'|'\r' ];
  public  Token   ==> @[ Id | Spaces | Number ];

  public void NormalMethod()
  {
    NormalCSharpCode();
  }
}
"==>" is the new EC# "forwarding operator". Never mind what that means, because in this case I have hijacked it for expressing LLLPG rules. Again, @[...] is a token literal that LLLPG will parse. "IntStream" refers to a helper class for code generation; it means that the input is a series of integers (actually characters, plus -1 to represent EOF).

So how do I get from here to there? Here's my plan for the bootstrapping process.
  1. First, I'll write a grammar for LES in C#, using LLLPG for code generation.
  2. Next, I'll write a skeleton version of LEL that can run simple non-hygienic macros (hygienic macros I can add later, when more infrastructure exists).
  3. One of the macros will gather input for LLLPG. Keep in mind that at this point there is no parser for the token literals--no parser that can understand input like "0x" ('0'..'9'|'a'..'f')+. So instead I will use LES as an intermediate language. LES supports suffix operators that start with "\\", so I could express a grammar with a syntax like this:
    LLLPG' IntStream (
      [pub] class MyLexer() {
        [priv] rule Number
          &('0'::'9'|'.') +
          '0'::'9'\\* + ('.' + '0'::'9'\\+)\\?;
        [priv] rule Id      IdStart IdCont\\*;
        [priv] rule IdStart 'a'::'z'|'A'::'Z'|'_';
        [priv] rule IdCont  IdStart|'0'::'9';
        [priv] rule Spaces  ' '|'\t'|'\n'|'\r';
        [pub]  rule Token   Id | Spaces | Number;
      }
    });
    
    Here I have changed '..' to '::' because '::' has higher precedence than '|' and '+' ('..' has lower precedence than both of them). The backslashes "\\" inform the LES parser that the "\\*" and "\\?" operators are suffix operators, not binary operators. This syntax isn't entirely comfortable, but it's better than the C# representation.
  4. The LLLPG macro will take the Loyc tree that came from the LES parser and translate it into a series of "Rule" objects. These objects are the "real" input to LLLPG's core engine. This conversion process will be written in plain C#.
  5. Next, I'll use this syntax to write a parser for the token literals. This parser will simply translate the token literals into the form you see above, so
    LLLPG IntStream (
      [pub] class MyLexer() {
        [priv] Number @[ 
          &('0'..'9'|'.') 
          '0'..'9'* ('.' '0'..'9'+)?
        ];
        [priv] Id      @[ IdStart IdCont* ];
        [priv] IdStart @[ 'a'..'z'|'A'..'Z'|'_' ];
        [priv] IdCont  @[ IdStart|'0'..'9' ];
        [priv] Spaces  @[ ' '|'\t'|'\n'|'\r' ];
        [pub]  Token   @[ Id | Spaces | Number ];
      }
    });
    
    Is translated to the form above. Thus, translating from LES to LLLPG Rules is a two-stage process with two separate interpreters. The reason for doing it this way is to allow third-party macros to run in between the two stages. My current thinking is that LEL could contain "high priority" and "low priority" lexical macros. First the high-priority macros run, then the low-priority ones run on a separate pass.

    Notice that I used "LLLPG'" in the first code listing but "LLLPG" in the second listing. LLLPG will be the high-priority macro and LLLPG' the low-priority second stage. This allows other lexical macros to process the code after the first stage, which will help end-users to do fancy things like generating grammars at compile-time.

    To tell you the truth, I have never seriously programmed in a language with a LISP-style macro system, so my approach may be flawed. I guess we'll find out.
  6. With both LLLPG fully implemented in LES, I can finally write a parser for the entire EC# language, in LES. (and remember, the LES code is translated to C# for compilation.) Once that's done, LLLPG can also support EC# as a host language with minimal changes. Bootstrapping complete!
I hereby open the floor to your comments and questions.

< Published on >

Saturday, June 1, 2013

D-style ranges in C# (.NET)

Ranges are the most important concept in the design of modern D collection code, and I want them to be included under the Loyc umbrella (but for now, don't worry what Loyc is.)

As I wrote in an as-yet unpublished article on about the design of my new language, Enhanced C#:
The really key thing I like about .NET is that it is specifically a multi-language, multi-OS platform with a standard binary format on all platforms--much like Java, but technically superior. It's only "multi-OS", of course, insofar as you avoid Windows-only libraries such as WPF and WCF, and I would encourage you to do so (if your boss will allow it). .NET solves the "integration gap" as long as the languages you want to mash up are available in .NET.

Without a multi-language runtime, interoperability is limited to the lowest common denominator, which is C, which is an extremely painful limitation. Modern languages should be able to interact on a much higher level.

A multi-language platform avoids a key problem with most new languages: they define their own new "standard library" from scratch. You know what they say about standards, that the nice thing about standards is that there are so many to choose from? I hate that! I don't want to learn a new standard library with every new language I learn. Nor do I want to be locked into a certain API based on the operating system (Windows APIs and Linux APIs). And all the engineering effort that goes into the plethora of "standard" libraries could be better spent on other things. The existence of many standard libraries increases learning curves and severely limits interoperability between languages.

In my opinion, this is the most important problem .NET solves; there is just one standard library for all languages and all operating systems (plus an optional "extra" standard library per language, if necessary). All languages and all operating systems are interoperable--it's really nice! The .NET standard library (called the BCL, base class library) definitely could be and should be designed better, but at least they have the Right Idea.
Unfortunately, while the .NET standard library has a lot of useful functionality, it has a lot of missing pieces, and pieces whose design was poorly thought out.

Since Loyc is all about interoperability between languages, it makes sense that there should be a "Loyc standard library" which provides similar functionality in a variety of programming languages. The .NET BCL is a good start, but not quite good enough. I certainly haven't had time to design a full-fledged standard library, but I'd like to talk today about some ideas I have about collections.

One of the most important elements of any standard library is the way it handles collections. Unfortunately, .NET's collection interfaces are a "bare minimum" design. There's IEnumerable(T), ICollection(T), IList(T), IDictionary(T) and that's about all.
  • Read-only lists must use the same interface as writable lists. This reduces type safety, and a read-only list is incompatible with .NET generics variance. Plus, implementing the entire IList(T) interface is a huge pain in the butt when you only want to provide read-only access.
  • Similarly for read-only collections: you can either implement just IEnumerable(T)--but then Count and Contains() are inaccessible--or the full ICollection(T) interface with Clear(), Add(), Remove(), and even implementing IsReadOnly and CopyTo is a chore.
  • There are no interfaces with AddRange() or InsertRange()
  • There is no support for slices (sections of a list or a sequence)--this is the single biggest problem in my opinion.
  • Although an enumerator can sometimes be restarted from the beginning with Reset(), you cannot save the current position and start from there later.
  • There is no specific support for backward enumeration, even though most collections could easily do it.
  • Various elements of the design are less efficient than they could be, such as the need to call two interface methods per iteration of IEnumerable, or the fact that two separate operations are required to perform operations such as "add value to dictionary if key does not already exist" (known as AddIfNotPresent in my new mutable dictionary, MMap<T>) or "get value and remove pair" (GetAndRemove).
  • There's lots of other stuff one could ask for.
Here are some of the collection interfaces my Loyc.Essentials library includes already:
  • ISource(out T): IEnumerable(T) plus ICount, which represents the Count property alone (covariant)
  • IListSource(out T): a simple read-only list collection (covariant)
  • INegListSource(T), INegArray(T) and INegAutoSizeArray(T): interfaces that represent lists that do not start at index zero
  • IPush(T), IPop(T), IStack(T), IQueue(T): interfaces with "push" and "pop" actions; IStack(T) and IQueue(T) are identical, but represent different behaviors
  • ISetImm(T,SetT): interface that represents an immutable set, with sub-interfaces ISetOperations(T,SetT) and ISetTests(SetT). (SetT is the type that implements the interface. It's a bit inconvenient to use interfaces parameterized on the set type itself, but I found that actually implementing a set interface that is not parameterized on itself is a damned inconvenience, but I digress.)
  • IArray(T): interface that represents a collection that can be modified, but cannot be resized.
  • IAutoSizeArray(T): an IArray(T) that resizes itself automatically when you assign an item to a new index.
  • IAddRange(T), IListRangeMethods(T): interfaces that contain additional collection methods such as AddRange() and RemoveRange()
  • ISinkCollection(in T), ISinkArray(in T), ISinkList(in T): interfaces that represent collections you can write to, but not read from (contravariant).
  • ICollectionEx(T): combines ISource(T), ISinkCollection(T), IAddRange(T) and RemoveAll(Predicate(T)), and of course ICollection(T).
  • IListEx(T): combines ICollectionEx(T), IArray(T), ISinkList(T), and of course IList(T).
I have observed that there are never quite enough interfaces: some people want an interface that includes AddRange(...), others will want an interface with RemoveAll(...); somebody will implement a collection that can Add() but not Remove() items; and almost no one actually calls CopyTo(). To meet everyone's needs, Go-style interfaces, which .NET does not support, would really help. With these, the missing interfaces can be retrofitted onto existing classes. And certainly Go-style interfaces are the best workaround for a standard library like the .NET BCL, whose built-in interfaces are so terribly impoverished.

But without support for Go-style interfaces, the best alternative is to define a large number of interfaces, and to provide adapter types to coerce old collections to the new interfaces.

Anyway, all of this is just background for the real topic of today's post: Ranges.

Ranges are an improvement on the C++ concept of iterators. I don't exactly know how ranges were invented in D, but perhaps someone noticed that most of the C++ STL algorithms require pairs of iterators:
bool            all_of(InputIterator first, InputIterator last, UnaryPredicate pred);
Function      for_each(InputIterator first, InputIterator last, Function fn);
InputIterator     find(InputIterator first, InputIterator last, const T& val);
difference_type  count(InputIterator first, InputIterator last, const T& val);
ForwardIterator unique(ForwardIterator first, ForwardIterator last, BinaryPredicate pred);
void           reverse(BidirectionalIterator first, BidirectionalIterator last);
void              sort(RandomAccessIterator first, RandomAccessIterator last, Compare comp);
ForwardIterator lower_bound(ForwardIterator first, ForwardIterator last, const T& val);
In fact, most STL algorithms require exactly two iterators--a range--and none require only a single iterator (I think). Passing ranges as iterators tends to be highly cumbersome; you are constantly repeating yourself:
push_heap(currentNode().children.begin(), currentNode().children.end());
I once read an in-depth article that explained all of this in more detail--I can't find the article now, but it discussed a "corner case" in C++ that requires three iterators instead of two, as well as how STL algorithms that return a single iterator (within a range) can map to D by returning a range. Since D does not use the iterator concept, if a D algorithm wants to return a reference to a single element within a range, instead of returning an iterator, it will return a range whose "front" is the single element that the algorithm wanted to return. But this range can be moved forward as usual with "popFront". Thus, for example, if you search for "6" in a list of numbers from 1 to 10, the search algorithm would return a range from 6 to 10.

I frequently run into difficulty in .NET because it does not have a "range" concept. We have LINQ, but that's not really enough. For instance if I want to get a sequence representing the second half of an array, I could write
     array.Skip(array.Length/2)
But the Skip() operation just wasted array.Length/2 loop iterations skipping elements one at a time, wasting a lot of CPU time, and the return value is no longer array-like--it's just an IEnumerable that can only be read from beginning to end. A lot of programmers are willing to "solve" the problem by throwing away CPU time:
     array.Skip(array.Length/2).ToArray() // viola!
but that's not good enough for me.

I'd like to solve many of .NET's limitations by bringing ranges from D to C#. D ranges are explained in a tutorial at this link. In summary, D has four kinds of ranges:
  • InputRange: requires the empty, front and popFront() member functions
  • ForwardRange: additionally requires the save member function
  • BidirectionalRange: additionally requires the back and popBack() member functions
  • RandomAccessRange: additionally requires the [] operator, and a length property if the range is finite
Now, there are some important differences between D and C#, so D-style ranges cannot be ported directly to C#:
  • D has templated methods that can examine an object at compile time to decide how to treat it. For example, to support output ranges, the put(range, element) operation can call range.put(element) if it exists, or if there is no put() method, it can modify range.front and then call range.popFront() instead.
  • D ranges are generally easier to copy (although this is not guaranteed).
The issue of how to copy a range is most irksome. To make this situation clear, imagine a C++ iterator. C++ iterators are modeled after C++ pointers, and if you copy a pointer:
   int* x = &array[i];
   int* y = x;
   x++;
Then obviously, the two pointers are independent: modifying x did not affect y. The simple '=' operator makes a copy, and copies are independent of each other.

So that's how C++ iterators work too. When you transfer an iterator with '=', you expect the copies to be independent, and they are (when implemented properly).

Since ranges are essentially just pairs of iterators, ideally, you would expect ranges to work the same way (although I hear that in D, it doesn't always work that way.)

Unfortunately, in .NET it is generally impossible to make ranges work that way. Certainly some ranges can work that way, such as a simple array range, which would naturally be a struct type:
struct ArraySlice<T> : IRange<T> {
  T[] _array;
  int _start, _count;
  ...
  public T this[int index] { 
    get { 
      /* TODO: check if index is out of range */ 
      return _array[_start + index];
    }
  }
  ...
}
Unfortunately, using '=' to make a copy is impossible in general. First of all, if IRange(T) is an interface that represents a random-access range, simply casting ArraySlice(int) to IRange(int) will defeat the '=' operator, so that it no longer makes copies.

And even if every range type were a struct, that still doesn't mean that they could be safely copied with '='. Consider a forward range that walks a B+ tree. In order to walk the tree, it is necessary to have a stack--a list of parent nodes, which will be an array or perhaps a Stack(T):
struct TreeRange<T> : IFRange<T> { // forward range
  private Stack<Pair<InternalNode<T>, int>> _parents;
  private LeafNode<T> _curNode; // node being walked right now
  private int _index;           // the current item in _curNode
  ...
}
But C# does not allow any action to be taken when a struct is copied, so if the array or Stack is stored in a struct, the Stack will not be copied when the '=' operator is used, even though all the other fields are copied. The two copies will have independent _index and _curNode fields, but share the same Stack! Therefore, after you've copied the range, advancing one of the copies of the range will (after a few iterations, anyway) completely screw up the other copy, making it behave in some bizarre way.

D mostly solves this problem with a "postblit constructor": like C#, D copies value types bit-for-bit; but unlike C#, D then calls the "postblit constructor", known as this(this) in code, on the copy. This allows the copy to properly "finish" the copy operation.

I cannot add postblit constructors to C#, but I still think that the range concept is important enough to support. Therefore, I am adding range interfaces to my Loyc.Essentials library in the Loyc.Collections namespace. I recommend that if a range type is not safe to copy with '=' (such as the TreeRange example), it should be a class instead. Luckily, the .NET garbage collector is much faster than the one in D, so using a class does not usually cause significant performance problems.

Besides, D does not solve the copying problem completely:
  1. D has a class of ranges called "input ranges" that cannot be copied, and
  2. D ranges could be implemented as class objects, which are very nearly the same as C# class objects; the '=' operator does not copy classes
For this reason* D has a save() function (actually it is a property, curiously enough) on ranges that duplicates the range safely. (* I speak as if I know what I'm talking about, but please note that I have not written any real programs with D.)

For the C# version of ranges, the interfaces will inherit ICloneable<R> so you can copy ranges explicitly with Clone(); this is equivalent to the save() method in D.

Here are the proposed range interfaces: IFRange (forward range), IBRange (bidirectional range), and IRange (finite random-access range). Note that it is not really necessary to define an "input range"; the standard IEnumerator(T) interface is already equivalent to an input range. Perhaps I'll add an IIRange (infinite random-access range) later, and maybe an output range, but for now I have defined these three:
public interface IFRange<out T> : IEnumerable<T>, ICloneable<IFRange<T>>
{
  bool IsEmpty { get; }
  T First { get; }
  bool DropFirst();
  T PopFirst(out bool empty);
}
public interface IBRange<out T> : IFRange<T>, ICloneable<IBRange<T>>
{
  T Last { get; }
  bool DropLast();
  T PopLast(out bool empty);
}
public interface IRange<out T> : IBRange<T>, IListSource<T>, ICloneable<IRange<T>>
{
}
// IListSource<T> is a read-only list (not a range) and ISource<T>
// is simply IEnumerable plus Count. I am wondering whether the naming 
// is appropriate; I chose the name "Source" as opposed to "Sink": a 
// Source provides data, while Sink accepts data.
public interface IListSource<out T> : ISource<T>
{
  T this[int index] { get; }
  T TryGet(int index, ref bool fail);
  IRange<T> Slice(int start, int count);
}
And in addition there are three mutable variants, IMFRange (allows you to modify First), IMBRange (allows you to modify First and Last) and IMRange (allows you to modify any element).

Tentatively I have used the names "First" and "Last" instead of "Front" and "Back" because the collection classes I've already written contain "First" and "Last" properties. I'm debating whether to change this for consistency with D. D's "empty" becomes "IsEmpty" in .NET, for consistency with other properties such as "IsReadOnly".

"DropFirst()" and "DropLast()" could just as easily be called "PopFirst()" and "PopLast()" for consistency with D, but I would like to eventually have extension methods called "PopFirst()" and "PopLast()" which not only chop off the first element of the range, but return that first element at the same time. These extension methods do not currently exist, however, due to a limitation of C#. Since a range could be a struct type, the extension method must take the range by reference to guarantee that it will actually be modified! Unfortunately, "(this ref Range self)" is an illegal argument list in C#. Nevertheless, I hope Enhanced C# will eventually support this type of extension method.

Finally, I've added special Pop operations that D does not have, and a Slice() method from IListSource:
  T PopFirst(out bool empty);
  T PopLast(out bool empty);
  IRange<T> Slice(int start, int count);
Come to think of it, I don't know how D solves the problem of slicing random-access ranges. Probably something involving the "$" and ".." operators.

Anyway, the thing I wanted to highlight was that PopFirst contains the functionality of IsEmpty, First, and DropFirst() all in one single method. It saves the value of IsEmpty in the 'empty' parameter, gets the value of First, drops the first element, and returns the old value of First (if !empty). I have complained loudly in the past about the fact that IEnumerator(T) requires two interface calls just to yield a single element*, and D's range interface is 50% worse, requiring three method calls per iteration! In D, this is rarely a problem because D ranges are used directly rather than through interfaces, so the compiler can inline the calls to all three methods. But in .NET land, interfaces are used far more often, so the three calls are fairly expensive.

The question, though, is whether it's worth requiring both approaches. The "chatty" approach with three methods is more convenient for the average caller who doesn't care much about performance, but the single-call method is crucial for high performance and should not be eliminated. So what should I do? Keep the four methods, drop all but the last one, or have some kind of compromise?

If extension methods were compatible with ranges, a very reasonable compromise would be to keep just IsEmpty and PopFirst, but eliminate First and DropFirst. Then, two extension methods would perform PopFirst without requiring a cumbersome "out" argument:
public static T PopFirst<R, T>(this ref R range, T defaultValue) where R : IFRange<T>
{
  bool fail;
  T next = range.PopFirst(out fail);
  if (!fail)
    return next;
  else
    return defaultValue;
}
public static T PopFirst<R,T>(this ref R range) where R:IFRange<T>
{
  bool fail;
  T next = range.PopFirst(out fail);
  if (fail) throw new EmptySequenceException();
  return next;
}
But as I explained, these extension methods cannot currently exist--not as extension methods, anyway. You can, however, call the non-extension method that does exist, Range.PopFirst(ref r). It's better than nothing, anyway.

Another compromise would be to eliminate DropFirst(), but keep First in case you want to "peek" at the next value without removing it (note that requiring a First property may increase the complexity or run-time size of a range type; and technically it is possible to implement First() as an extension method, so it is not strictly necessary for it to be part of the interface, although it is potentially faster when it is built-in.)

Finally, I am wondering whether to actually use the term "random-access range" or whether to use the term "slice" instead. As far as I can tell, in D the term "slice" means "array range", i.e. a section of an array, not some other data type. But "slice" rolls off the tongue nicely, and I am tempted to use it for all random-access ranges. In that case, IRange(T) would be renamed ISlice(T). Opinions, anyone?

In any case, I am in the process of adding range support to my collections library, Loyc.Collections. Note that the collection interfaces will go in Loyc.Essentials, a library of small methods, simple data structures and core interfaces, while most of the actual collection implementations will go in Loyc.Collections. I will also have to review the existing interfaces to understand how they relate to ranges and whether one or more of them should be eliminated or combined with ranges somehow. Perhaps I'll post again when it's done.

Loyc.Collections contains some data structures I've published articles about already, such as the AList, VList and CPTrie, but also some interesting unpublished data structures such as the BList, BMultiMap, Set/MSet, Map/MMap, InvertibleSet, and DList. Let me know if you would like me to prioritize writing articles about these things (I also have a parser generator and a new programming language in the works, you know, plus I have to find the time to set up a web site for all of this).

P.S. An ominous footnote:

The mutable ranges do not include a Clone method due to a limitation of C#. C# does not support covariance, which means that every time a derived interface supports cloning, the implementing class is required to write a separate clone method. Read-only ranges already have to implement up to three clone methods: ICloneable(IFRange(T)), ICloneable(IBRange(T)), ICloneable(IRange(T)), and that's in addition to the Clone method for the concrete type! If mutable ranges also supported cloning, they would add up to three more clone methods, which is really getting out of hand.

Even without mutable clone methods, implementing all the ICloneables can be a chore. But you can just copy and paste these, inserting an appropriate constructor call for your range type:
  IFRange<T>  ICloneable<IFRange<T>>.Clone() { return Clone(); }
  IBRange<T>  ICloneable<IBRange<T>>.Clone() { return Clone(); }
  IRange<T>   ICloneable<IRange<T>> .Clone() { return Clone(); }
  public MyType Clone() { return new MyType(...); }
When complete, EC# will, of course, take care of this crap for you.

Update: it was pointed out to me that although algorithms usually need ranges, there is other code that has some need for ordinary iterators. In .NET we typically use an index when we need an iterator, but this only works for random-access data structures, and it is not as type-safe as an iterator (although it has the virtue of storage efficiency.) For non-random-access data structures, the solutions are either ad-hoc (e.g. LinkedListNode inside a LinkedList) or non-existant (SortedDictionary has neither an iterator nor a range type; an iterator would have been one solution for this question I had). Comments are welcome about how to deal with the need for iterators in .NET.

I suppose the problem with C++ iterators is that they are useless without external context: you can't increment or decrement one without also comparing it to begin() or end() in its container, which implies that the caller must manually keep track of which container it came from. Thus, an iterator is hardly an improvement over a plain-old integer index, the only advantages being
  1. You can dereference it without reference to its container (but this is nothing special; you can dereference an object reference too)
  2. Unlike an index, it's compatible with non-random-access data structures (and huge data structures, for which a 32-bit integer index does not suffice).
Perhaps the iterator concept could be improved by being made self-aware: if the iterator "knew" and could tell you when it was at the beginning or end of its container. This would increase the storage requirement in some circumstances but not others. For example, an iterator to a doubly-linked list node can tell when it is at the beginning or end, but an iterator to a singly-linked list node can only tell when it is at the end (or rather, when there are no more elements after the current one. If the iterator follows the 'null' pointer at the end of a linked list, it becomes null itself, and can no longer increment or decrement).

A pointer inside an array may or may not be able to tell if it is at the end depending on how arrays work. Perhaps the way D heap arrays work would allow an array iterator, implemented as a simple pointer, to still know when it is safe to increment and decrement. The simplest possible .NET array iterator is an array reference plus an index, and again the iterator can know when it is at the beginning or end of the array--except that if the iterator came from inside a range, it would be unaware of the range's boundaries, which may be smaller than the array's boundaries.

The question is, what does an appropriate iterator type look like? Could it follow an interface that would make it more useful than a C++ iterator, but still more efficient than a range?

< Published on >
< Comments on Reddit > because Redditors don't like to comment on the actual blog.