Saturday, June 1, 2013

D-style ranges in C# (.NET)

Ranges are the most important concept in the design of modern D collection code, and I want them to be included under the Loyc umbrella (but for now, don't worry what Loyc is.)

As I wrote in an as-yet unpublished article on about the design of my new language, Enhanced C#:
The really key thing I like about .NET is that it is specifically a multi-language, multi-OS platform with a standard binary format on all platforms--much like Java, but technically superior. It's only "multi-OS", of course, insofar as you avoid Windows-only libraries such as WPF and WCF, and I would encourage you to do so (if your boss will allow it). .NET solves the "integration gap" as long as the languages you want to mash up are available in .NET.

Without a multi-language runtime, interoperability is limited to the lowest common denominator, which is C, which is an extremely painful limitation. Modern languages should be able to interact on a much higher level.

A multi-language platform avoids a key problem with most new languages: they define their own new "standard library" from scratch. You know what they say about standards, that the nice thing about standards is that there are so many to choose from? I hate that! I don't want to learn a new standard library with every new language I learn. Nor do I want to be locked into a certain API based on the operating system (Windows APIs and Linux APIs). And all the engineering effort that goes into the plethora of "standard" libraries could be better spent on other things. The existence of many standard libraries increases learning curves and severely limits interoperability between languages.

In my opinion, this is the most important problem .NET solves; there is just one standard library for all languages and all operating systems (plus an optional "extra" standard library per language, if necessary). All languages and all operating systems are interoperable--it's really nice! The .NET standard library (called the BCL, base class library) definitely could be and should be designed better, but at least they have the Right Idea.
Unfortunately, while the .NET standard library has a lot of useful functionality, it has a lot of missing pieces, and pieces whose design was poorly thought out.

Since Loyc is all about interoperability between languages, it makes sense that there should be a "Loyc standard library" which provides similar functionality in a variety of programming languages. The .NET BCL is a good start, but not quite good enough. I certainly haven't had time to design a full-fledged standard library, but I'd like to talk today about some ideas I have about collections.

One of the most important elements of any standard library is the way it handles collections. Unfortunately, .NET's collection interfaces are a "bare minimum" design. There's IEnumerable(T), ICollection(T), IList(T), IDictionary(T) and that's about all.
  • Read-only lists must use the same interface as writable lists. This reduces type safety, and a read-only list is incompatible with .NET generics variance. Plus, implementing the entire IList(T) interface is a huge pain in the butt when you only want to provide read-only access.
  • Similarly for read-only collections: you can either implement just IEnumerable(T)--but then Count and Contains() are inaccessible--or the full ICollection(T) interface with Clear(), Add(), Remove(), and even implementing IsReadOnly and CopyTo is a chore.
  • There are no interfaces with AddRange() or InsertRange()
  • There is no support for slices (sections of a list or a sequence)--this is the single biggest problem in my opinion.
  • Although an enumerator can sometimes be restarted from the beginning with Reset(), you cannot save the current position and start from there later.
  • There is no specific support for backward enumeration, even though most collections could easily do it.
  • Various elements of the design are less efficient than they could be, such as the need to call two interface methods per iteration of IEnumerable, or the fact that two separate operations are required to perform operations such as "add value to dictionary if key does not already exist" (known as AddIfNotPresent in my new mutable dictionary, MMap<T>) or "get value and remove pair" (GetAndRemove).
  • There's lots of other stuff one could ask for.
Here are some of the collection interfaces my Loyc.Essentials library includes already:
  • ISource(out T): IEnumerable(T) plus ICount, which represents the Count property alone (covariant)
  • IListSource(out T): a simple read-only list collection (covariant)
  • INegListSource(T), INegArray(T) and INegAutoSizeArray(T): interfaces that represent lists that do not start at index zero
  • IPush(T), IPop(T), IStack(T), IQueue(T): interfaces with "push" and "pop" actions; IStack(T) and IQueue(T) are identical, but represent different behaviors
  • ISetImm(T,SetT): interface that represents an immutable set, with sub-interfaces ISetOperations(T,SetT) and ISetTests(SetT). (SetT is the type that implements the interface. It's a bit inconvenient to use interfaces parameterized on the set type itself, but I found that actually implementing a set interface that is not parameterized on itself is a damned inconvenience, but I digress.)
  • IArray(T): interface that represents a collection that can be modified, but cannot be resized.
  • IAutoSizeArray(T): an IArray(T) that resizes itself automatically when you assign an item to a new index.
  • IAddRange(T), IListRangeMethods(T): interfaces that contain additional collection methods such as AddRange() and RemoveRange()
  • ISinkCollection(in T), ISinkArray(in T), ISinkList(in T): interfaces that represent collections you can write to, but not read from (contravariant).
  • ICollectionEx(T): combines ISource(T), ISinkCollection(T), IAddRange(T) and RemoveAll(Predicate(T)), and of course ICollection(T).
  • IListEx(T): combines ICollectionEx(T), IArray(T), ISinkList(T), and of course IList(T).
I have observed that there are never quite enough interfaces: some people want an interface that includes AddRange(...), others will want an interface with RemoveAll(...); somebody will implement a collection that can Add() but not Remove() items; and almost no one actually calls CopyTo(). To meet everyone's needs, Go-style interfaces, which .NET does not support, would really help. With these, the missing interfaces can be retrofitted onto existing classes. And certainly Go-style interfaces are the best workaround for a standard library like the .NET BCL, whose built-in interfaces are so terribly impoverished.

But without support for Go-style interfaces, the best alternative is to define a large number of interfaces, and to provide adapter types to coerce old collections to the new interfaces.

Anyway, all of this is just background for the real topic of today's post: Ranges.

Ranges are an improvement on the C++ concept of iterators. I don't exactly know how ranges were invented in D, but perhaps someone noticed that most of the C++ STL algorithms require pairs of iterators:
bool            all_of(InputIterator first, InputIterator last, UnaryPredicate pred);
Function      for_each(InputIterator first, InputIterator last, Function fn);
InputIterator     find(InputIterator first, InputIterator last, const T& val);
difference_type  count(InputIterator first, InputIterator last, const T& val);
ForwardIterator unique(ForwardIterator first, ForwardIterator last, BinaryPredicate pred);
void           reverse(BidirectionalIterator first, BidirectionalIterator last);
void              sort(RandomAccessIterator first, RandomAccessIterator last, Compare comp);
ForwardIterator lower_bound(ForwardIterator first, ForwardIterator last, const T& val);
In fact, most STL algorithms require exactly two iterators--a range--and none require only a single iterator (I think). Passing ranges as iterators tends to be highly cumbersome; you are constantly repeating yourself:
push_heap(currentNode().children.begin(), currentNode().children.end());
I once read an in-depth article that explained all of this in more detail--I can't find the article now, but it discussed a "corner case" in C++ that requires three iterators instead of two, as well as how STL algorithms that return a single iterator (within a range) can map to D by returning a range. Since D does not use the iterator concept, if a D algorithm wants to return a reference to a single element within a range, instead of returning an iterator, it will return a range whose "front" is the single element that the algorithm wanted to return. But this range can be moved forward as usual with "popFront". Thus, for example, if you search for "6" in a list of numbers from 1 to 10, the search algorithm would return a range from 6 to 10.

I frequently run into difficulty in .NET because it does not have a "range" concept. We have LINQ, but that's not really enough. For instance if I want to get a sequence representing the second half of an array, I could write
     array.Skip(array.Length/2)
But the Skip() operation just wasted array.Length/2 loop iterations skipping elements one at a time, wasting a lot of CPU time, and the return value is no longer array-like--it's just an IEnumerable that can only be read from beginning to end. A lot of programmers are willing to "solve" the problem by throwing away CPU time:
     array.Skip(array.Length/2).ToArray() // viola!
but that's not good enough for me.

I'd like to solve many of .NET's limitations by bringing ranges from D to C#. D ranges are explained in a tutorial at this link. In summary, D has four kinds of ranges:
  • InputRange: requires the empty, front and popFront() member functions
  • ForwardRange: additionally requires the save member function
  • BidirectionalRange: additionally requires the back and popBack() member functions
  • RandomAccessRange: additionally requires the [] operator, and a length property if the range is finite
Now, there are some important differences between D and C#, so D-style ranges cannot be ported directly to C#:
  • D has templated methods that can examine an object at compile time to decide how to treat it. For example, to support output ranges, the put(range, element) operation can call range.put(element) if it exists, or if there is no put() method, it can modify range.front and then call range.popFront() instead.
  • D ranges are generally easier to copy (although this is not guaranteed).
The issue of how to copy a range is most irksome. To make this situation clear, imagine a C++ iterator. C++ iterators are modeled after C++ pointers, and if you copy a pointer:
   int* x = &array[i];
   int* y = x;
   x++;
Then obviously, the two pointers are independent: modifying x did not affect y. The simple '=' operator makes a copy, and copies are independent of each other.

So that's how C++ iterators work too. When you transfer an iterator with '=', you expect the copies to be independent, and they are (when implemented properly).

Since ranges are essentially just pairs of iterators, ideally, you would expect ranges to work the same way (although I hear that in D, it doesn't always work that way.)

Unfortunately, in .NET it is generally impossible to make ranges work that way. Certainly some ranges can work that way, such as a simple array range, which would naturally be a struct type:
struct ArraySlice<T> : IRange<T> {
  T[] _array;
  int _start, _count;
  ...
  public T this[int index] { 
    get { 
      /* TODO: check if index is out of range */ 
      return _array[_start + index];
    }
  }
  ...
}
Unfortunately, using '=' to make a copy is impossible in general. First of all, if IRange(T) is an interface that represents a random-access range, simply casting ArraySlice(int) to IRange(int) will defeat the '=' operator, so that it no longer makes copies.

And even if every range type were a struct, that still doesn't mean that they could be safely copied with '='. Consider a forward range that walks a B+ tree. In order to walk the tree, it is necessary to have a stack--a list of parent nodes, which will be an array or perhaps a Stack(T):
struct TreeRange<T> : IFRange<T> { // forward range
  private Stack<Pair<InternalNode<T>, int>> _parents;
  private LeafNode<T> _curNode; // node being walked right now
  private int _index;           // the current item in _curNode
  ...
}
But C# does not allow any action to be taken when a struct is copied, so if the array or Stack is stored in a struct, the Stack will not be copied when the '=' operator is used, even though all the other fields are copied. The two copies will have independent _index and _curNode fields, but share the same Stack! Therefore, after you've copied the range, advancing one of the copies of the range will (after a few iterations, anyway) completely screw up the other copy, making it behave in some bizarre way.

D mostly solves this problem with a "postblit constructor": like C#, D copies value types bit-for-bit; but unlike C#, D then calls the "postblit constructor", known as this(this) in code, on the copy. This allows the copy to properly "finish" the copy operation.

I cannot add postblit constructors to C#, but I still think that the range concept is important enough to support. Therefore, I am adding range interfaces to my Loyc.Essentials library in the Loyc.Collections namespace. I recommend that if a range type is not safe to copy with '=' (such as the TreeRange example), it should be a class instead. Luckily, the .NET garbage collector is much faster than the one in D, so using a class does not usually cause significant performance problems.

Besides, D does not solve the copying problem completely:
  1. D has a class of ranges called "input ranges" that cannot be copied, and
  2. D ranges could be implemented as class objects, which are very nearly the same as C# class objects; the '=' operator does not copy classes
For this reason* D has a save() function (actually it is a property, curiously enough) on ranges that duplicates the range safely. (* I speak as if I know what I'm talking about, but please note that I have not written any real programs with D.)

For the C# version of ranges, the interfaces will inherit ICloneable<R> so you can copy ranges explicitly with Clone(); this is equivalent to the save() method in D.

Here are the proposed range interfaces: IFRange (forward range), IBRange (bidirectional range), and IRange (finite random-access range). Note that it is not really necessary to define an "input range"; the standard IEnumerator(T) interface is already equivalent to an input range. Perhaps I'll add an IIRange (infinite random-access range) later, and maybe an output range, but for now I have defined these three:
public interface IFRange<out T> : IEnumerable<T>, ICloneable<IFRange<T>>
{
  bool IsEmpty { get; }
  T First { get; }
  bool DropFirst();
  T PopFirst(out bool empty);
}
public interface IBRange<out T> : IFRange<T>, ICloneable<IBRange<T>>
{
  T Last { get; }
  bool DropLast();
  T PopLast(out bool empty);
}
public interface IRange<out T> : IBRange<T>, IListSource<T>, ICloneable<IRange<T>>
{
}
// IListSource<T> is a read-only list (not a range) and ISource<T>
// is simply IEnumerable plus Count. I am wondering whether the naming 
// is appropriate; I chose the name "Source" as opposed to "Sink": a 
// Source provides data, while Sink accepts data.
public interface IListSource<out T> : ISource<T>
{
  T this[int index] { get; }
  T TryGet(int index, ref bool fail);
  IRange<T> Slice(int start, int count);
}
And in addition there are three mutable variants, IMFRange (allows you to modify First), IMBRange (allows you to modify First and Last) and IMRange (allows you to modify any element).

Tentatively I have used the names "First" and "Last" instead of "Front" and "Back" because the collection classes I've already written contain "First" and "Last" properties. I'm debating whether to change this for consistency with D. D's "empty" becomes "IsEmpty" in .NET, for consistency with other properties such as "IsReadOnly".

"DropFirst()" and "DropLast()" could just as easily be called "PopFirst()" and "PopLast()" for consistency with D, but I would like to eventually have extension methods called "PopFirst()" and "PopLast()" which not only chop off the first element of the range, but return that first element at the same time. These extension methods do not currently exist, however, due to a limitation of C#. Since a range could be a struct type, the extension method must take the range by reference to guarantee that it will actually be modified! Unfortunately, "(this ref Range self)" is an illegal argument list in C#. Nevertheless, I hope Enhanced C# will eventually support this type of extension method.

Finally, I've added special Pop operations that D does not have, and a Slice() method from IListSource:
  T PopFirst(out bool empty);
  T PopLast(out bool empty);
  IRange<T> Slice(int start, int count);
Come to think of it, I don't know how D solves the problem of slicing random-access ranges. Probably something involving the "$" and ".." operators.

Anyway, the thing I wanted to highlight was that PopFirst contains the functionality of IsEmpty, First, and DropFirst() all in one single method. It saves the value of IsEmpty in the 'empty' parameter, gets the value of First, drops the first element, and returns the old value of First (if !empty). I have complained loudly in the past about the fact that IEnumerator(T) requires two interface calls just to yield a single element*, and D's range interface is 50% worse, requiring three method calls per iteration! In D, this is rarely a problem because D ranges are used directly rather than through interfaces, so the compiler can inline the calls to all three methods. But in .NET land, interfaces are used far more often, so the three calls are fairly expensive.

The question, though, is whether it's worth requiring both approaches. The "chatty" approach with three methods is more convenient for the average caller who doesn't care much about performance, but the single-call method is crucial for high performance and should not be eliminated. So what should I do? Keep the four methods, drop all but the last one, or have some kind of compromise?

If extension methods were compatible with ranges, a very reasonable compromise would be to keep just IsEmpty and PopFirst, but eliminate First and DropFirst. Then, two extension methods would perform PopFirst without requiring a cumbersome "out" argument:
public static T PopFirst<R, T>(this ref R range, T defaultValue) where R : IFRange<T>
{
  bool fail;
  T next = range.PopFirst(out fail);
  if (!fail)
    return next;
  else
    return defaultValue;
}
public static T PopFirst<R,T>(this ref R range) where R:IFRange<T>
{
  bool fail;
  T next = range.PopFirst(out fail);
  if (fail) throw new EmptySequenceException();
  return next;
}
But as I explained, these extension methods cannot currently exist--not as extension methods, anyway. You can, however, call the non-extension method that does exist, Range.PopFirst(ref r). It's better than nothing, anyway.

Another compromise would be to eliminate DropFirst(), but keep First in case you want to "peek" at the next value without removing it (note that requiring a First property may increase the complexity or run-time size of a range type; and technically it is possible to implement First() as an extension method, so it is not strictly necessary for it to be part of the interface, although it is potentially faster when it is built-in.)

Finally, I am wondering whether to actually use the term "random-access range" or whether to use the term "slice" instead. As far as I can tell, in D the term "slice" means "array range", i.e. a section of an array, not some other data type. But "slice" rolls off the tongue nicely, and I am tempted to use it for all random-access ranges. In that case, IRange(T) would be renamed ISlice(T). Opinions, anyone?

In any case, I am in the process of adding range support to my collections library, Loyc.Collections. Note that the collection interfaces will go in Loyc.Essentials, a library of small methods, simple data structures and core interfaces, while most of the actual collection implementations will go in Loyc.Collections. I will also have to review the existing interfaces to understand how they relate to ranges and whether one or more of them should be eliminated or combined with ranges somehow. Perhaps I'll post again when it's done.

Loyc.Collections contains some data structures I've published articles about already, such as the AList, VList and CPTrie, but also some interesting unpublished data structures such as the BList, BMultiMap, Set/MSet, Map/MMap, InvertibleSet, and DList. Let me know if you would like me to prioritize writing articles about these things (I also have a parser generator and a new programming language in the works, you know, plus I have to find the time to set up a web site for all of this).

P.S. An ominous footnote:

The mutable ranges do not include a Clone method due to a limitation of C#. C# does not support covariance, which means that every time a derived interface supports cloning, the implementing class is required to write a separate clone method. Read-only ranges already have to implement up to three clone methods: ICloneable(IFRange(T)), ICloneable(IBRange(T)), ICloneable(IRange(T)), and that's in addition to the Clone method for the concrete type! If mutable ranges also supported cloning, they would add up to three more clone methods, which is really getting out of hand.

Even without mutable clone methods, implementing all the ICloneables can be a chore. But you can just copy and paste these, inserting an appropriate constructor call for your range type:
  IFRange<T>  ICloneable<IFRange<T>>.Clone() { return Clone(); }
  IBRange<T>  ICloneable<IBRange<T>>.Clone() { return Clone(); }
  IRange<T>   ICloneable<IRange<T>> .Clone() { return Clone(); }
  public MyType Clone() { return new MyType(...); }
When complete, EC# will, of course, take care of this crap for you.

Update: it was pointed out to me that although algorithms usually need ranges, there is other code that has some need for ordinary iterators. In .NET we typically use an index when we need an iterator, but this only works for random-access data structures, and it is not as type-safe as an iterator (although it has the virtue of storage efficiency.) For non-random-access data structures, the solutions are either ad-hoc (e.g. LinkedListNode inside a LinkedList) or non-existant (SortedDictionary has neither an iterator nor a range type; an iterator would have been one solution for this question I had). Comments are welcome about how to deal with the need for iterators in .NET.

I suppose the problem with C++ iterators is that they are useless without external context: you can't increment or decrement one without also comparing it to begin() or end() in its container, which implies that the caller must manually keep track of which container it came from. Thus, an iterator is hardly an improvement over a plain-old integer index, the only advantages being
  1. You can dereference it without reference to its container (but this is nothing special; you can dereference an object reference too)
  2. Unlike an index, it's compatible with non-random-access data structures (and huge data structures, for which a 32-bit integer index does not suffice).
Perhaps the iterator concept could be improved by being made self-aware: if the iterator "knew" and could tell you when it was at the beginning or end of its container. This would increase the storage requirement in some circumstances but not others. For example, an iterator to a doubly-linked list node can tell when it is at the beginning or end, but an iterator to a singly-linked list node can only tell when it is at the end (or rather, when there are no more elements after the current one. If the iterator follows the 'null' pointer at the end of a linked list, it becomes null itself, and can no longer increment or decrement).

A pointer inside an array may or may not be able to tell if it is at the end depending on how arrays work. Perhaps the way D heap arrays work would allow an array iterator, implemented as a simple pointer, to still know when it is safe to increment and decrement. The simplest possible .NET array iterator is an array reference plus an index, and again the iterator can know when it is at the beginning or end of the array--except that if the iterator came from inside a range, it would be unaware of the range's boundaries, which may be smaller than the array's boundaries.

The question is, what does an appropriate iterator type look like? Could it follow an interface that would make it more useful than a C++ iterator, but still more efficient than a range?

< Published on >
< Comments on Reddit > because Redditors don't like to comment on the actual blog.

6 comments:

Anonymous said...

I am just curious, have you seen this: http://msdn.microsoft.com/en-us/library/hh881542.aspx ?

Qwertie said...

Not before I wrote the article, no.

Harry McIntyre said...

Small, specific interfaces. I read something once that suggested that it would have been a Good Thing if interfaces could only have one member (but could still inherit).

As for your adapter types/coercion methods, do you have something like

var countable = someCollection.As(), which one could use instead of the standard as/is operators?

These would be handy if I was wanting to write some code that was optimised for types with adapters.

Qwertie said...

Yes, I am a believer in interfaces that only define one or two methods or add one or two methods on top of an existing interface. But I'd rather have go-style interfaces.

I am not sure what you mean about As(), but there are extension methods such as AsListSource() and AsList() to convert between old and new interfaces. And EC# defines a new form of the 'as' and cast operator '->' which is more fluent, as well as a third cast operator.

Foo(as IBar).Bar()
objectList[i](->string).Length

And I think I'll rename IListSource to IReadList in light of the new IReadOnlyList in .NET 4.5. IReadList has two methods that IReadOnlyList does not have, and the name 'IReadOnlyList' isn't fully appropriate because although the interface provides only read-only access, that doesn't mean that the list is actually read-only. It could still be writable--just not by you.

Unknown said...

someCollection.OfType().ToList() in linq, but not with his interfaces...
impl:
foreach(var item in source)
if(item is T)
yield return (T)item; //cannot use as with generics

navya said...
This comment has been removed by a blog administrator.