Loyc, etc.: Using Loyc.Essentials: collection interfaces

Loyc.Essentials is a library of "stuff that should be in the .NET BCL, but isn't." Today I will present one of the most important parts of Loyc.Essentials: its collection interfaces.

The .NET 4 version of Loyc.Essentials defines the .NET 4.5 interfaces IReadOnlyCollection<T>, IReadOnlyList<T> and IReadOnlyDictionary<T> (handy if you are still targeting Windows XP, which does not support .NET 4.5), and on top of that defines numerous additional interfaces for queues, stacks, slices, bidirectional enumerators, ranges, sinks, and more. (The new .NET 3.5 version relies on Theraot.Core instead to provide these interfaces and other features of .NET 4/4.5.) Several of these interfaces are widely used throughout Loyc.Essentials itself and other projects such as Loyc.Collections and LLLPG.

Programmers with less experience, or that don't work in areas where performance is important, may underestimate the importance of data structures and collection interfaces in software engineering. I have come to believe that collection interfaces are very important, and that the built-in collection interfaces of the .NET framework are somewhat impoverished and inflexible. I like to work with a variety of data structures and I like to have a variety of collection interfaces available to reflect that variety. There's so much more out there than just List<T>: linked lists, B+ trees, VLists, ALists, Hash Array Mapped Tries, Bloom filters, other search trees and tries... the possibilities are endless.

First, let me talk a little about the limitations that the current BCL interfaces suffer from.

The .NET framework does not support much diversity. Traditionally it defined just four collection interfaces: IEnumerable<T>, ICollection<T>, IList<T>, and IDictionary<T>, and all but one were mutable.

Flaw #1: IList<T>/ICollection<T> contain some methods that are almost never used, like CopyTo(), but lack methods that are often desired, like AddRange() and RemoveRange(), and they lack methods that would be often used if they existed, like Slice().

Flaw #2: Lack of support for read-only interfaces. In .NET 4.5, they finally introduced IReadOnlyCollection<T>, IReadOnlyList<T> and IReadOnlyDictionary<T>, but they didn't derive IList<T> from IReadOnlyList<T> and they didn't declare that the core collection classes actually implement these interfaces. Plus, "IReadOnlyCollection" is an terribly long name for simply adding a Count property to IEnumerable (Loyc.Essentials used to define exactly the same interface, but it was called IEnumerableCount<T>).

Ideologically, I'm opposed to the idea that a mutable interface such as IList<T> should not implement IReadOnlyList<T>. If a method takes an argument of type IReadOnlyList<T>, it should mean that the method will not modify the list; it should not mean that the caller has to create a special read-only wrapper object on the heap, and it certainly should not mean that the caller has to guarantee that the list will never change. IReadOnlyList<T> should not be a burden or difficulty that the caller must overcome, rather it should simply be a promise that the called method will not modify the list. That leads me to flaw #3...

Flaw #3: because IReadOnlyList<T> and IList<T> are officially unrelated, and IReadOnlyList<T> is not supported very well, if a data structure implements both interfaces, C# "ambiguity" errors are inevitable. Here's how it typically happens: you want to declare a method that takes a list but doesn't modify it:

  void Foo(IReadOnlyList<T> list) {...}

But you notice that most collection classes don't implement IReadOnlyList<T>, so you add an overload that automatically converts IList<T> to IReadOnlyList<T> with an extension method (e.g. AsListSource() in the Loyc.Collections namespace):

  // Alias for Foo(IReadOnlyList<T>)
  void Foo(IList<T> list) { Foo(list.AsListSource()); }

So far, so good. But as new data structures are implemented, they will naturally implement IReadOnlyList<T>. If they are mutable, they will want to implement IReadOnlyList<T> so that it will be possible to pass them to methods that accept IReadOnlyList; meanwhile, immutable data structures might choose to explicitly implement IList<T> in order to stay compatible with the large amount of older code based on IList<T> for read-only access. So what's the problem? Suppose I have a data structure MyList<T> that implements both interfaces. Look what happens when I call Foo:

Foo(new MyList<T>()); // ERROR: The call is ambiguous 
  // between the following methods or properties: 
  // '...Foo(System.Collections.Generic.IList<T>)' and 
  // '...Foo(System.Collections.Generic.IReadOnlyList<T>)'

Flaw #4: A lack of variety and flexibility. If you actually survey the various data structures that exist, you'll find a wide variety of capabilities that they may or may not offer.

A data structure might support an indexing operator but not a Count property (consider the Sieve of Eratosthenes of infinite size). Other data structures may support a Count property in theory, but calling it should be avoided because it may be expensive (in Loyc.Essentials, BufferedSequence<T> has this property).
A data structures may be queue- or stack-like, offering access to the beginning/end of a data structure but not the middle.
A data structure might not be zero-based, e.g. the range of valid indexes could be 1..100 or -100..100 or 35..39.
A data structure may be sparse, in which case scanning it for items with a for-loop from 0..Count is inefficient.
A data structure may be a sink, allowing insertion of data but not reading (write-only files and pipes may be viewed this way).
Most list data structures could allow enumeration to start at any point in the middle, but .NET ignores this possibility; in addition, indexing is less efficient than enumeration for some lists (e.g. AList), so this should be supported.
A data structures may not support access by integer index, but could still allow bi-directional enumeration moving both forward and backward), e.g. a doubly-linked list, with insertion locations denoted by enumerator positions.
Many data structures can easily tell you if they are empty, but require O(N) time to report their Count. .NET almost forces you to use Count == 0 to detect emptiness; you can use LINQ's Any() function instead, but this requires a heap allocation and two interface calls.

This list is certainly incomplete; the possible combinations of operations that data structures may or may not support is just about endless.

Other collection libraries outside .NET have recognized this diversity, and provide more concepts or interfaces to address it, such as the C++ STL's various iterator and container concepts, or D's ranges.

Suppose I write a method that reads a list starting at the outside and moving inward (e.g. first element, last element, second element, second-last element, etc.). A linked list can certainly support this order of traversal; unfortunately, in .NET, the only standard collection interface that can be used in this situation is IList<T>, which a linked list data structure cannot support efficiently. Thanks to .NET's simplistic interfaces, my method can either support standard linked lists, or normal lists, but not both at once.

This is the central problem that the .NET interfaces fail to solve. When you write a method or a class that operates on a collection, you cannot declare precisely what kind of collection you accept, so most of the time you have to mandate functionality you don't need. Some collection types will not be able to provide that functionality that you mandated but didn't need, so they won't implement the interface and will be incompatible, unnecessarily. Meanwhile, other collection types might implement the interface even though they are not compatible, and then throw an exception when you try to use unsupported functionality.

How Loyc.Essentials improves the situation

Loyc.Essentials improves the situation mainly by providing a much wider variety of interfaces. The new interfaces mostly address flaw #4, but can only partly alleviate the pain of flaws 1, 2 and 3.

Interfaces are cheap to define: they are only a few lines of code plus some documentation, so it's well worth the effort of defining a variety of them, even if no implementations are provided (too bad Microsoft didn't see it that way). Before I get to the actual interfaces, here's a list of the key concepts that are supported:

Sources: objects that provide data. Source interfaces are read-only.
Sinks: objects that accept data. Sink interfaces are write-only.
Slices: subsections of an indexed list, e.g. elements 5..10 of a larger list. The broad class of Divide-and-conquer algorithms benefit from these.
Ranges: Similar to slices, ranges represent a subsection of a collection, but they may only allow access to the first, or first and last, elements of that subsection. For performance reasons, ranges are mutable in the sense that you can shrink the range one element at a time. This does not modify the underlying collection, it just causes the range to point to a smaller area in the same collection. Ranges were popularized by D, and while they don't work as well in .NET, I thought that the interfaces were still worth providing. Here is a long-ish article about my design of Ranges in .NET.
Fancy enumerators: after designing the range interfaces, I realized that often all you need is an enumerator that can go backward. Binumerators can travel both backward and forward through a collection, and mutable variants allow the current item to be changed or deleted.
Specific categories of data structures: queues, arrays, and sparse lists.
"Neg" non-zero-based lists: indexed lists for which the minimum index is not necessarily zero.

The Loyc.Essentials collection interfaces

The full documentation of these interfaces can be seen in the source code and will be provided automatically by Visual Studio Intellisense, as long as you have the Loyc.Essentials.xml file alongside your copy of Loyc.Essentials.dll.

First, an interface you won't use: ICount:

  public interface ICount : IIsEmpty
  {
    int Count { get; }
  }

I thought the concept of "having a count" should be separated because some collections may have a count but do not allow enumeration or vice versa. For example, I defined IStack<T> as a collection that had a Count but did not allow enumeration. However, Microsoft made ICount unusable in .NET 4.5 by defining IReadOnlyCollection:

  public interface IReadOnlyCollection : IEnumerable
  {
    int Count { get; }
  }

How does this new interface make it impossible to use ICount? Well, if I define any interface that inherits from both IReadOnlyCollection and ICount, it becomes impossible (well, quite difficult) to call Count on that interface because the C# compiler says that the reference to Count is "ambiguous". The end result is that people won't bother using ICount at all (admittedly, the number of cases where somebody needs the Count of a collection, and nothing else, is very small). For completeness, I recently added IIsEmpty, because as I mentioned, IsEmpty can run much faster than Count in some data structures:

  public interface IIsEmpty
  {
    bool IsEmpty { get; }
  }

However, I decided not to mandate this property as part of the common Loyc.Essentials interface IListSource<T>, although it is a part of IFRange<T> (listed near the bottom of this article).

My single most favorite interface in Loyc.Essentials is IListSource<T>. As its name implies, it is a kind of source; it's basically IReadOnlyCollection<T> with extra functionality:

  public interface IListSource<out T> : IReadOnlyList<T>
  {
    // If index is invalid, sets fail=true and returns default(T)
    T TryGet(int index, out bool fail); 
    IRange<T> Slice(int start, int count = int.MaxValue);
  }

Both of these new functions are favorites of mine, although it's usually easier to call one of the standard TryGet() extension methods (especially the first one):

  public static partial class LCInterfaces
  {
    public static T TryGet<T>(this IListSource<T> list,
                              int index, T defaultValue);
    public static bool TryGet<T>(this IListSource<T> list, 
                                 int index, ref T value);
  }

Plus, a couple of standard methods of IList<T> are added as extension methods:

  public static partial class LCInterfaces
  {
    public static int IndexOf<(this IReadOnlyList<T> list, T item)
    public static void CopyTo<T>(this IReadOnlyList<T> c, T[] array, int arrayIndex)
  }

And there are other extension methods that will be covered in a future article.

I don't know about you, but in my code I often have to check whether the index is in range before I call the indexer:

  if (index < list.Count && list[index] >= 0) {...}

With IListSource this is easier:

  if (list.TryGet(index, -1) >= 0) {...}

In theory, this version should be faster too, because it needs only one interface dispatch, not two. That's not the only reason for TryGet() to exist, though; there are a few collection types for which it is expensive to call Count.

The second extension method uses ref T value to allow the caller to set a default T value before calling the method, since you will not always want to use default(T) as the default. If the index is invalid, value is left unchanged.

Meanwhile, the Slice() method returns a sub-section of the list without allocating a new list. For example, list.Slice(10, 10) gets the collection of 10 items starting at index 10. IRange<T> is a superset of IListSource<T>; see definition below. Don't worry, if you're creating a collection class, you never actually have to write a class that implements IRange<T>. Just use this implementation:

  public IRange<T> Slice(int start, int count) {
    return new Slice_<T>(this, start, count);
  }

or the potentially more efficient version,

  IRange<T> IListSource<T>.Slice(int start, int count)
    { return Slice(start, count); }
  public Slice_<T> Slice(int start, int count)
    { return new Slice_<T>(this, start, count); }

Slice_ is a struct in Loyc.Essentials that provides a "view" on part of a read-only list. Why the underscore? I wanted to simply call it "Slice", but that is illegal in C# because Slice_ itself implements IListSource<T>, so it contains a method named Slice, and a method is not allowed to have the same name as its containing class. By the way, there's also a ListSlice<T> class and Slice() extension method for slicing IList<T>.

Next up, here are the "neg lists". These are list interfaces that do not (necessarily) use zero as the minimum index. So far, I haven't used these interfaces in practice.

  public interface INegListSource<T> : IReadOnlyCollection<T>
  {
    int Min { get; }
    int Max { get; }
    T this[int index] { get; }
    T TryGet(int index, out bool fail);
    IRange<T> Slice(int start, int count = int.MaxValue);
  }
  public interface INegArray : INegListSource
  {
    new T this[int index] { set; get; }
    bool TrySet(int index, T value);
  }
  public interface INegAutoSizeArray<T> : INegArray<T>
  {
    void Optimize();
  }
  public interface INegDeque<T> : INegArray<T>, IDeque<T> 
  {
  }

Here's documentation, if you're interested. INegAutoSizeArray is interesting: it automatically enlarges itself when you write to an index below Min or above Max.

Next there's a higher-performance variation on INotifyCollectionChanged, which can only be implemented by collections that implement IListSource<T>:

  public interface INotifyListChanging<T>
  {
    event ListChangingHandler<T> ListChanging;
  }
  public delegate void ListChangingHandler<T>(
    IListSource<T> sender, ListChangeInfo<T> args);
  [Serializable]
  public class ListChangeInfo<T> : EventArgs
  {
    public ListChangeInfo(NotifyCollectionChangedAction action,
      int index, int sizeChange, IListSource<T> newItems);

    public readonly NotifyCollectionChangedAction Action;
    public readonly int Index;
    public readonly int SizeChange;
    public readonly IListSource<T> NewItems;
  }

Why is this faster? Because unlike INotifyCollectionChanged, this interface requires as little as one memory allocation each time the list changes (for the ListChangeInfo object). INotifyCollectionChanged requires the collection to allocate a list of OldItems and NewItems; but because ListChanging fires before the list changes rather than afterward, there is no need to allocate a list of old items--the event handler can simply look at the current state of the list. Meanwhile, it is possible to optimize NewItems not to require an allocation.

Admittedly, I have found that sometimes an event handler would really prefer to see the list after it changes rather than before. But if performance matters, this interface may be better. You be the judge.

Next I present this peculiar bunch of interfaces, which reflects the difficulty I had designing an interface for immutable sets:

  public interface ISetTests<SetT>
  {
    bool IsProperSubsetOf(SetT other);
    bool IsProperSupersetOf(SetT other);
    bool IsSubsetOf(SetT other);
    bool IsSupersetOf(SetT other);
    bool Overlaps(SetT other);
    bool SetEquals(SetT other);
  }
  public interface ISetOperations<in T, SetT> 
    : ISetOperations<T, SetT, SetT> { }
  public interface ISetOperations<in T, in InSetT, out OutSetT>
  {
    OutSetT With(T item);
    OutSetT Without(T item);
    OutSetT Union(InSetT other);
    OutSetT Intersect(InSetT other);
    OutSetT Except(InSetT other);
    OutSetT Xor(InSetT other);
  }
  public interface ISetImm<T, SetT> : ISetOperations<T, SetT>, 
    ISetTests<SetT>, IReadOnlyCollection<T> //, ICount
  {
    bool IsInverted { get; }
  }
  public interface ISetImm<T> : ISetOperations<T, IEnumerable<T>>, 
    ISetTests<IEnumerable<T>>, IReadOnlyCollection<T>
  {
  }

The most important interface is at the end: ISetImm<T>. If you work through it logically, you can see that ISetImm<T> contains the following members:

public interface ISetImm<T> : ...
{
  bool IsProperSubsetOf(IEnumerable<T> other);
  bool IsProperSupersetOf(IEnumerable<T> other);
  bool IsSubsetOf(IEnumerable<T> other);
  bool IsSupersetOf(IEnumerable<T> other);
  bool Overlaps(IEnumerable<T> other);
  bool SetEquals(IEnumerable<T> other);
  ISetImm<T> With(T item);
  ISetImm<T> Without(T item);
  ISetImm<T> Union(IEnumerable<T> other);
  ISetImm<T> Intersect(IEnumerable<T> other);
  ISetImm<T> Except(IEnumerable<T> other);
  ISetImm<T> Xor(IEnumerable<T> other);
  bool IsInverted { get; }
}

.NET already defines a set interface for mutable sets, ISet<T>, but immutable sets are more convenient to work with so I added this interface. In Loyc.Collections.dll, I implemented a very nice immutable set class called Set<T>, which is a kind of Hash Tree. Originally my Set<T> structure implemented only ISetImm<T, Set<T>>, because it was easier, but eventually I also implemented the more generic ISetImm<T> interface. I also made a mutable variant of Set<T>, the MSet<T> class, which implements both ISet<T> and ISetImm<T>.

Finally, I created an InvertibleSet<T> class, which can represent everything that is not in some other set; to account for this possibility, I added the IsInverted property to ISetImm.

Sparse lists are indexed lists in which regions of indexes may be unused. Here are the sparse list interfaces I defined:

  public interface ISparseListSource<T> : IListSource<T>
  {
    int? NextHigher(int index);
    int? NextLower(int index);
    bool IsSet(int index);
  }
  public partial class LCInterfaces
  {
    public static IEnumerable<KeyValuePair<int, T>> Items<T>(
      this ISparseListSource<T> list);
  }
  public interface ISparseList<T>
    : ISparseListSource<T>, IListAndListSource<T>
  {
    void Clear(int index, int count = 1);
    void InsertSpace(int index, int count = 1);
  }
  public interface ISparseListEx<T> 
    : ISparseList<T>, IListEx<T>, IAutoSizeArray<T>
  {
    void InsertRange(ISparseListSource<T> list);
  }

ISparseListSource is a read-only sparse list, while ISparseList is mutable and ISparseListEx additionally allows you to insert the contents of one entire sparse list into another. According to the definitions shown here, an ISparseList<T> is fundamentally different from an IDictionary<int, T> because when you insert or remove an item from a sparse list at index i, the index associated with each and every item above index i also changes. A Dictionary can't do that, while a good sparse list implementation can do it efficiently.

The motivating example that led me to create these interfaces is the concept of anchor points in a text editor, such as the locations of error messages, breakpoints, and syntax highlighting markers. Clearly, this data is sparse: the majority of characters do not have an associated error message or color change, so ideally we should store this information in some sort of "compressed" data structure. As the user types new text in the middle of the file, all of these associated locations that appear in the second half should be shifted; that's what a sparse list does. Loyc.Collections offers one implementation if ISparseList<T>, called SparseAList<T>. It is not always a fast data structure, but it scales up effectively to very large files.

Next up, here are the IQueue, IStack and IDeque interfaces, with associated extension methods for added convenience, which are mostly self-explanatory if you know what queues and stacks are:

  public interface IPush<in T>
  {
    void Push(T item);
  }
  
  public interface IPop<out T>
  {
    T TryPop(out bool isEmpty);
    T TryPeek(out bool isEmpty);
    bool IsEmpty { get; }
  }
  public static partial class LCInterfaces
  {
    public static T Pop<T>(this IPop<T> c);
    public static T Peek<T>(this IPop<T> c);
    public static bool TryPop<T>(this IPop<T> c, out T value);
    public static bool TryPeek<T>(this IPop<T> c, out T value);
    public static T TryPop<T>(this IPop<T> c);
    public static T TryPop<T>(this IPop<T> c, T defaultValue);
    public static T TryPeek<T>(this IPop<T> c);
    public static T TryPeek<T>(this IPop<T> c, T defaultValue);
  }
  
  public interface IQueue<T> : IPush<T>, IPop<T>, ICount
  {
  }
  public interface IStack<T> : IPush<T>, IPop<T>, ICount
  {
  }
  
  public interface IDeque<T>: ICount, IIsEmpty
  {
    void PushFirst(T item);
    void PushLast(T item);
    T TryPopFirst(out bool isEmpty);
    T TryPeekFirst(out bool isEmpty);
    T TryPopLast(out bool isEmpty);
    T TryPeekLast(out bool isEmpty);

    T First { get; set; }
    T Last { get; set; }
    bool IsEmpty { get; }
  }

  public static partial class LCInterfaces
  {
    public static T PopFirst<T>(this IDeque<T> c)
    public static T PopLast<T>(this IDeque<T> c)
    public static T PeekFirst<T>(this IDeque<T> c)
    public static T PeekLast<T>(this IDeque<T> c)
    public static bool TryPopFirst<T>(this IDeque<T> c, out T value)
    public static bool TryPopLast<T>(this IDeque<T> c, out T value)
    public static bool TryPeekFirst<T>(this IDeque<T> c, out T value)
    public static bool TryPeekLast<T>(this IDeque<T> c, out T value)
    public static T TryPopFirst<T>(this IDeque<T> c)
    public static T TryPopFirst<T>(this IDeque<T> c, T defaultValue)
    public static T TryPopLast<T>(this IDeque<T> c)
    public static T TryPopLast<T>(this IDeque<T> c, T defaultValue)
    public static T TryPeekFirst<T>(this IDeque<T> c)
    public static T TryPeekFirst<T>(this IDeque<T> c, T defaultValue)
    public static T TryPeekLast<T>(this IDeque<T> c)
    public static T TryPeekLast<T>(this IDeque<T> c, T defaultValue)
  }

Currently IQueue and IStack are not being used, but IDeque<T> is implemented by DList<T>. Next up, here are the sink interfaces, which let you put data in but not take it out.

  /// <summary>An interface for depositing items. Includes only an Add(T) method.</summary>
  public interface IAdd<in T>
  {
    void Add(T item);
  }
  /// <summary>Represents a write-only collection: you can modify it, but you
  /// cannot learn what it contains.</summary>
  public interface ISinkCollection<in T> : IAdd<T>
  {
    void Clear();
    bool Remove(T item);
  }
  /// <summary>Represents a write-only array.</summary>
  public interface ISinkArray<in T>
  {
    T this[int index] { set; }
  }
  /// <summary>Represents a write-only indexable list class.</summary>
  public interface ISinkList<in T> : ISinkCollection<T>, ISinkArray<T>
  {
  }

These interfaces are not often useful, but sometimes people actually do write code that only modifies a collection, without reading it. One useful property of these interfaces is contravariance: if a method needs an ISinkCollection<Foo> for example, you can pass it a ISinkCollection<object> instead.

Here's an interface that models built-in arrays:

  public interface IArray<T> : IListSource<T>, ISinkArray<T>
  {
    new T this[int index] { get; set; }
    bool TrySet(int index, T value);
  }
  public interface IAutoSizeArray<T> : IArray<T>
  {
    void Optimize();
  }

But of course, since T[] does not implement IArray, it's not very useful and is only included for completeness. The derived interface is a little more interesting: it represents a list whose size increases automatically when you write to an index beyond the end of the array. If supported, the Optimize() method optimizes the data structure to consume less storage space (if not supported, it has no effect.)

Next, here are some interfaces that add or remove items "in bulk":

  public interface IAddRange<T> : ICount
  {
    void AddRange(IEnumerable<T> e);
    void AddRange(IReadOnlyCollection<T> s);
  }
  public interface IListRangeMethods<T> : IAddRange<T>
  {
    void InsertRange(int index, IEnumerable<T> e);
    void InsertRange(int index, IReadOnlyCollection<T> s);
    void RemoveRange(int index, int amount);
  }
  public static partial class LCInterfaces
  {
    public static void Resize<T>(
      this IListRangeMethods<T> list, int newSize);
  }

It's often more efficient to use a bulk insertion or deletion method rather than calling Insert or RemoveAt in a loop.

Next up, here are some interfaces that are used to work around the dreaded ambiguity errors that are caused by the fact that ICollection<T> is officially unrelated to IReadOnlyCollection<T>, and the fact that IList<T> is officially unrelated to IReadOnlyList<T>:

  public interface ICollectionAndReadOnly<T> 
    : ICollection<T>, IReadOnlyCollection<T> { }
  public interface IListAndListSource<T> : IList<T>, 
    IListSource<T>, ICollectionAndReadOnly<T>  { }

Any collection that implements both ICollection<T> and IReadOnlyCollection<T> should implement ICollectionAndReadOnly<T>, and any collection that implements both IList<T> and IReadOnlyList<T> should go the extra mile and implement IListAndListSource<T>.

By itself, these interfaces don't solve any problems. To recap, if you write these two methods:

  void Foo(IReadOnlyList<T> list) {...}
  void Foo(IList<T> list) { Foo(list.AsListSource()); }

The C# compiler will give an "ambiguity" error when you try to call Foo(list), if the list implements both IList<T> and IReadOnlyList<T>. To work around this problem, it is necessary to define a third method that takes IListAndListSource<T>:

  void Foo(IReadOnlyList<T> list) {...}
  void Foo(IList<T> list) { Foo(list.AsListSource()); }
  void Foo(IListAndListSource<T> list) { Foo((IReadOnlyList<T>) list); }

This workaround only works if the list class implements IListAndListSource<T>, so all mutable collections in Loyc.Essentials and Loyc.Collections do so.

Next, here are some "enhanced" collection interfaces, which just add extra functionality to the standard interfaces. ICollectionEx combines ICollection with AddRange(list) and RemoveAll(lambda), while IListEx combines IList with ICollectionEx, InsertRange(index, list) and RemoveRange(index, count):

  public interface ICollectionEx<T> : ICollectionAndReadOnly<T>, 
      ISinkCollection<T>, IAddRange<T>, IIsEmpty { }
  {
    int RemoveAll(Predicate<T> match);
  }
  public interface IListEx<T> : IListAndListSource<T>, 
    ICollectionEx<T>, IArray<T>, IListRangeMethods<T> { }

Binumerators are enumerators that can move backward through a collection, not just forward. The IBinumerable interface can be implemented by a collection that supports binumerators. Mutable enumerators additionally allow elements of the collection to be modified during enumeration. Mutable enumerators optionally support removing the current element.

public interface IBinumerator<T> : IEnumerator<T>
{
  bool MovePrev();
}

interface IMEnumerator<T> : IEnumerator<T>
{
  /// <summary>Gets or sets the value of the current item.</summary>
  new T Current { get; set; }
  /// <summary>Removes the current item and moves to the next one. Remember
  /// NOT to call MoveNext() immediately after Remove().</summary>
  /// <returns>True if there is a next item after this one, 
  /// false if the removed item was the last one.</returns>
  /// <exception cref="NotSupportedException">The collection does not permit
  /// this operation.</exception>
  bool Remove();
}

interface IMBinumerator<T> : IBinumerator<T>, IMEnumerator<T>
{
}

interface IBinumerable<T>
{
  /// <summary>Returns a binumerator that points before the beginning of 
  /// the current collection.</summary>
  IBinumerator<T> Begin();
  /// <summary>Returns a binumerator that points after the end of the 
  /// current collection.</summary>
  IBinumerator<T> End();
}

Finally, here are the range interfaces, which were inspired by the D programming language. These were described in a much earlier article but I tweaked the interfaces after that article was written. You can see the current documentation in the source code; I won't say anything more about these interfaces here, since this article is getting pretty long already.

  public interface IFRange<out T> : IEnumerable<T>,
    ICloneable<IFRange<T>>, IIsEmpty
  {
    T Front { get; }
    T PopFront(out bool fail);
  }
  public interface IMFRange<T> : IFRange<T>
  {
    new T Front { get; set; }
  }
  public interface IBRange<out T> : IFRange<T>,
    ICloneable<IBRange<T>>
  {
    T Back { get; }
    T PopBack(out bool fail);
  }
  public interface IMBRange<T> : IBRange<T>, IMFRange<T>
  {
    new T Back { get; set; }
  }
  public interface IRange<out T> : IBRange<T>,
    IListSource<T>, ICloneable<IRange<T>>
  {
  }
  public interface IBRangeEx<R, T> : IBRange<T> 
    where R : IBRangeEx<R, T>, ICloneable<R>
  {
    IEnumerable<T> InnerList { get; }
    int SliceStart { get; }
    R Intersect(R other);
    R Union(R other);
  }
  public interface IBRangeEx<T> : IBRangeEx<IBRangeEx<T>, T>, 
    ICloneable<IBRangeEx<T>>
  {
  }
  public interface IRangeEx<R, T> : IRange<T>, 
    IBRangeEx<R, T> where R : IRangeEx<R, T>, ICloneable<R>
  {
  }
  public interface IRangeEx<T> : IRangeEx<IRangeEx<T>, T>, 
    ICloneable<IRangeEx<T>>
  {
  }
  public interface IMRange<T> : IMBRange<T>, IRange<T>
  {
    new T this[int index] { get; set; }
  }

That's it! That's all the general-purpose collection interfaces.

Performance: one flaw I didn't fix

It bothers me that the IEnumerator<T> interface requires two interface calls to retrieve each element of a collection: MoveNext() and Current. That's just a pointless waste of clock cycles. In the past, I defined an interface that fixes that problem by moving to the next item and returning it with a single method call. It was called Iterator<T> and it was a delegate (alternate possible design: a derived interface of IEnumerator<T>):

  public delegate T Iterator<out T>(ref bool fail);

Note that I really would have preferred to use the signature bool Iterator<T>(out T value) instead, but .NET this second version of the method is illegal. Why? Unfortunately, the CLR actually doesn't have the concept of an "out" parameter; to the CLR, "out" and "ref" are the same thing. Therefore, type T would have to be invariant, <out T> would not be allowed. The latter signature can still be provided as an extension method, at the cost of some performance.

I actually wrote a whole implementation of LINQ for Iterator and IIterable<T> (the counterpart to IEnumerable<T>), but then I decided to abandon the concept before I got around to benchmarking the LINQ implementation (early microbenchmarks showed that IIterable was modestly faster when called directly, but not when using the extension method).

Ultimately I decided that I didn't want to burden potential implementors of collection classes by including IIterable<T> in the interface hierarchy. that is, it seemed natural to derive other interfaces such as IListSource<T> from IIterable<, but this placed a substantial burden on my collection class implementations, and might have confused end-users. Obviously, if you're writing a collection class, you want to support things like foreach loops and traditional LINQ, so you will definitely implement IEnumerable<T>; IIterable<T> can only be a bonus, not the central mechanism of enumeration. Then you have the dillema of whether to optimize your code for IEnumerable<T> or IIterable<T>, plus there's the fact that C#'s iterator feature supports only IEnumerator, not Iterator. So I found that the implementation costs for Iterator tended to be high, while few people would appreciate or take advantage of the performance enhancement.

For certain applications that need both high performance and flexibility, another technique for avoiding interface calls makes sense: returning data in groups, such as arrays. Consider this method of ICollection<T> that is almost never used:

  void CopyTo(T[] array, int arrayIndex);

Imagine that method was gone and replaced with this method in IList:

  // copy this[startIndex..startIndex+count] to array[arrayIndex..arrayIndex+count]
  void CopySlice(T[] array, int arrayIndex, int startIndex, int count);

Now you would have a technique for optimizing access when you need it. Rather than requesting the entire list as an array, which requires unbounded memory, you request a section. How can this be used to optimize a program? Well, you can read the list in small blocks, e.g., 50 elements at a time. Then you can use a for-loop like this one:

  for (int i = 0; i < array.Length; i++) {
    /* do something with array[i] */
  }

which avoids the cost of (1) interface calls and (2) bounds checking (the JIT eliminates the bounds check in a loop that is structured this way), in exchange for the need to allocate a small temporary array (which is usually cheap). Obviously, this is not a convenient way to access a list; again, this would only be for the (rare?) situation that calls for high performance, but still requires the flexibility of interfaces (so you can swap in multiple implementations).

For Loyc, I decided to define a specialized interface for lexers:

public interface ICharSource : IListSource<char>
{
  /// <summary>
  /// Returns a substring from the character source. If some of the
  /// requested characters are past the end of the stream, the string
  /// is truncated to the available number of characters.
  /// </summary>
  /// <param name="startIndex">Index of first character to return.
  /// If startIndex >= Count, an empty string is returned.</param>
  /// <param name="length">Number of characters desired.</param>
  new StringSlice Slice(int startIndex, int length);
}

This interface was created to read characters efficiently. Although a lexer could read characters one-at-a-time from IReadOnlyList<char> or IListSource<char>, it requires at least one dynamic interface dispatch for every character (multiple, if the lexer reads the same character more than once). On the other hand, if lexers avoid this overhead by requiring the entire file in the form of a string, it becomes necessary to hold the entire file in memory at once, in a very specific format (one huge string), which may not be ideal.

StringSlice is a structure in Loyc.Essentials that represents a slice of a string (a tuple of (string str, int startIndex, int length)). Slice() is a compromise that allows the lexer to request small pieces of the file that it can read without dynamic dispatch. If the file happens to be stored as a string, Slice() can simply return a slice of that string (no memory allocation required), otherwise it will build a string and return that.

It's unfortunate that .NET treats strings as something completely different than arrays. In a perfect world (or in D), an interface like this one could support not just substrings, but read-only subarrays of any element type, which would be useful any time you want to optimize your code by reducing dynamic dispatch.

By the way: we're doing it wrong.

Defining zillions of interfaces isn't the best solution.

As I said, the possible combinations of operations that data structures may or may not support is just about endless. Some may support adding items but not removing; others may support removing but not adding; some lists may be read-only but have a specialized, fast IndexOf() method (recall that IListSource<T> does not include IndexOf()). It would be crazy to define a separate interface, and invent a separate name, for every possible combination of operations. Inevitably, there are some data structures for which Loyc.Essentials will offer no appropriate interface. But in .NET, that's inevitable.

My favorite thing about the Go language (which, admittedly, I haven't actually used) is the way they do interfaces. In Go, a data structure does not have to explicitly declare in advance every interface it satisfies. Instead, a data structure can satisfy any interface defined by a third party if it provides the necessary operations. If .NET worked like this, then it wouldn't matter that List<T> does not say it implements IReadOnlyList<T>; you would still be allowed to pass it to a method that takes IReadOnlyList<T>.

So, suppose I define an indexable data structure that supports add-at-the-end, but no other modifications. Then I could define an interface that exactly matches my data structure:

interface IAppendableList<T> : IReadOnlyList<T> { void Add(T item); }

class GuestBook : IAppendableList<GuestEntry> { ... }

Later, someone else who is using MyDataStructure<T> in their code could decide to declare a method that accepts IAppendableList<T>:

void Foo(IAppendableList<T> x) { ... }

We can tell that Foo() will probably add items to x, but won't remove items. If .NET supported Go's ability to adapt to new interfaces with almost no extra run-time cost, it would not only accept an argument of type GuestBook, but it would also accept any of the standard collection classes like List<T>. Nice. This feature could neatly fix flaws 2 and 3 above, even if Microsoft itself never defined any read-only or specialized interfaces.

IMO, this is how it should be, and I would point out that by not supporting this feature, it is not practical for .NET as a platform to support languages like Go that do support it (so much for being a "common language runtime"). Now, I assume we can't add this feature directly to C# because it would break existing code, but the feature could still be supported in .NET itself, if Microsoft had the willpower.

Some people argue against this feature, saying that an "interface" is more than just a set of methods but a contract for how those methods behave. And in Loyc.Essentials itself you'll see two interfaces that reflect that idea:

public interface IQueue<T> : IPush<T>, IPop<T>, ICount
{
}
public interface IStack<T> : IPush<T>, IPop<T>, ICount
{
}

The way I defined queues and stacks is that both of them have the same methods--pushing, popping, and Count--but because they behave quite differently, I defined two separate interfaces. So the argument against "Go interfaces" makes sense, but in my opinion their advantages clearly outweigh the disadvantages, and anyway the feature already exists in some languages, so if Microsoft truly wants a many-language platform, they must make features like this possible, and efficient.

Another argument in favor of Go interfaces is that they make it easier to satisfy dependencies. For example, I defined ICharSource inside Loyc.Essentials even though the interface was designed for lexers (which suggests Loyc.Syntax.dll would be a better place for it), because Loyc.Essentials contains StringSlice and I wanted StringSlice to implement ICharSource. Since Loyc.Syntax depends on Loyc.Essentials, the only way this can work is if ICharSource is part of Loyc.Essentials. Luckily I control both DLLs, otherwise StringSlice could not implement ICharSource.

In general, I find it to be a big nuisance that when two DLLs define identical interfaces, with the same name in the same namespace, the CLR considers them completely unrelated and incompatible, and there is no workaround. If two DLLs want to implement the same interface, they must always have a reference to a third DLL that contains the interface, and both DLLs must rely on the same version of that third DLL even if the interface in question is identical across different versions. The Go approach gets you out of versioning hell (at least it would in the .NET context. Whether Go itself has a similar issue, I have no idea). But I digress...

Conclusion.

Loyc.Essentials is a library of "stuff that should be in the BCL, but isn't." Although it was created for the Loyc project, it is meant as a general-purpose library for any developer's .NET toolbox. If there's a general-purpose interface that you think should be part of the BCL, write it in a comment and we'll discuss whether it should be added to Loyc.Essentials.dll.

Some of these interfaces are used heavily in Loyc, while others are not used at all. In future posts I'll introduce the other classes, structures, and handy methods of Loyc.Essentials.dll, and then I'll move on to Loyc.Collections.dll.

Edit: I missed an interesting interface: IEnumeratorFrame<Frame, T>, which goes with the NestedEnumerator<Frame, T> structure, and is used for enumerating tree-like data structures or manually-implemented coroutines. See the code if you're curious, although this interface would work best as part of a built-in compiler feature for stream flattening like Cω had.

Loyc, etc.

Blog Archive

About Me

My web sites

StatCounter

Monday, February 3, 2014

Using Loyc.Essentials: collection interfaces

How Loyc.Essentials improves the situation

The Loyc.Essentials collection interfaces

Performance: one flaw I didn't fix

By the way: we're doing it wrong.

Conclusion.

No comments: