I have moved my blog to GitHub!
The main reason for this is that blogspot has always had a lousy and buggy editor, which forces me to type all posts in HTML; on GitHub I can use markdown instead. Plus, on Blogspot I don't get syntax highlighting. Plus, on GitHub I can add pages easily, I am not limited to blog posts. Supporting comments on GitHub is a slight pain, but doable, and I'm glad to give my blog a fresh new look. Getting Jekyll to run in Windows is a huge (and recurring!) pain though, and its error messages are the worst.
I also wrote a post about how I set up the GitHub site and how I imported everything from blogspot.
Thursday, July 3, 2014
Wednesday, May 14, 2014
2D Convex hull in C#: 40 lines of code
Note: this blog has moved here.
If you want a convex hull and you want it now, you could go get a library like MIConvexHull. That library claims to be high-performance compared to a comparable C++ library, but that claim is implausible, especially for the 2D case, since the algorithm relies heavily on heap memory and dynamic dispatch, accessing all coordinates through an IVertex interface that returns coordinates as double[], and it uses LINQ rather liberally. (Update: Ouellet's Algorithm is a good choice.)
Loyc.Utilities has a much simpler convex hull algorithm in the PointMath class that you might find easier to adapt to your own codebase, and although I haven't benchmarked it, you can plainly see that scanning for the convex hull takes O(n) time and needs only simple math, so that the overall running time will be dominated by the initial sorting step (which takes O(n log n) time). Because of the simple math used in this algorithm, it performs well both on powerful desktop machines and (given integer or fixed-point workloads) on lower-power machines with no FPU.
Edit: There is now a LoycCore package in NuGet that includes Loyc.Essentials.dll which contains data types (e.g. DList) used by ComputeConvexHull(), as well as Loyc.Utilities.dll which contains ComputeConvexHull() itself.
Edit: original version was buggy.
Edit: I found that the algorithm took 11.3% less time for 10 million points if I changed the sorting step as follows:
Copyright: me (David Piepgrass). Do with the code as you wish (originally LGPL).
If you want a convex hull and you want it now, you could go get a library like MIConvexHull. That library claims to be high-performance compared to a comparable C++ library, but that claim is implausible, especially for the 2D case, since the algorithm relies heavily on heap memory and dynamic dispatch, accessing all coordinates through an IVertex interface that returns coordinates as double[], and it uses LINQ rather liberally. (Update: Ouellet's Algorithm is a good choice.)
Loyc.Utilities has a much simpler convex hull algorithm in the PointMath class that you might find easier to adapt to your own codebase, and although I haven't benchmarked it, you can plainly see that scanning for the convex hull takes O(n) time and needs only simple math, so that the overall running time will be dominated by the initial sorting step (which takes O(n log n) time). Because of the simple math used in this algorithm, it performs well both on powerful desktop machines and (given integer or fixed-point workloads) on lower-power machines with no FPU.
/// <summary>Computes the convex hull of a polygon, in clockwise order in a Y-up /// coordinate system (counterclockwise in a Y-down coordinate system).</summary> /// <remarks>Uses the Monotone Chain algorithm, a.k.a. Andrew's Algorithm.</remarks> public static IListSource<Point> ComputeConvexHull(IEnumerable<Point> points) { var list = new List<Point>(points); return ComputeConvexHull(list, true); } public static IListSource<Point> ComputeConvexHull(List<Point> points, bool sortInPlace = false) { if (!sortInPlace) points = new List<Point>(points); points.Sort((a, b) => a.X == b.X ? a.Y.CompareTo(b.Y) : (a.X > b.X ? 1 : -1)); // Importantly, DList provides O(1) insertion at beginning and end DList<Point> hull = new DList<Point>(); int L = 0, U = 0; // size of lower and upper hulls // Builds a hull such that the output polygon starts at the leftmost point. for (int i = points.Count - 1; i >= 0 ; i--) { Point p = points[i], p1; // build lower hull (at end of output list) while (L >= 2 && (p1 = hull.Last).Sub(hull[hull.Count-2]).Cross(p.Sub(p1)) >= 0) { hull.RemoveAt(hull.Count-1); L--; } hull.PushLast(p); L++; // build upper hull (at beginning of output list) while (U >= 2 && (p1 = hull.First).Sub(hull[1]).Cross(p.Sub(p1)) <= 0) { hull.RemoveAt(0); U--; } if (U != 0) // when U=0, share the point added above hull.PushFirst(p); U++; Debug.Assert(U + L == hull.Count + 1); } hull.RemoveAt(hull.Count - 1); return hull; }This code relies on Loyc's DList class which is like List<T> except that it can efficiently add and remove items from the beginning OR end of the list, and it has some extra stuff, such as First and Last which return the first and last item in the list, repectively. This code assumes the existence of a Point struct with any kind of coordinates (int, float, double, etc.) and methods such as Sub to subtract, Cross to compute the cross-product (a.X * b.Y - a.Y * b.X), etc. In Loyc.Utilities, the code is located within a T4 template and is actually replicated for int, float and double coordinates... although currently the code breaks for large integer coordinates since Cross() returns int, which can easily overflow. IListSource was explained in an earlier post and can be replaced with IEnumerable if you prefer.
Edit: There is now a LoycCore package in NuGet that includes Loyc.Essentials.dll which contains data types (e.g. DList) used by ComputeConvexHull(), as well as Loyc.Utilities.dll which contains ComputeConvexHull() itself.
Edit: original version was buggy.
Edit: I found that the algorithm took 11.3% less time for 10 million points if I changed the sorting step as follows:
points.Sort((a, b) => a.X == b.X ? a.Y.CompareTo(b.Y) : a.X.CompareTo(b.X)); // BEFORE a.X == b.X ? a.Y.CompareTo(b.Y) : (a.X > b.X ? 1 : -1)); // AFTERI wrote this little benchmark:
public static void TestConvexHull() { int[] testSizes = new int[] { 12345, 100, 316, 1000, 3160, 10000, 31600, 100000, 316000, 1000000, 3160000, 10000000 }; for (int iter = 0; iter < testSizes.Length; iter++) { Random r = new Random(); List<PointD> points = new List<PointD>(testSizes[iter]); for (int i = 0; i < points.Capacity; i++) { double size = r.NextDouble(); double ang = r.NextDouble() * (Math.PI * 2); points.Add(new PointD(size * Math.Cos(ang), size * Math.Sin(ang))); } // Bonus: test sorting time to learn how much of the time is spent sorting var points2 = new List<PointD>(points); EzStopwatch timer = new EzStopwatch(true); points2.Sort((a, b) => a.X == b.X ? a.Y.CompareTo(b.Y) : (a.X < b.X ? -1 : 1)); int sortTime = timer.Restart(); var output = PointMath.ComputeConvexHull(points, true); int hullTime = timer.Millisec; if (iter == 0) continue; // first iteration primes the JIT/caches Console.WriteLine("Convex hull of {0,8} points took {1} ms ({2} ms for sorting step). Output has {3} points.", testSizes[iter], hullTime, sortTime, output.Count); } }Which produces these results on my laptop:
Convex hull of 100 points took 0 ms (0 ms for sorting step). Output has 12 points. Convex hull of 316 points took 0 ms (0 ms for sorting step). Output has 18 points. Convex hull of 1000 points took 0 ms (0 ms for sorting step). Output has 28 points. Convex hull of 3160 points took 2 ms (1 ms for sorting step). Output has 36 points. Convex hull of 10000 points took 5 ms (3 ms for sorting step). Output has 59 points. Convex hull of 31600 points took 17 ms (10 ms for sorting step). Output has 85 points. Convex hull of 100000 points took 55 ms (37 ms for sorting step). Output has 127 points. Convex hull of 316000 points took 197 ms (122 ms for sorting step). Output has 184 points. Convex hull of 1000000 points took 636 ms (411 ms for sorting step). Output has 286 points. Convex hull of 3160000 points took 2045 ms (1368 ms for sorting step). Output has 379 points. Convex hull of 10000000 points took 6949 ms (4953 ms for sorting step). Output has 563 points.Notice that the sorting step takes the majority of the time, as you would expect since sorting is O(N log N) while the rest of the code is O(N). (Why 316? Logarithmically, it's half way between 100 and 1000.)
Copyright: me (David Piepgrass). Do with the code as you wish (originally LGPL).
Tuesday, April 22, 2014
Open Letter to ???
I have a letter I would like to send to relevant leaders at Google and Mozilla, but I'm not sure to whom I send it. I did send it to Javascript inventor and Mozilla CEO Brendan Eich--just before he stepped down for holding a conservative political belief. Who to try next, I'm not sure; it seems that Mozilla is devoted to the one-language browser (Javascript) while Google supports a single language (Dart, which is "a better Javascript") with support for compiled languages "tacked on" (NaCl). I believe in a third way, but to whom could I send this, who is out there that might agree with me? So, here's the pitch:
Right now there's strong interest in "transpiling" other languages to Javascript, which strikes me as a difficult process that cannot work for all source languages, may have various limitations when it does work, and may produce code that is much slower than code compiled directly from the source language.
I think the interest in transpilers is really a sign that developers want freedom to choose their own programming language for the client side. Devs should not have to use two different codebases for client and server, nor should the web browser dictate the choice of server language. I don't believe in "one language to rule them all" and I am hoping you don't either (indeed, Google and Mozilla are working on multiple programming languages each). However, I do think devs should be allowed to choose one language, and not have the choice dictated "from on high". I want to help create freedom of programming languages on the web, but it's such a major task that I could only do it as part of a Google or Mozilla team.
What I'd like to see is a new VM that is specifically designed to run a wide variety of languages with high performance, in contrast to the Java VM which was designed for one language (and not for high performance), and the CLR which was designed for a few specific languages (C#, VB, C++) but turns out to be unable to handle key features of many other languages. The CLR wouldn't support D or Go very well and probably not Rust either; for example, there are two general-purpose pointer types in the CLR: one type can point to the GC heap and one type can point to non-GC "native" heaps, but you cannot have a pointer that can point "anywhere", which is a feature D requires. So we should create a new VM, which
Thanks for listening. I would love to be a part of a team that creates the VM, but mainly I hope Mozilla or Google will consider moving in this direction. Whether I get to be a part of it or not, the important thing is that somebody makes it happen. Because the software industry needs this.
Sincerely,
David Piepgrass
< Published on CodeProject >
Right now there's strong interest in "transpiling" other languages to Javascript, which strikes me as a difficult process that cannot work for all source languages, may have various limitations when it does work, and may produce code that is much slower than code compiled directly from the source language.
I think the interest in transpilers is really a sign that developers want freedom to choose their own programming language for the client side. Devs should not have to use two different codebases for client and server, nor should the web browser dictate the choice of server language. I don't believe in "one language to rule them all" and I am hoping you don't either (indeed, Google and Mozilla are working on multiple programming languages each). However, I do think devs should be allowed to choose one language, and not have the choice dictated "from on high". I want to help create freedom of programming languages on the web, but it's such a major task that I could only do it as part of a Google or Mozilla team.
What I'd like to see is a new VM that is specifically designed to run a wide variety of languages with high performance, in contrast to the Java VM which was designed for one language (and not for high performance), and the CLR which was designed for a few specific languages (C#, VB, C++) but turns out to be unable to handle key features of many other languages. The CLR wouldn't support D or Go very well and probably not Rust either; for example, there are two general-purpose pointer types in the CLR: one type can point to the GC heap and one type can point to non-GC "native" heaps, but you cannot have a pointer that can point "anywhere", which is a feature D requires. So we should create a new VM, which
- should contain a variety of flexible primitives so that the core features of all known languages can be supported in principle
- should be high-performance, like Google NaCl. Some apps need C-like performance, there's no doubt about it or way around it; the good news is that a single VM can offer high performance and power and ease-of-use all at once.
- should use a bytecode that is designed for optimization (LLVM bitcode seems like a reasonable starting point), but is non-obfuscated (fully decompilable) by default.
- should allow and support garbage collection (but not require GC-based memory management)
- should allow multiple threads or some other parallel processing paradigm
- should be designed to allow code from different languages to communicate easily
- should have a standard library of data types to
- promote interoperability between programming languages, and
- avoid the need to download a large standard library from server to client
Thanks for listening. I would love to be a part of a team that creates the VM, but mainly I hope Mozilla or Google will consider moving in this direction. Whether I get to be a part of it or not, the important thing is that somebody makes it happen. Because the software industry needs this.
Sincerely,
David Piepgrass
< Published on CodeProject >
Saturday, February 22, 2014
# in operator names to be removed
Originally I decided that in Loyc trees, '
Operators seemed special, so I also decided that operators would have the same prefix:
Therefore, I will soon dedicate a day or two to removing the hash characters from the beginning of all operators in both LES and EC#. Possibly, I'll remember to update the documentation too. We'll see.
#
'
would be a prefix that meant "special" or "keyword". So reserved words
or language constructs would be represented as identifiers that start
with '#
', e.g. #class
, #enum
, #for
, #if
, #while
, etc.Operators seemed special, so I also decided that operators would have the same prefix:
#+
, #==
, #>>
, etc. I've changed my mind, though, for two reasons:
- I no longer think operators are so special. LES and some other languages have an infinite number of potential operators and they're treated identically to normal functions, apart from their syntax.
- No language I know of uses the specific prefix '
#
' - Most importantly, the syntax for naming operators turned out to be cumbersome: because an operator name like
#+
generally cannot be used directly as an identifier in most languages, some sort of escaping mechanism is needed. In LES and EC# you have to write, in general,@`special identifier`
(not just`special identifier`
because the latter is treated as an operator, not an identifier). Therefore the identifier#+
is written@`#+`
(@#+
also works in LES), which is a bit cumbersome and (more importantly IMO) creates a teaching challenge: I have to explain both what#
is for and also what@
and the backticks are for, and the student must remember not to mix up@
and#
. Plus, when someone encounters LES or EC# code for the first time it'll be harder to guess what@`#+`
means compared to@`+`
, and Google won't help ("Your search - @`#+` - did not match any documents.")
operator+
, Haskell (+)
, and Nemerle @+
. In all of these cases the additional characters near +
can be understood as an escaping mechanism, so that the actual identifier name is simply +
.Therefore, I will soon dedicate a day or two to removing the hash characters from the beginning of all operators in both LES and EC#. Possibly, I'll remember to update the documentation too. We'll see.
Monday, February 3, 2014
Using Loyc.Essentials: collection interfaces
Loyc.Essentials is a library of "stuff that should be in the .NET BCL, but isn't." Today I will present one of the most important parts of Loyc.Essentials: its collection interfaces.
The .NET 4 version of Loyc.Essentials defines the .NET 4.5 interfaces IReadOnlyCollection<T>, IReadOnlyList<T> and IReadOnlyDictionary<T> (handy if you are still targeting Windows XP, which does not support .NET 4.5), and on top of that defines numerous additional interfaces for queues, stacks, slices, bidirectional enumerators, ranges, sinks, and more. (The new .NET 3.5 version relies on Theraot.Core instead to provide these interfaces and other features of .NET 4/4.5.) Several of these interfaces are widely used throughout Loyc.Essentials itself and other projects such as Loyc.Collections and LLLPG.
Programmers with less experience, or that don't work in areas where performance is important, may underestimate the importance of data structures and collection interfaces in software engineering. I have come to believe that collection interfaces are very important, and that the built-in collection interfaces of the .NET framework are somewhat impoverished and inflexible. I like to work with a variety of data structures and I like to have a variety of collection interfaces available to reflect that variety. There's so much more out there than just List<T>: linked lists, B+ trees, VLists, ALists, Hash Array Mapped Tries, Bloom filters, other search trees and tries... the possibilities are endless.
First, let me talk a little about the limitations that the current BCL interfaces suffer from.
The .NET framework does not support much diversity. Traditionally it defined just four collection interfaces: IEnumerable<T>, ICollection<T>, IList<T>, and IDictionary<T>, and all but one were mutable.
Flaw #1: IList<T>/ICollection<T> contain some methods that are almost never used, like CopyTo(), but lack methods that are often desired, like AddRange() and RemoveRange(), and they lack methods that would be often used if they existed, like Slice().
Flaw #2: Lack of support for read-only interfaces. In .NET 4.5, they finally introduced IReadOnlyCollection<T>, IReadOnlyList<T> and IReadOnlyDictionary<T>, but they didn't derive IList<T> from IReadOnlyList<T> and they didn't declare that the core collection classes actually implement these interfaces. Plus, "IReadOnlyCollection" is an terribly long name for simply adding a Count property to IEnumerable (Loyc.Essentials used to define exactly the same interface, but it was called IEnumerableCount<T>).
Ideologically, I'm opposed to the idea that a mutable interface such as IList<T> should not implement IReadOnlyList<T>. If a method takes an argument of type IReadOnlyList<T>, it should mean that the method will not modify the list; it should not mean that the caller has to create a special read-only wrapper object on the heap, and it certainly should not mean that the caller has to guarantee that the list will never change. IReadOnlyList<T> should not be a burden or difficulty that the caller must overcome, rather it should simply be a promise that the called method will not modify the list. That leads me to flaw #3...
Flaw #3: because IReadOnlyList<T> and IList<T> are officially unrelated, and IReadOnlyList<T> is not supported very well, if a data structure implements both interfaces, C# "ambiguity" errors are inevitable. Here's how it typically happens: you want to declare a method that takes a list but doesn't modify it:
Other collection libraries outside .NET have recognized this diversity, and provide more concepts or interfaces to address it, such as the C++ STL's various iterator and container concepts, or D's ranges.
Suppose I write a method that reads a list starting at the outside and moving inward (e.g. first element, last element, second element, second-last element, etc.). A linked list can certainly support this order of traversal; unfortunately, in .NET, the only standard collection interface that can be used in this situation is IList<T>, which a linked list data structure cannot support efficiently. Thanks to .NET's simplistic interfaces, my method can either support standard linked lists, or normal lists, but not both at once.
This is the central problem that the .NET interfaces fail to solve. When you write a method or a class that operates on a collection, you cannot declare precisely what kind of collection you accept, so most of the time you have to mandate functionality you don't need. Some collection types will not be able to provide that functionality that you mandated but didn't need, so they won't implement the interface and will be incompatible, unnecessarily. Meanwhile, other collection types might implement the interface even though they are not compatible, and then throw an exception when you try to use unsupported functionality.
Interfaces are cheap to define: they are only a few lines of code plus some documentation, so it's well worth the effort of defining a variety of them, even if no implementations are provided (too bad Microsoft didn't see it that way). Before I get to the actual interfaces, here's a list of the key concepts that are supported:
First, an interface you won't use: ICount:
My single most favorite interface in Loyc.Essentials is IListSource<T>. As its name implies, it is a kind of source; it's basically IReadOnlyCollection<T> with extra functionality:
I don't know about you, but in my code I often have to check whether the index is in range before I call the indexer:
The second extension method uses ref T value to allow the caller to set a default T value before calling the method, since you will not always want to use default(T) as the default. If the index is invalid, value is left unchanged.
Meanwhile, the Slice() method returns a sub-section of the list without allocating a new list. For example, list.Slice(10, 10) gets the collection of 10 items starting at index 10. IRange<T> is a superset of IListSource<T>; see definition below. Don't worry, if you're creating a collection class, you never actually have to write a class that implements IRange<T>. Just use this implementation:
Next up, here are the "neg lists". These are list interfaces that do not (necessarily) use zero as the minimum index. So far, I haven't used these interfaces in practice.
Next there's a higher-performance variation on INotifyCollectionChanged, which can only be implemented by collections that implement IListSource<T>:
Admittedly, I have found that sometimes an event handler would really prefer to see the list after it changes rather than before. But if performance matters, this interface may be better. You be the judge.
Next I present this peculiar bunch of interfaces, which reflects the difficulty I had designing an interface for immutable sets:
Finally, I created an InvertibleSet<T> class, which can represent everything that is not in some other set; to account for this possibility, I added the IsInverted property to ISetImm.
Sparse lists are indexed lists in which regions of indexes may be unused. Here are the sparse list interfaces I defined:
The motivating example that led me to create these interfaces is the concept of anchor points in a text editor, such as the locations of error messages, breakpoints, and syntax highlighting markers. Clearly, this data is sparse: the majority of characters do not have an associated error message or color change, so ideally we should store this information in some sort of "compressed" data structure. As the user types new text in the middle of the file, all of these associated locations that appear in the second half should be shifted; that's what a sparse list does. Loyc.Collections offers one implementation if ISparseList<T>, called SparseAList<T>. It is not always a fast data structure, but it scales up effectively to very large files.
Next up, here are the IQueue, IStack and IDeque interfaces, with associated extension methods for added convenience, which are mostly self-explanatory if you know what queues and stacks are:
Here's an interface that models built-in arrays:
Next, here are some interfaces that add or remove items "in bulk":
Next up, here are some interfaces that are used to work around the dreaded ambiguity errors that are caused by the fact that ICollection<T> is officially unrelated to IReadOnlyCollection<T>, and the fact that IList<T> is officially unrelated to IReadOnlyList<T>:
By itself, these interfaces don't solve any problems. To recap, if you write these two methods:
Next, here are some "enhanced" collection interfaces, which just add extra functionality to the standard interfaces. ICollectionEx combines ICollection with AddRange(list) and RemoveAll(lambda), while IListEx combines IList with ICollectionEx, InsertRange(index, list) and RemoveRange(index, count):
I actually wrote a whole implementation of LINQ for Iterator and IIterable<T> (the counterpart to IEnumerable<T>), but then I decided to abandon the concept before I got around to benchmarking the LINQ implementation (early microbenchmarks showed that IIterable was modestly faster when called directly, but not when using the extension method).
Ultimately I decided that I didn't want to burden potential implementors of collection classes by including IIterable<T> in the interface hierarchy. that is, it seemed natural to derive other interfaces such as IListSource<T> from IIterable<, but this placed a substantial burden on my collection class implementations, and might have confused end-users. Obviously, if you're writing a collection class, you want to support things like foreach loops and traditional LINQ, so you will definitely implement IEnumerable<T>; IIterable<T> can only be a bonus, not the central mechanism of enumeration. Then you have the dillema of whether to optimize your code for IEnumerable<T> or IIterable<T>, plus there's the fact that C#'s iterator feature supports only IEnumerator, not Iterator. So I found that the implementation costs for Iterator tended to be high, while few people would appreciate or take advantage of the performance enhancement.
For certain applications that need both high performance and flexibility, another technique for avoiding interface calls makes sense: returning data in groups, such as arrays. Consider this method of ICollection<T> that is almost never used:
For Loyc, I decided to define a specialized interface for lexers:
StringSlice is a structure in Loyc.Essentials that represents a slice of a string (a tuple of (string str, int startIndex, int length)). Slice() is a compromise that allows the lexer to request small pieces of the file that it can read without dynamic dispatch. If the file happens to be stored as a string, Slice() can simply return a slice of that string (no memory allocation required), otherwise it will build a string and return that.
It's unfortunate that .NET treats strings as something completely different than arrays. In a perfect world (or in D), an interface like this one could support not just substrings, but read-only subarrays of any element type, which would be useful any time you want to optimize your code by reducing dynamic dispatch.
As I said, the possible combinations of operations that data structures may or may not support is just about endless. Some may support adding items but not removing; others may support removing but not adding; some lists may be read-only but have a specialized, fast IndexOf() method (recall that IListSource<T> does not include IndexOf()). It would be crazy to define a separate interface, and invent a separate name, for every possible combination of operations. Inevitably, there are some data structures for which Loyc.Essentials will offer no appropriate interface. But in .NET, that's inevitable.
My favorite thing about the Go language (which, admittedly, I haven't actually used) is the way they do interfaces. In Go, a data structure does not have to explicitly declare in advance every interface it satisfies. Instead, a data structure can satisfy any interface defined by a third party if it provides the necessary operations. If .NET worked like this, then it wouldn't matter that List<T> does not say it implements IReadOnlyList<T>; you would still be allowed to pass it to a method that takes IReadOnlyList<T>.
So, suppose I define an indexable data structure that supports add-at-the-end, but no other modifications. Then I could define an interface that exactly matches my data structure:
IMO, this is how it should be, and I would point out that by not supporting this feature, it is not practical for .NET as a platform to support languages like Go that do support it (so much for being a "common language runtime"). Now, I assume we can't add this feature directly to C# because it would break existing code, but the feature could still be supported in .NET itself, if Microsoft had the willpower.
Some people argue against this feature, saying that an "interface" is more than just a set of methods but a contract for how those methods behave. And in Loyc.Essentials itself you'll see two interfaces that reflect that idea:
Another argument in favor of Go interfaces is that they make it easier to satisfy dependencies. For example, I defined ICharSource inside Loyc.Essentials even though the interface was designed for lexers (which suggests Loyc.Syntax.dll would be a better place for it), because Loyc.Essentials contains StringSlice and I wanted StringSlice to implement ICharSource. Since Loyc.Syntax depends on Loyc.Essentials, the only way this can work is if ICharSource is part of Loyc.Essentials. Luckily I control both DLLs, otherwise StringSlice could not implement ICharSource.
In general, I find it to be a big nuisance that when two DLLs define identical interfaces, with the same name in the same namespace, the CLR considers them completely unrelated and incompatible, and there is no workaround. If two DLLs want to implement the same interface, they must always have a reference to a third DLL that contains the interface, and both DLLs must rely on the same version of that third DLL even if the interface in question is identical across different versions. The Go approach gets you out of versioning hell (at least it would in the .NET context. Whether Go itself has a similar issue, I have no idea). But I digress...
Some of these interfaces are used heavily in Loyc, while others are not used at all. In future posts I'll introduce the other classes, structures, and handy methods of Loyc.Essentials.dll, and then I'll move on to Loyc.Collections.dll.
Edit: I missed an interesting interface: IEnumeratorFrame<Frame, T>, which goes with the NestedEnumerator<Frame, T> structure, and is used for enumerating tree-like data structures or manually-implemented coroutines. See the code if you're curious, although this interface would work best as part of a built-in compiler feature for stream flattening like Cω had.
The .NET 4 version of Loyc.Essentials defines the .NET 4.5 interfaces IReadOnlyCollection<T>, IReadOnlyList<T> and IReadOnlyDictionary<T> (handy if you are still targeting Windows XP, which does not support .NET 4.5), and on top of that defines numerous additional interfaces for queues, stacks, slices, bidirectional enumerators, ranges, sinks, and more. (The new .NET 3.5 version relies on Theraot.Core instead to provide these interfaces and other features of .NET 4/4.5.) Several of these interfaces are widely used throughout Loyc.Essentials itself and other projects such as Loyc.Collections and LLLPG.
Programmers with less experience, or that don't work in areas where performance is important, may underestimate the importance of data structures and collection interfaces in software engineering. I have come to believe that collection interfaces are very important, and that the built-in collection interfaces of the .NET framework are somewhat impoverished and inflexible. I like to work with a variety of data structures and I like to have a variety of collection interfaces available to reflect that variety. There's so much more out there than just List<T>: linked lists, B+ trees, VLists, ALists, Hash Array Mapped Tries, Bloom filters, other search trees and tries... the possibilities are endless.
First, let me talk a little about the limitations that the current BCL interfaces suffer from.
The .NET framework does not support much diversity. Traditionally it defined just four collection interfaces: IEnumerable<T>, ICollection<T>, IList<T>, and IDictionary<T>, and all but one were mutable.
Flaw #1: IList<T>/ICollection<T> contain some methods that are almost never used, like CopyTo(), but lack methods that are often desired, like AddRange() and RemoveRange(), and they lack methods that would be often used if they existed, like Slice().
Flaw #2: Lack of support for read-only interfaces. In .NET 4.5, they finally introduced IReadOnlyCollection<T>, IReadOnlyList<T> and IReadOnlyDictionary<T>, but they didn't derive IList<T> from IReadOnlyList<T> and they didn't declare that the core collection classes actually implement these interfaces. Plus, "IReadOnlyCollection" is an terribly long name for simply adding a Count property to IEnumerable (Loyc.Essentials used to define exactly the same interface, but it was called IEnumerableCount<T>).
Ideologically, I'm opposed to the idea that a mutable interface such as IList<T> should not implement IReadOnlyList<T>. If a method takes an argument of type IReadOnlyList<T>, it should mean that the method will not modify the list; it should not mean that the caller has to create a special read-only wrapper object on the heap, and it certainly should not mean that the caller has to guarantee that the list will never change. IReadOnlyList<T> should not be a burden or difficulty that the caller must overcome, rather it should simply be a promise that the called method will not modify the list. That leads me to flaw #3...
Flaw #3: because IReadOnlyList<T> and IList<T> are officially unrelated, and IReadOnlyList<T> is not supported very well, if a data structure implements both interfaces, C# "ambiguity" errors are inevitable. Here's how it typically happens: you want to declare a method that takes a list but doesn't modify it:
void Foo(IReadOnlyList<T> list) {...}But you notice that most collection classes don't implement IReadOnlyList<T>, so you add an overload that automatically converts IList<T> to IReadOnlyList<T> with an extension method (e.g. AsListSource() in the Loyc.Collections namespace):
// Alias for Foo(IReadOnlyList<T>) void Foo(IList<T> list) { Foo(list.AsListSource()); }So far, so good. But as new data structures are implemented, they will naturally implement IReadOnlyList<T>. If they are mutable, they will want to implement IReadOnlyList<T> so that it will be possible to pass them to methods that accept IReadOnlyList; meanwhile, immutable data structures might choose to explicitly implement IList<T> in order to stay compatible with the large amount of older code based on IList<T> for read-only access. So what's the problem? Suppose I have a data structure MyList<T> that implements both interfaces. Look what happens when I call Foo:
Foo(new MyList<T>()); // ERROR: The call is ambiguous // between the following methods or properties: // '...Foo(System.Collections.Generic.IList<T>)' and // '...Foo(System.Collections.Generic.IReadOnlyList<T>)'Flaw #4: A lack of variety and flexibility. If you actually survey the various data structures that exist, you'll find a wide variety of capabilities that they may or may not offer.
- A data structure might support an indexing operator but not a Count property (consider the Sieve of Eratosthenes of infinite size). Other data structures may support a Count property in theory, but calling it should be avoided because it may be expensive (in Loyc.Essentials, BufferedSequence<T> has this property).
- A data structures may be queue- or stack-like, offering access to the beginning/end of a data structure but not the middle.
- A data structure might not be zero-based, e.g. the range of valid indexes could be 1..100 or -100..100 or 35..39.
- A data structure may be sparse, in which case scanning it for items with a for-loop from 0..Count is inefficient.
- A data structure may be a sink, allowing insertion of data but not reading (write-only files and pipes may be viewed this way).
- Most list data structures could allow enumeration to start at any point in the middle, but .NET ignores this possibility; in addition, indexing is less efficient than enumeration for some lists (e.g. AList), so this should be supported.
- A data structures may not support access by integer index, but could still allow bi-directional enumeration moving both forward and backward), e.g. a doubly-linked list, with insertion locations denoted by enumerator positions.
- Many data structures can easily tell you if they are empty, but require O(N) time to report their Count. .NET almost forces you to use Count == 0 to detect emptiness; you can use LINQ's Any() function instead, but this requires a heap allocation and two interface calls.
Other collection libraries outside .NET have recognized this diversity, and provide more concepts or interfaces to address it, such as the C++ STL's various iterator and container concepts, or D's ranges.
Suppose I write a method that reads a list starting at the outside and moving inward (e.g. first element, last element, second element, second-last element, etc.). A linked list can certainly support this order of traversal; unfortunately, in .NET, the only standard collection interface that can be used in this situation is IList<T>, which a linked list data structure cannot support efficiently. Thanks to .NET's simplistic interfaces, my method can either support standard linked lists, or normal lists, but not both at once.
This is the central problem that the .NET interfaces fail to solve. When you write a method or a class that operates on a collection, you cannot declare precisely what kind of collection you accept, so most of the time you have to mandate functionality you don't need. Some collection types will not be able to provide that functionality that you mandated but didn't need, so they won't implement the interface and will be incompatible, unnecessarily. Meanwhile, other collection types might implement the interface even though they are not compatible, and then throw an exception when you try to use unsupported functionality.
How Loyc.Essentials improves the situation
Loyc.Essentials improves the situation mainly by providing a much wider variety of interfaces. The new interfaces mostly address flaw #4, but can only partly alleviate the pain of flaws 1, 2 and 3.Interfaces are cheap to define: they are only a few lines of code plus some documentation, so it's well worth the effort of defining a variety of them, even if no implementations are provided (too bad Microsoft didn't see it that way). Before I get to the actual interfaces, here's a list of the key concepts that are supported:
- Sources: objects that provide data. Source interfaces are read-only.
- Sinks: objects that accept data. Sink interfaces are write-only.
- Slices: subsections of an indexed list, e.g. elements 5..10 of a larger list. The broad class of Divide-and-conquer algorithms benefit from these.
- Ranges: Similar to slices, ranges represent a subsection of a collection, but they may only allow access to the first, or first and last, elements of that subsection. For performance reasons, ranges are mutable in the sense that you can shrink the range one element at a time. This does not modify the underlying collection, it just causes the range to point to a smaller area in the same collection. Ranges were popularized by D, and while they don't work as well in .NET, I thought that the interfaces were still worth providing. Here is a long-ish article about my design of Ranges in .NET.
- Fancy enumerators: after designing the range interfaces, I realized that often all you need is an enumerator that can go backward. Binumerators can travel both backward and forward through a collection, and mutable variants allow the current item to be changed or deleted.
- Specific categories of data structures: queues, arrays, and sparse lists.
- "Neg" non-zero-based lists: indexed lists for which the minimum index is not necessarily zero.
The Loyc.Essentials collection interfaces
The full documentation of these interfaces can be seen in the source code and will be provided automatically by Visual Studio Intellisense, as long as you have the Loyc.Essentials.xml file alongside your copy of Loyc.Essentials.dll.First, an interface you won't use: ICount:
public interface ICount : IIsEmpty { int Count { get; } }I thought the concept of "having a count" should be separated because some collections may have a count but do not allow enumeration or vice versa. For example, I defined IStack<T> as a collection that had a Count but did not allow enumeration. However, Microsoft made ICount unusable in .NET 4.5 by defining IReadOnlyCollection:
public interface IReadOnlyCollectionHow does this new interface make it impossible to use ICount? Well, if I define any interface that inherits from both IReadOnlyCollection and ICount, it becomes impossible (well, quite difficult) to call Count on that interface because the C# compiler says that the reference to Count is "ambiguous". The end result is that people won't bother using ICount at all (admittedly, the number of cases where somebody needs the Count of a collection, and nothing else, is very small). For completeness, I recently added IIsEmpty, because as I mentioned, IsEmpty can run much faster than Count in some data structures:: IEnumerable { int Count { get; } }
public interface IIsEmpty { bool IsEmpty { get; } }However, I decided not to mandate this property as part of the common Loyc.Essentials interface IListSource<T>, although it is a part of IFRange<T> (listed near the bottom of this article).
My single most favorite interface in Loyc.Essentials is IListSource<T>. As its name implies, it is a kind of source; it's basically IReadOnlyCollection<T> with extra functionality:
public interface IListSource<out T> : IReadOnlyList<T> { // If index is invalid, sets fail=true and returns default(T) T TryGet(int index, out bool fail); IRange<T> Slice(int start, int count = int.MaxValue); }Both of these new functions are favorites of mine, although it's usually easier to call one of the standard TryGet() extension methods (especially the first one):
public static partial class LCInterfaces { public static T TryGet<T>(this IListSource<T> list, int index, T defaultValue); public static bool TryGet<T>(this IListSource<T> list, int index, ref T value); }Plus, a couple of standard methods of IList<T> are added as extension methods:
public static partial class LCInterfaces { public static int IndexOf<And there are other extension methods that will be covered in a future article.(this IReadOnlyList<T> list, T item) public static void CopyTo<T>(this IReadOnlyList<T> c, T[] array, int arrayIndex) }
I don't know about you, but in my code I often have to check whether the index is in range before I call the indexer:
if (index < list.Count && list[index] >= 0) {...}With IListSource this is easier:
if (list.TryGet(index, -1) >= 0) {...}In theory, this version should be faster too, because it needs only one interface dispatch, not two. That's not the only reason for TryGet() to exist, though; there are a few collection types for which it is expensive to call Count.
The second extension method uses ref T value to allow the caller to set a default T value before calling the method, since you will not always want to use default(T) as the default. If the index is invalid, value is left unchanged.
Meanwhile, the Slice() method returns a sub-section of the list without allocating a new list. For example, list.Slice(10, 10) gets the collection of 10 items starting at index 10. IRange<T> is a superset of IListSource<T>; see definition below. Don't worry, if you're creating a collection class, you never actually have to write a class that implements IRange<T>. Just use this implementation:
public IRange<T> Slice(int start, int count) { return new Slice_<T>(this, start, count); }or the potentially more efficient version,
IRange<T> IListSource<T>.Slice(int start, int count) { return Slice(start, count); } public Slice_<T> Slice(int start, int count) { return new Slice_<T>(this, start, count); }Slice_ is a struct in Loyc.Essentials that provides a "view" on part of a read-only list. Why the underscore? I wanted to simply call it "Slice", but that is illegal in C# because Slice_ itself implements IListSource<T>, so it contains a method named Slice, and a method is not allowed to have the same name as its containing class. By the way, there's also a ListSlice<T> class and Slice() extension method for slicing IList<T>.
Next up, here are the "neg lists". These are list interfaces that do not (necessarily) use zero as the minimum index. So far, I haven't used these interfaces in practice.
public interface INegListSource<T> : IReadOnlyCollection<T> { int Min { get; } int Max { get; } T this[int index] { get; } T TryGet(int index, out bool fail); IRange<T> Slice(int start, int count = int.MaxValue); } public interface INegArrayHere's documentation, if you're interested. INegAutoSizeArray is interesting: it automatically enlarges itself when you write to an index below Min or above Max.: INegListSource { new T this[int index] { set; get; } bool TrySet(int index, T value); } public interface INegAutoSizeArray<T> : INegArray<T> { void Optimize(); } public interface INegDeque<T> : INegArray<T>, IDeque<T> { }
Next there's a higher-performance variation on INotifyCollectionChanged, which can only be implemented by collections that implement IListSource<T>:
public interface INotifyListChanging<T> { event ListChangingHandler<T> ListChanging; } public delegate void ListChangingHandler<T>( IListSource<T> sender, ListChangeInfo<T> args); [Serializable] public class ListChangeInfo<T> : EventArgs { public ListChangeInfo(NotifyCollectionChangedAction action, int index, int sizeChange, IListSource<T> newItems); public readonly NotifyCollectionChangedAction Action; public readonly int Index; public readonly int SizeChange; public readonly IListSource<T> NewItems; }Why is this faster? Because unlike INotifyCollectionChanged, this interface requires as little as one memory allocation each time the list changes (for the ListChangeInfo object). INotifyCollectionChanged requires the collection to allocate a list of OldItems and NewItems; but because ListChanging fires before the list changes rather than afterward, there is no need to allocate a list of old items--the event handler can simply look at the current state of the list. Meanwhile, it is possible to optimize NewItems not to require an allocation.
Admittedly, I have found that sometimes an event handler would really prefer to see the list after it changes rather than before. But if performance matters, this interface may be better. You be the judge.
Next I present this peculiar bunch of interfaces, which reflects the difficulty I had designing an interface for immutable sets:
public interface ISetTests<SetT> { bool IsProperSubsetOf(SetT other); bool IsProperSupersetOf(SetT other); bool IsSubsetOf(SetT other); bool IsSupersetOf(SetT other); bool Overlaps(SetT other); bool SetEquals(SetT other); } public interface ISetOperations<in T, SetT> : ISetOperations<T, SetT, SetT> { } public interface ISetOperations<in T, in InSetT, out OutSetT> { OutSetT With(T item); OutSetT Without(T item); OutSetT Union(InSetT other); OutSetT Intersect(InSetT other); OutSetT Except(InSetT other); OutSetT Xor(InSetT other); } public interface ISetImm<T, SetT> : ISetOperations<T, SetT>, ISetTests<SetT>, IReadOnlyCollection<T> //, ICount { bool IsInverted { get; } } public interface ISetImm<T> : ISetOperations<T, IEnumerable<T>>, ISetTests<IEnumerable<T>>, IReadOnlyCollection<T> { }The most important interface is at the end: ISetImm<T>. If you work through it logically, you can see that ISetImm<T> contains the following members:
public interface ISetImm<T> : ... { bool IsProperSubsetOf(IEnumerable<T> other); bool IsProperSupersetOf(IEnumerable<T> other); bool IsSubsetOf(IEnumerable<T> other); bool IsSupersetOf(IEnumerable<T> other); bool Overlaps(IEnumerable<T> other); bool SetEquals(IEnumerable<T> other); ISetImm<T> With(T item); ISetImm<T> Without(T item); ISetImm<T> Union(IEnumerable<T> other); ISetImm<T> Intersect(IEnumerable<T> other); ISetImm<T> Except(IEnumerable<T> other); ISetImm<T> Xor(IEnumerable<T> other); bool IsInverted { get; } }.NET already defines a set interface for mutable sets, ISet<T>, but immutable sets are more convenient to work with so I added this interface. In Loyc.Collections.dll, I implemented a very nice immutable set class called Set<T>, which is a kind of Hash Tree. Originally my Set<T> structure implemented only ISetImm<T, Set<T>>, because it was easier, but eventually I also implemented the more generic ISetImm<T> interface. I also made a mutable variant of Set<T>, the MSet<T> class, which implements both ISet<T> and ISetImm<T>.
Finally, I created an InvertibleSet<T> class, which can represent everything that is not in some other set; to account for this possibility, I added the IsInverted property to ISetImm.
Sparse lists are indexed lists in which regions of indexes may be unused. Here are the sparse list interfaces I defined:
public interface ISparseListSource<T> : IListSource<T> { int? NextHigher(int index); int? NextLower(int index); bool IsSet(int index); } public partial class LCInterfaces { public static IEnumerable<KeyValuePair<int, T>> Items<T>( this ISparseListSource<T> list); } public interface ISparseList<T> : ISparseListSource<T>, IListAndListSource<T> { void Clear(int index, int count = 1); void InsertSpace(int index, int count = 1); } public interface ISparseListEx<T> : ISparseList<T>, IListEx<T>, IAutoSizeArray<T> { void InsertRange(ISparseListSource<T> list); }ISparseListSource is a read-only sparse list, while ISparseList is mutable and ISparseListEx additionally allows you to insert the contents of one entire sparse list into another. According to the definitions shown here, an ISparseList<T> is fundamentally different from an IDictionary<int, T> because when you insert or remove an item from a sparse list at index i, the index associated with each and every item above index i also changes. A Dictionary can't do that, while a good sparse list implementation can do it efficiently.
The motivating example that led me to create these interfaces is the concept of anchor points in a text editor, such as the locations of error messages, breakpoints, and syntax highlighting markers. Clearly, this data is sparse: the majority of characters do not have an associated error message or color change, so ideally we should store this information in some sort of "compressed" data structure. As the user types new text in the middle of the file, all of these associated locations that appear in the second half should be shifted; that's what a sparse list does. Loyc.Collections offers one implementation if ISparseList<T>, called SparseAList<T>. It is not always a fast data structure, but it scales up effectively to very large files.
Next up, here are the IQueue, IStack and IDeque interfaces, with associated extension methods for added convenience, which are mostly self-explanatory if you know what queues and stacks are:
public interface IPush<in T> { void Push(T item); } public interface IPop<out T> { T TryPop(out bool isEmpty); T TryPeek(out bool isEmpty); bool IsEmpty { get; } } public static partial class LCInterfaces { public static T Pop<T>(this IPop<T> c); public static T Peek<T>(this IPop<T> c); public static bool TryPop<T>(this IPop<T> c, out T value); public static bool TryPeek<T>(this IPop<T> c, out T value); public static T TryPop<T>(this IPop<T> c); public static T TryPop<T>(this IPop<T> c, T defaultValue); public static T TryPeek<T>(this IPop<T> c); public static T TryPeek<T>(this IPop<T> c, T defaultValue); } public interface IQueue<T> : IPush<T>, IPop<T>, ICount { } public interface IStack<T> : IPush<T>, IPop<T>, ICount { } public interface IDeque<T>: ICount, IIsEmpty { void PushFirst(T item); void PushLast(T item); T TryPopFirst(out bool isEmpty); T TryPeekFirst(out bool isEmpty); T TryPopLast(out bool isEmpty); T TryPeekLast(out bool isEmpty); T First { get; set; } T Last { get; set; } bool IsEmpty { get; } } public static partial class LCInterfaces { public static T PopFirst<T>(this IDeque<T> c) public static T PopLast<T>(this IDeque<T> c) public static T PeekFirst<T>(this IDeque<T> c) public static T PeekLast<T>(this IDeque<T> c) public static bool TryPopFirst<T>(this IDeque<T> c, out T value) public static bool TryPopLast<T>(this IDeque<T> c, out T value) public static bool TryPeekFirst<T>(this IDeque<T> c, out T value) public static bool TryPeekLast<T>(this IDeque<T> c, out T value) public static T TryPopFirst<T>(this IDeque<T> c) public static T TryPopFirst<T>(this IDeque<T> c, T defaultValue) public static T TryPopLast<T>(this IDeque<T> c) public static T TryPopLast<T>(this IDeque<T> c, T defaultValue) public static T TryPeekFirst<T>(this IDeque<T> c) public static T TryPeekFirst<T>(this IDeque<T> c, T defaultValue) public static T TryPeekLast<T>(this IDeque<T> c) public static T TryPeekLast<T>(this IDeque<T> c, T defaultValue) }Currently IQueue and IStack are not being used, but IDeque<T> is implemented by DList<T>. Next up, here are the sink interfaces, which let you put data in but not take it out.
/// <summary>An interface for depositing items. Includes only an Add(T) method.</summary> public interface IAdd<in T> { void Add(T item); } /// <summary>Represents a write-only collection: you can modify it, but you /// cannot learn what it contains.</summary> public interface ISinkCollection<in T> : IAdd<T> { void Clear(); bool Remove(T item); } /// <summary>Represents a write-only array.</summary> public interface ISinkArray<in T> { T this[int index] { set; } } /// <summary>Represents a write-only indexable list class.</summary> public interface ISinkList<in T> : ISinkCollection<T>, ISinkArray<T> { }These interfaces are not often useful, but sometimes people actually do write code that only modifies a collection, without reading it. One useful property of these interfaces is contravariance: if a method needs an ISinkCollection<Foo> for example, you can pass it a ISinkCollection<object> instead.
Here's an interface that models built-in arrays:
public interface IArray<T> : IListSource<T>, ISinkArray<T> { new T this[int index] { get; set; } bool TrySet(int index, T value); } public interface IAutoSizeArray<T> : IArray<T> { void Optimize(); }But of course, since T[] does not implement IArray, it's not very useful and is only included for completeness. The derived interface is a little more interesting: it represents a list whose size increases automatically when you write to an index beyond the end of the array. If supported, the Optimize() method optimizes the data structure to consume less storage space (if not supported, it has no effect.)
Next, here are some interfaces that add or remove items "in bulk":
public interface IAddRange<T> : ICount { void AddRange(IEnumerable<T> e); void AddRange(IReadOnlyCollection<T> s); } public interface IListRangeMethods<T> : IAddRange<T> { void InsertRange(int index, IEnumerable<T> e); void InsertRange(int index, IReadOnlyCollection<T> s); void RemoveRange(int index, int amount); } public static partial class LCInterfaces { public static void Resize<T>( this IListRangeMethods<T> list, int newSize); }It's often more efficient to use a bulk insertion or deletion method rather than calling Insert or RemoveAt in a loop.
Next up, here are some interfaces that are used to work around the dreaded ambiguity errors that are caused by the fact that ICollection<T> is officially unrelated to IReadOnlyCollection<T>, and the fact that IList<T> is officially unrelated to IReadOnlyList<T>:
public interface ICollectionAndReadOnly<T> : ICollection<T>, IReadOnlyCollection<T> { } public interface IListAndListSource<T> : IList<T>, IListSource<T>, ICollectionAndReadOnly<T> { }Any collection that implements both ICollection<T> and IReadOnlyCollection<T> should implement ICollectionAndReadOnly<T>, and any collection that implements both IList<T> and IReadOnlyList<T> should go the extra mile and implement IListAndListSource<T>.
By itself, these interfaces don't solve any problems. To recap, if you write these two methods:
void Foo(IReadOnlyList<T> list) {...} void Foo(IList<T> list) { Foo(list.AsListSource()); }The C# compiler will give an "ambiguity" error when you try to call Foo(list), if the list implements both IList<T> and IReadOnlyList<T>. To work around this problem, it is necessary to define a third method that takes IListAndListSource<T>:
void Foo(IReadOnlyList<T> list) {...} void Foo(IList<T> list) { Foo(list.AsListSource()); } void Foo(IListAndListSource<T> list) { Foo((IReadOnlyList<T>) list); }This workaround only works if the list class implements IListAndListSource<T>, so all mutable collections in Loyc.Essentials and Loyc.Collections do so.
Next, here are some "enhanced" collection interfaces, which just add extra functionality to the standard interfaces. ICollectionEx combines ICollection with AddRange(list) and RemoveAll(lambda), while IListEx combines IList with ICollectionEx, InsertRange(index, list) and RemoveRange(index, count):
public interface ICollectionEx<T> : ICollectionAndReadOnly<T>, ISinkCollection<T>, IAddRange<T>, IIsEmpty { } { int RemoveAll(Predicate<T> match); } public interface IListEx<T> : IListAndListSource<T>, ICollectionEx<T>, IArray<T>, IListRangeMethods<T> { }Binumerators are enumerators that can move backward through a collection, not just forward. The IBinumerable interface can be implemented by a collection that supports binumerators. Mutable enumerators additionally allow elements of the collection to be modified during enumeration. Mutable enumerators optionally support removing the current element.
public interface IBinumerator<T> : IEnumerator<T> { bool MovePrev(); } interface IMEnumerator<T> : IEnumerator<T> { /// <summary>Gets or sets the value of the current item.</summary> new T Current { get; set; } /// <summary>Removes the current item and moves to the next one. Remember /// NOT to call MoveNext() immediately after Remove().</summary> /// <returns>True if there is a next item after this one, /// false if the removed item was the last one.</returns> /// <exception cref="NotSupportedException">The collection does not permit /// this operation.</exception> bool Remove(); } interface IMBinumerator<T> : IBinumerator<T>, IMEnumerator<T> { } interface IBinumerable<T> { /// <summary>Returns a binumerator that points before the beginning of /// the current collection.</summary> IBinumerator<T> Begin(); /// <summary>Returns a binumerator that points after the end of the /// current collection.</summary> IBinumerator<T> End(); }Finally, here are the range interfaces, which were inspired by the D programming language. These were described in a much earlier article but I tweaked the interfaces after that article was written. You can see the current documentation in the source code; I won't say anything more about these interfaces here, since this article is getting pretty long already.
public interface IFRange<out T> : IEnumerable<T>, ICloneable<IFRange<T>>, IIsEmpty { T Front { get; } T PopFront(out bool fail); } public interface IMFRange<T> : IFRange<T> { new T Front { get; set; } } public interface IBRange<out T> : IFRange<T>, ICloneable<IBRange<T>> { T Back { get; } T PopBack(out bool fail); } public interface IMBRange<T> : IBRange<T>, IMFRange<T> { new T Back { get; set; } } public interface IRange<out T> : IBRange<T>, IListSource<T>, ICloneable<IRange<T>> { } public interface IBRangeEx<R, T> : IBRange<T> where R : IBRangeEx<R, T>, ICloneable<R> { IEnumerable<T> InnerList { get; } int SliceStart { get; } R Intersect(R other); R Union(R other); } public interface IBRangeEx<T> : IBRangeEx<IBRangeEx<T>, T>, ICloneable<IBRangeEx<T>> { } public interface IRangeEx<R, T> : IRange<T>, IBRangeEx<R, T> where R : IRangeEx<R, T>, ICloneable<R> { } public interface IRangeEx<T> : IRangeEx<IRangeEx<T>, T>, ICloneable<IRangeEx<T>> { } public interface IMRange<T> : IMBRange<T>, IRange<T> { new T this[int index] { get; set; } }That's it! That's all the general-purpose collection interfaces.
Performance: one flaw I didn't fix
It bothers me that the IEnumerator<T> interface requires two interface calls to retrieve each element of a collection: MoveNext() and Current. That's just a pointless waste of clock cycles. In the past, I defined an interface that fixes that problem by moving to the next item and returning it with a single method call. It was called Iterator<T> and it was a delegate (alternate possible design: a derived interface of IEnumerator<T>):public delegate T Iterator<out T>(ref bool fail);Note that I really would have preferred to use the signature bool Iterator<T>(out T value) instead, but .NET this second version of the method is illegal. Why? Unfortunately, the CLR actually doesn't have the concept of an "out" parameter; to the CLR, "out" and "ref" are the same thing. Therefore, type T would have to be invariant, <out T> would not be allowed. The latter signature can still be provided as an extension method, at the cost of some performance.
I actually wrote a whole implementation of LINQ for Iterator and IIterable<T> (the counterpart to IEnumerable<T>), but then I decided to abandon the concept before I got around to benchmarking the LINQ implementation (early microbenchmarks showed that IIterable was modestly faster when called directly, but not when using the extension method).
Ultimately I decided that I didn't want to burden potential implementors of collection classes by including IIterable<T> in the interface hierarchy. that is, it seemed natural to derive other interfaces such as IListSource<T> from IIterable<, but this placed a substantial burden on my collection class implementations, and might have confused end-users. Obviously, if you're writing a collection class, you want to support things like foreach loops and traditional LINQ, so you will definitely implement IEnumerable<T>; IIterable<T> can only be a bonus, not the central mechanism of enumeration. Then you have the dillema of whether to optimize your code for IEnumerable<T> or IIterable<T>, plus there's the fact that C#'s iterator feature supports only IEnumerator, not Iterator. So I found that the implementation costs for Iterator tended to be high, while few people would appreciate or take advantage of the performance enhancement.
For certain applications that need both high performance and flexibility, another technique for avoiding interface calls makes sense: returning data in groups, such as arrays. Consider this method of ICollection<T> that is almost never used:
void CopyTo(T[] array, int arrayIndex);Imagine that method was gone and replaced with this method in IList:
// copy this[startIndex..startIndex+count] to array[arrayIndex..arrayIndex+count] void CopySlice(T[] array, int arrayIndex, int startIndex, int count);Now you would have a technique for optimizing access when you need it. Rather than requesting the entire list as an array, which requires unbounded memory, you request a section. How can this be used to optimize a program? Well, you can read the list in small blocks, e.g., 50 elements at a time. Then you can use a for-loop like this one:
for (int i = 0; i < array.Length; i++) { /* do something with array[i] */ }which avoids the cost of (1) interface calls and (2) bounds checking (the JIT eliminates the bounds check in a loop that is structured this way), in exchange for the need to allocate a small temporary array (which is usually cheap). Obviously, this is not a convenient way to access a list; again, this would only be for the (rare?) situation that calls for high performance, but still requires the flexibility of interfaces (so you can swap in multiple implementations).
For Loyc, I decided to define a specialized interface for lexers:
public interface ICharSource : IListSource<char> { /// <summary> /// Returns a substring from the character source. If some of the /// requested characters are past the end of the stream, the string /// is truncated to the available number of characters. /// </summary> /// <param name="startIndex">Index of first character to return. /// If startIndex >= Count, an empty string is returned.</param> /// <param name="length">Number of characters desired.</param> new StringSlice Slice(int startIndex, int length); }This interface was created to read characters efficiently. Although a lexer could read characters one-at-a-time from IReadOnlyList<char> or IListSource<char>, it requires at least one dynamic interface dispatch for every character (multiple, if the lexer reads the same character more than once). On the other hand, if lexers avoid this overhead by requiring the entire file in the form of a string, it becomes necessary to hold the entire file in memory at once, in a very specific format (one huge string), which may not be ideal.
StringSlice is a structure in Loyc.Essentials that represents a slice of a string (a tuple of (string str, int startIndex, int length)). Slice() is a compromise that allows the lexer to request small pieces of the file that it can read without dynamic dispatch. If the file happens to be stored as a string, Slice() can simply return a slice of that string (no memory allocation required), otherwise it will build a string and return that.
It's unfortunate that .NET treats strings as something completely different than arrays. In a perfect world (or in D), an interface like this one could support not just substrings, but read-only subarrays of any element type, which would be useful any time you want to optimize your code by reducing dynamic dispatch.
By the way: we're doing it wrong.
Defining zillions of interfaces isn't the best solution.As I said, the possible combinations of operations that data structures may or may not support is just about endless. Some may support adding items but not removing; others may support removing but not adding; some lists may be read-only but have a specialized, fast IndexOf() method (recall that IListSource<T> does not include IndexOf()). It would be crazy to define a separate interface, and invent a separate name, for every possible combination of operations. Inevitably, there are some data structures for which Loyc.Essentials will offer no appropriate interface. But in .NET, that's inevitable.
My favorite thing about the Go language (which, admittedly, I haven't actually used) is the way they do interfaces. In Go, a data structure does not have to explicitly declare in advance every interface it satisfies. Instead, a data structure can satisfy any interface defined by a third party if it provides the necessary operations. If .NET worked like this, then it wouldn't matter that List<T> does not say it implements IReadOnlyList<T>; you would still be allowed to pass it to a method that takes IReadOnlyList<T>.
So, suppose I define an indexable data structure that supports add-at-the-end, but no other modifications. Then I could define an interface that exactly matches my data structure:
interface IAppendableList<T> : IReadOnlyList<T> { void Add(T item); } class GuestBook : IAppendableList<GuestEntry> { ... }Later, someone else who is using MyDataStructure<T> in their code could decide to declare a method that accepts IAppendableList<T>:
void Foo(IAppendableList<T> x) { ... }We can tell that Foo() will probably add items to x, but won't remove items. If .NET supported Go's ability to adapt to new interfaces with almost no extra run-time cost, it would not only accept an argument of type GuestBook, but it would also accept any of the standard collection classes like List<T>. Nice. This feature could neatly fix flaws 2 and 3 above, even if Microsoft itself never defined any read-only or specialized interfaces.
IMO, this is how it should be, and I would point out that by not supporting this feature, it is not practical for .NET as a platform to support languages like Go that do support it (so much for being a "common language runtime"). Now, I assume we can't add this feature directly to C# because it would break existing code, but the feature could still be supported in .NET itself, if Microsoft had the willpower.
Some people argue against this feature, saying that an "interface" is more than just a set of methods but a contract for how those methods behave. And in Loyc.Essentials itself you'll see two interfaces that reflect that idea:
public interface IQueue<T> : IPush<T>, IPop<T>, ICount { } public interface IStack<T> : IPush<T>, IPop<T>, ICount { }The way I defined queues and stacks is that both of them have the same methods--pushing, popping, and Count--but because they behave quite differently, I defined two separate interfaces. So the argument against "Go interfaces" makes sense, but in my opinion their advantages clearly outweigh the disadvantages, and anyway the feature already exists in some languages, so if Microsoft truly wants a many-language platform, they must make features like this possible, and efficient.
Another argument in favor of Go interfaces is that they make it easier to satisfy dependencies. For example, I defined ICharSource inside Loyc.Essentials even though the interface was designed for lexers (which suggests Loyc.Syntax.dll would be a better place for it), because Loyc.Essentials contains StringSlice and I wanted StringSlice to implement ICharSource. Since Loyc.Syntax depends on Loyc.Essentials, the only way this can work is if ICharSource is part of Loyc.Essentials. Luckily I control both DLLs, otherwise StringSlice could not implement ICharSource.
In general, I find it to be a big nuisance that when two DLLs define identical interfaces, with the same name in the same namespace, the CLR considers them completely unrelated and incompatible, and there is no workaround. If two DLLs want to implement the same interface, they must always have a reference to a third DLL that contains the interface, and both DLLs must rely on the same version of that third DLL even if the interface in question is identical across different versions. The Go approach gets you out of versioning hell (at least it would in the .NET context. Whether Go itself has a similar issue, I have no idea). But I digress...
Conclusion.
Loyc.Essentials is a library of "stuff that should be in the BCL, but isn't." Although it was created for the Loyc project, it is meant as a general-purpose library for any developer's .NET toolbox. If there's a general-purpose interface that you think should be part of the BCL, write it in a comment and we'll discuss whether it should be added to Loyc.Essentials.dll.Some of these interfaces are used heavily in Loyc, while others are not used at all. In future posts I'll introduce the other classes, structures, and handy methods of Loyc.Essentials.dll, and then I'll move on to Loyc.Collections.dll.
Edit: I missed an interesting interface: IEnumeratorFrame<Frame, T>, which goes with the NestedEnumerator<Frame, T> structure, and is used for enumerating tree-like data structures or manually-implemented coroutines. See the code if you're curious, although this interface would work best as part of a built-in compiler feature for stream flattening like Cω had.
Friday, January 31, 2014
Using Loyc.Essentials: introduction
Loyc.Essentials is a library (DLL) of general-purpose code that supplements the .NET BCL (standard libraries). I hope that one day it will serve as a starting point or inspiration for a multi-language standard library, which will be a standard library that you can use in a wide variety of programming languages (including outside the .NET ecosystem). Currently, however, Loyc.Essentials is designed specifically for .NET, and it is designed to supplement, not replace, the core BCL classes. The Loyc libraries contain only "safe", verifiable code.
Loyc.Essentials contains the following categories of stuff:
Loyc.Essentials contains the following categories of stuff:
- Collection stuff: interfaces, adapters, helper classes, base classes, extension methods, and implementations for simple "core" collections.
- Geometry: simple generic geometric interfaces and classes: e.g. IPoint<T>, IPoint3<T>, Vector<T>, Vector3<T>, IRectangle<T>, LineSegment<T>, BoundingBox<T>
- Math: generic math interfaces that allow arithmetic to be performed in generic code, based on math interfaces originally designed by Rüdiger Klaehn. Also includes fixed-point types, 128-bit integer math, and handy extra math functions in MathEx.
- Other utilities: message sinks (IMessageSink), object tags (ITags, HashTags), ICloneable<T>, interface adapter generation, Localization (Localize class), Symbol, threading stuff, a miniture clone of NUnit (MiniTest, RunTests), a pluggable internationalization mechanism (Localize class), EzStopwatch, WeakReference<T>, extra general-purpose exception types derived from InvalidOperationException (e.g. InvalidStateException, ReadOnlyException) and miscellaneous low-level functionality including more extension methods.
- Compatibility: a very small amount of .NET 4.5 stuff, backported to .NET 4.0 (I'm thinking about using a separate DLL for this, maybe using Theraot's libraries)
Subscribe to:
Posts (Atom)