Friday, May 21, 2010

New features the .NET framework should have

I recently wrote a class called CPTrie that stores a sorted collection of strings or integers in less space than a Dictionary or SortedDictionary. It took a long time to develop this data structure in .NET while minimizing memory and CPU usage. I am fairly convinced that without some of .NET's restrictions, I could have made an even more compact data structure.

Allow me to suggest a few features that would make .NET better by allowing it to do certain tasks more efficiently, or by giving coders more flexible design options.

It is already easier to write efficient code in .NET than Java, of course, thanks to features like structs (value types), Generics and compiled expression trees. But it could be better.

Slices

I love the type safety that .NET provides, but people seem to forget that type safety doesn't have to be a straightjacket. They think, for instance, that pointers automatically kill type safety, apparently forgetting that "references" are nothing but neutered pointers. Pointers aren't inherently dangerous, rather it is certain operations that are dangerous, like reading or writing past the end of an array. If your language can prevent you from doing dangerous things with pointers then they suddenly become safe.

Google Go recognizes this, and provides a safe form of pointer called a slice. A slice is a pointer to an array that stores the array size with the pointer. This is not the same as a .NET array reference, which always points to the beginning of the array. A slice can point to a subset of an array, and if .NET had slices, all functions of the form
int Sum(int[] array, int startIndex, int endIndex) {...}

could be replaced simply with
int Sum(int[..] array) {...}

And you would probably call the function with a nice syntax like
int result = Sum(array[startIndex..endIndex]);
// or Sum(array) to pass the whole array
Plus, these functions often come with a "convenience overload" to process the entire array...
void Foo(int[] array) { Foo(array, 0, array.Length); }
...which would not be necessary with slices.

Slices can be more efficient than the traditional triplet (array, start, end) for three reasons:
  1. A slice is two values (pointer + length). Passing it around is faster than passing around three values.

  2. Triplets use 50% more memory (relevant if you need to store many of them)

  3. All functions that access an array through a triplet must check the array index on every access. For instance, in today's .NET, Sum() might be written like this:
    void Sum(int[] array, int startIndex, int endIndex)
    {
    // I avoid using the "less than" sign because Blogger is still
    // terribly buggy after several years. Lazy bums.
    if (0>startIndex || startIndex>endIndex || endindex>array.Length)
    throw new ArgumentException();

    int total = 0;
    for (int i = startIndex; array.Length > i; i++)
    total += array[i];
    return total;
    }

    With slices, the code is shorter and the JIT optimizer could do a better job:
    void Sum(int[..] slice)
    {
    int total = 0;
    for (int i = 0; slice.Length > i; i++)
    total += slice[i];
    return total;
    }

    Microsoft .NET can detect that a for loop goes from 0 to the array length, and optimize out the range check. It should be easy to do the same for slices, which could make a method like this significantly faster, yet still perfectly type-safe. Meanwhile, the JIT is unable to optimize the first loop (the one that uses an array) because it is too complex, and there may be additional overhead because the JIT can only hold 2 or 3 variables in registers.

Slices would be even more useful if you could cast them in type-safe ways: for instance you should be able to cast a int[4] slice to a short[8] slice, and you should be able to cast a short[7] to an int[3] (there is the possibility of an alignment exception in the latter case on certain processors, but it is still "safe" in the sense that memory corruption is impossible). Slice casts would be extremely useful for parsing various binary data blobs (e.g. binary files, network packets), which are difficult to deal with in .NET right now.

Such casts are perfectly safe, provided that casts involving reference types are prohibited.

There's one disadvantage of allowing the slice cast: the optimizer would have to assume that a value-type slice may point to the same physical memory as a different kind of slice or array. But C has the same problem and it still wins the speed race.

One roadblock to implementing slices in .NET may be a challenge to the garbage collector. In .NET today all references point to the beginning of an object. Slices can point to the middle of an object, which means that the garbage collector needs a way to find the beginning of an object given a pointer to the middle. If this is not possible, slices will have to be expanded to a triplet: a third value is required, pointing to the beginning of the object being kept alive by the slice.

A slice in triplet form would still have a performance advantage over normal (array, index, length) triplets, because while the pointer to the beginning of the object would have to be passed around, it never has to be accessed except by the garbage collector. On x86 this means it can be held only on the stack and not in a register, saving scarce registers for other things.

Programmers should be able to get the slice for an entire array given a slice of just part of the array. However, if this is allowed, note the security implication: anyone returning a slice to a partial array must be aware that the receiver can get access to the entire array.

Slices that point to null are also possible. .NET would have to guarantee that null slices always have a Length of 0. Note that null slices are advantageous over null array references, because you can access the Length field of a slice without risking a NullReferenceException, so you need not write code to handle both "null" slices and "zero-length" slices.

Fixed-size arrays

In some cases it would be really nice to have zero-overhead fixed-size arrays. For instance, I have seen people representing points like this:
class Point {
private float[] _coords = new float[3];
public float X { get { return _coords[0]; } set { _coords[0] = value; } }
public float Y { get { return _coords[1]; } set { _coords[1] = value; } }
public float Z { get { return _coords[2]; } set { _coords[2] = value; } }
public float this[int n] {
get { return _coords[n]; } set { _coords[n] = value; }
}
}
This is very inefficient if you need to keep track of large numbers of points, since the 12 bytes of coordinate data is accompanied by 20 bytes of overhead: 8 bytes for the class and 12 bytes for the array. Moreover, all access to the X, Y and Z properties is guarded by an array range check, even though the subscripts and array size are constants. Only the indexer should require a range check.

To solve these problems, .NET should offer "safe" fixed-size arrays that hold the array directly in the container class or struct. This would save 12 bytes of overhead and allow faster access to the array, as well as faster object construction.

You can already use fixed-size arrays in .NET, but they are considered "unsafe" and have the following disadvantages:
  • They require you to sprinkle the "unsafe" keyword all over the place.
  • You have to set a compiler option to allow "unsafe" code.
  • Your code cannot run in secure environments, such as inside a web browser.
  • Fixed-size arrays cannot hold references to classes, or structs (only primitives are allowed).

Provided that there are array index range checks, there is really nothing unsafe about a fixed-size array, and no reason they can't hold references or structs. The reason for these silly restrictions (I think) is that the CLR doesn't have built-in support for fixed-size arrays, so C# resorts to pointer arithmetic, which is unverifiable to the CLR (and in that sense "unsafe").

Fixed-size arrays would work very nicely with slices, so that you could offer access to the array via a property:
class Point {
private fixed float _coords[3];
...
public float[..] Coords { get { return _coords; } }
}
However, this would only be legal inside classes. A struct could not return a slice of itself, since a struct could cease to exist before the slice does.

Go interfaces

In Go, classes do not have to explicitly say what interfaces they implement, and anybody that notices a pattern in the methods of two otherwise unrelated classes can declare that they both implement a certain interface.

Go's interfaces provide compile-time type checking, and since Go is built heavily around these "implicit" interfaces, I assume Google found a way to give them high performance similar to .NET interfaces. As such I think Microsoft would be smart to consider adding support for Go-style interfaces, which in turn would allow a fast implementation of the Go language on the CLR.

Go-style interfaces would allow developers to work around limitations in the way Microsoft designed interfaces. For instance, there are a lot of read-only collection classes (or collections like Stack where modifications are constrained), but Microsoft didn't bother to make a IReadOnlyCollection interface, so we're stuck writing four unsupported methods (Add, Clear, Remove, CopyTo) in addition to the three that matter (GetEnumerator, Count, Contains).

With Go interfaces, you can define your own private IReadOnlyCollection interface which all the BCL collections, and any collections made by third parties, automatically implement.

(I am not, of course, suggesting that traditional interfaces be abandoned. They are too well entrenched and, I have learned, are faster in some cases anyhow.)

Update: it turns out this feature would be insufficient to support Go. In Go you can define new methods for classes defined elsewhere (or for primitive types, pointers, etc.) These could be compared to extension methods, but Go is not object-oriented and it's not really the same. In Go, as I understand it, an interface can bundle together methods defined in multiple modules that happen to operate on the same data type. I don't understand how it works - I haven't been able to find information on their dispatch mechanism. Anyway, the point is, if MS really wanted to support Go on .NET they'd have to study it more carefully than I have.

Go interfaces are comparable to Visual Basic 9's new "dynamic interfaces", except that in Go you get a error when you try to cast an object to an interface that the object does not support. In VB an exception occurs when you call an unsupported method on the interface, not during the cast.

Return type covariance

If class Circle is derived from class Shape, then the following code is perfectly safe and should be legal:

abstract class ShapeFactory {
abstract public Shape New();
}
abstract class CircleFactory {
override public Circle New() { ... }
}
I have encountered many situations where covariance would be useful, and it pains me to have to find a different design or implement a workaround just because Microsoft won't add this trivial feature.

Thread-local variable inheritance

Just a small pet peeve. When you create a child thread, it loses all the values of thread-local variables from the parent. There should be an option to copy values from the parent thread. I heard a rumor that somebody has a patent on this trivial idea, though.

Update: My own Ambient Service Pattern cannot work reliably without this feature.

Microsoft doesn't actually have to infringe that ridiculous patent to support the feature, either: they can simply (1) fire an event on each child thread when it starts, while blocking the parent thread, and (2) provide a way for the child thread to access thread-local variables of the parent thread. This way, Microsoft is merely giving developers the tools they need to make it possible to infringe the patent. Possible is a big improvement over impossible, friends.

Binary-comparing two value types

Another trivial feature that Microsoft left out of .NET is the ability to bitwise-compare two values. While developing the VList data structures for .NET, I really wanted this feature. There's no reason the 'ceq' opcode shouldn't support it, and as far as I can tell the ECMA-335 standard does not state whether 'ceq' can be used on arbitrary value types. However, C# doesn't allow it so I assume it's not supported.

It's true that bitwise equality isn't "perfect". For instance, floating point has a "positive zero" and a "negative zero" that compare equal, but are bitwise unequal. However, if two value types ARE bitwise equal then they are the same for all practical purposes, and it's hard to think why you would waste time calling Equals() (which, in structs, uses reflection by default!)

In the case of reference types you can call object.ReferenceEquals to find out if two references are the same, but what if you are writing generic code that handles both values and references? In that case you cannot use object.ReferenceEquals, which is why a bitwise comparison operator would be so useful.

A concrete example of this occurs in my VList code, which has a "Smart Select" method:

public FVList<T> SmartSelect(Func<T, T> map)
This method of FVList passes each element of the FVList to the map() function. If, for every element of the list, map() returns the same value as it was given, SmartSelect() can simply return the original list instead of a new one. The ability to do this is very important for efficient functional programming; unfortunately, .NET screws it up by making it difficult to tell whether map() returned its argument unchanged or not. We have to use virtual function calls, and maybe even reflection, to tell whether the output is the same as the input. That's stupid.

Integers in pointers

Okay, what I'm about to suggest may sound crazy. You have been warned.

In .NET, all references have 0s in the bottom two bits of the pointer. This is also true for C++ pointers to heap objects, and the Ruby language interpreter and the Judy array library use this fact to store a kind of safe union in a pointer:
union {
Object* pointer;
int integer;
}
If the two lowest bits are 0, the value is a pointer, otherwise it is an integer of some kind. A similar technique is to use the lowest bit to discriminate a union between two pointer types:
union {
Object0* pointer0; // if low bit is 0
Object1* pointer1; // if low bit is 1
}
While abusing pointers like this in C is dangerous (since you could easily dereference an integer by mistake, etc.), formalizing the technique can make it safe (just as the Ruby language is safe from memory corruption despite using this kind of union extensively).

In C#, the second union would be almost pointless since objects are self-typed, but it might be useful instead if you could use the low bit of a pointer as a boolean flag. For instance, .NET's SortedDictionary uses the following node structure:
internal class Node
{
private bool isRed;
private T item;
private Node left;
private Node right;
}
There's something to be said for a straightforward implementation like this, but if somebody's using this class extensively with a large data set, they might appreciate the memory savings (up to 17%) that would come from hiding the "isRed" flag inside the 'left' or 'right' reference.

Of course, this technique is pretty obscure, and is rarely needed, yet occasionally it is the only efficient technique to do the job. In Ruby this feature is key to the type system that unifies integers and objects. In Ruby "everything is an object", but storing integers on the heap is very inefficient. Although all Ruby variables are implemented as pointers, if you store a small integer in a variable, rather than allocating a heap object, Ruby just encodes the integer right into the pointer. Can you think of an approach that is remotely as efficient? Similarly, the Judy array's ability to store data compactly is heavily reliant on the fact that it can change the meaning of a "pointer" based on its low bits. I wanted to port Judy to .NET, and struggled for quite some time before concluding that approximating Judy is impossible in .NET.

To minimize changes to the CLR, I would propose a single pointer union be defined, in which the 2 low bits are given the following meanings:
  • 00: A pointer with a "false flag"
  • 10: A pointer with a "true flag"
  • 01 and 11: An integer (shift right 1 to get its value)
To store these values, the .NET BCL would define a special structure:
struct PointerOrInteger<T> where T:class
{
object _ptr; // not directly accessible
bool IsInteger { get; } // returns ((int)_ptr & 1) != 0
T Pointer { get; set; } // getter returns IsInteger ? null : _ptr & ~3
IntPtr Integer { get; set; } // getter returns IsInteger ? (IntPtr)_ptr >> 1 : 0
int Int32 { get; set; } // getter returns IsInteger ? (int) _ptr >> 1 : 0
bool Flag { get; set; } // getter returns (_ptr & 2) != 0
}

The garbage collector could use reflection/metadata to detect the presence of PointerOrInteger, in order to single it out for special treatment. However, it might be simpler if the GC assumed any pointer could be PointerOrInteger and therefore tested low two bits of every pointer before following it. If the two bits are 0, it can be followed normally; otherwise it is presumably a PointerOrInteger and the pointer must be retrieved through the Pointer property.

The disadvantage of this feature would be that the garbage collector would have to be aware of these special pointer types; code to deal with them might slow down the GC slightly, even in code that doesn't use the special pointers. However, this feature could probably be added without modifying CIL or the JIT, as the code to manipulate them could be neatly tucked away in PointerOrInteger.

SIMD Vector types

Microsoft ought to copy SIMD support from Mono. It's all explained here.

New features C# should have

The above features would need CLR changes, but many useful features need only language support.

Default interface implementations

Interfaces should be able to have default implementations. For instance, ICollection.Contains(T) could have a default implementation like this:

bool Contains(T item) default {
var c = EqualityComparer<T>.Default;
foreach(T t in this)
if (item.Equals(t))
return true;
return false;
}
Half the time this is exactly how one implements the Contains() method. Why require every collection author to write the same code? Of course, traits could also help us avoid code duplication and would be more powerful, so let me stick that on my list:

Traits

Read about them in this PDF.

Units

The traditional type system (roughly speaking) classifies data by its structure: an array holds data in a certain way; a linked list holds data in a different way. There is an orthogonal concept of type, the "unit" type, which says something about the data itself. For example, an integer may contain a quantity measured in bytes, kilobytes, kilometres, sectors or pixels. A double has a different physical structure than an int, but it could have the same unit type.

It is safe to add or subtract two values as long as they have the same units. I can add 10 bytes to 5 bytes and get 15 bytes, but if I add 5 bytes to 10 DWORDs there is probably an error in my code. Every engineer and scientist knows that if you multiply "2 kg" times "2 m/s^2", the result is "4 N", not "4 lbf" and code that computes the latter result might have destroyed the Mars Climate Orbiter.

A unit type checker like the one I wrote for boo would semi-automatically detect such errors and emit a warning message. Unit checking would work best if it had some support from the Microsoft BCL (Base Class Library). That's because a unit checker needs to track units as they pass through the BCL. For instance, if I write "Math.Abs(-7`pixels`), the unit checker has to be told that the return value of Math.Abs() has the same units as its argument. It would be nice if the BCL had attributes on its methods to provide this kind of information.

Tuples

C# should contain language support for tuples and "don't care" return values:

Tuple<int,int> DivMod(int n, int d)
{
return (n/d, n%d);
}
void Example()
{
int div, mod, five;
(div, mod) = DivMod(10, 3);
(five, _) = DivMod(25, 5); // "_" means "Don't care" (provided "_" is undefined)
}

Note that Tuple classes come with .NET Framework 4.0. For unpacking tuples, the compiler should not require the official Tuple class; instead it should allow any data type that has properties named Item1, Item2, etc. This would allow small tuples to be returned more efficiently in a structure:
struct Pair<A,B> {
public A Item1 { get; set; }
public B Item2 { get; set; }
}
Pair<int,int> DivMod(int n, int d)
{
return new Pair<int,int> { Item1=n/d, Item2=n%d };
}
void Example()
{
int div, mod;
(div, mod) = DivMod(10, 3);
}

Update: there are various other ways this idea could work, too. Some have proposed extending the concept of anonymous classes so they can be returned from methods, but this approach is problematic when independent DLLs are involved. Suppose one DLL return a (string Key, string Value) pair and an independent DLL accepts a pair with exactly the same signature. You should be able to pass the tuple directly from one DLL to the other, but that wouldn't work in practice because .NET doesn't support any kind of "structural typing".

For that reason I think an approach based primarily on the standard Tuple type would be best, with support for user-defined types when they are needed.

Inline variable declarations

The tuple feature would be more elegant if C# allowed variables to be declared where they are first used:
void Example()
{
(var div, var mod) = DivMod(10, 3);
}

When I first wrote this article, I said I would like to be able to just create variables implicitly like in boo:
int ReadNumber()
{
do { line = Console.ReadLine(); }
while (!int.TryParse(line, out value));
return value;
}

No doubt, however, this feature would create even more uproar than the "var" statement did. I think, though, that syntax like "int.TryParse(line, out var value)" is a reasonable compromise.

I have another idea for making code more concise and easier to optimize, namely, a ":" or ":=" operator to create "cached" values. You see, it's really easy to write code that repeats work unnecessarily:
int GetFoo(object key, int seed)
{
if (BTree[key].Record != null)
return BTree[key].Record.GetFoo(seed);
else
return seed;
}

Sure, the expression BTree[key].Record might be expensive to compute twice, but you'd have to do extra work to refactor it into a special local variable. Nobody wants that! So many/most developers will take the performance hit, and not worry unless it starts to get slow on their 3 GHz workstation with 8 GB of RAM. Too bad half of the users are on netbooks.

I strongly believe C# should make it easy to write fast code. This kind of case is the most common optimization I can think of that cannot be automated: only the developer knows if a result is safe to cache. So I propose this syntax for caching values:
int GetFoo(object key, int seed)
{
if (r:BTree[key].Record != null)
return r.GetFoo(seed);
else
return seed;
}

The "x:y" operator would simply be a more compact form of "var x = y" with a higher precedence than most other operators, to make it easy to use inside expressions.
Regrettably, "x:" cannot be allowed at the beginning of a statement because it could be mistaken for a goto-label (you'd have to use var x = y in that case.)

I chose this syntax because it minimizes clutter. It looks good without any whitespace around the colon, which is important to convey the idea that the operator has high precedence. High precedence, in turn, is important to reduce the number of cases where you need parenthesis, since parenthesis increase typing effort and reduce readability.

When used in an if-statement or loop expression, it's hard to say if the scope of the new variable (or value--maybe it should be read-only?) should extend outside the if-statement or loop expression. Perhaps if the user writes "var x = y" the scope should be limited to the loop (because the for and foreach loops already work that way), but if the user writes "x:y" then the scope should extend to the containing block. After all, there are many cases where you might want to use the cached value later on.

Cω Streams

This I'll make into a separate post.

3 comments:

Kocsis Dávid said...

You've written some great features, but I thinking about fixed-size arrays and slices a little differetly:

I would call them range, and its type would be a fixed array. e.g. 0..10 would be int[11] (if the length is specified then its a fixed-size arrays). And when you use them as indices it would same as yours. But you would be also able to do this:

int[4] NewArray = Array[[0, 2, 1, 3]]

And Array could be int[], int[4], int[5], etc. It does the same. The changing the order of an array would be like this:

Array[0 .. Array.Length - 1] = Array[Array.Legth - 1 .. 0]
Array = Array[Array.Legth - 1 .. 0]

But If you would use the latter it would allocates a new array.

Qwertie said...

I honestly do not know what you are talking about.

Qwertie said...

Hmm, I guess you are saying that you would like the number "4" to be a part of the data type like in C. But surely situations where you want a variable-sized array are far more common than situations where you want a fixed-sized array.

My post already suggests a fixed-sized array feature, but it's intended more as an optimization hint than as a first-class data type.