Saturday, August 24, 2013

Protobuf-net: the unofficial manual

Protobuf-net is a fast and versatile .NET library for serialization based on Google's Protocol Buffers. It's one of those libraries that is both widely used and poorly documented, so usage information is scattered across the internet. Writing a blog post with lots of random information might not solve the situation, but it's better than nothing, right? The majority of the web documentation is in GettingStarted, and small amounts of information exist in XML doc-comments in the source code (which will appear automatically in Visual Studio IntelliSense provided that your protobuf-net.dll file has a protobuf-net.xml file beside it.) The doc-comments are rarely more than a vague hint, though, about what something does or means.

So here are some things I've learned about this library that may not be obvious. I've mentioned a bunch of things that I don't know or that I merely inferred. If you know something I don't, leave a comment!

Overview

Protobuf-net is not compatible with the standard .NET serialization system. It can be configured to behave in a fairly similar way, but one should be aware that
  • protobuf-net ignores the [Serializable] attribute
  • protobuf-net does not support custom serialization using ISerializable and a private constructor, and AFAIK it does not offer something very similar.
However, it does offer several approaches to serialization on a type-by-type basis (see next section).

Protobuf is, of course, limited by the way protocol buffers work. Protocol buffers do not define a concept of "record type" (the deserializer is supposed to know the data type in advance, so you must specify a data type at the root level when deserializing), they have only a few wire formats for data, and all fields in a protocol buffer have only numeric identifiers (called "tags"), not strings. That's why you're supposed to put a [ProtoMember(N)] attribute on every field or property that you want to be serialized.

Standard Protocol Buffers offer no way of sharing objects (using the same object in two places) or supporting reference cycles, because they are simply not designed to be a serialization mechanism. That said, protobuf-net does support references and cycles (more information below). Supporting references and cycles is more expensive in terms of time and output size, so it's disabled by default. There are at least two ways to opt-in:
  1. If you know that instances of a particular class are often shared, use [ProtoContract(AsReferenceDefault=true)]. This will apply to all fields of that type and collections that contain that type.
  2. You can opt-in to the reference system on an individual field using [ProtoMember(N, AsReference=true)]. References will be shared with any other field that also uses AsReference=true. If a certain object is placed both in fields that have AsReference=true and fields that do not, the fields where AsReference=false will hold duplicate copies of the object, and these copies will be deserialized into separate objects. I can only assume that AsReference=false will override AsReferenceDefault=true.
When you use AsReference=true on a collection type, the instances inside the collection are tracked by reference. However, it looks like the collection itself is not tracked by reference. If the same collection object is used in multiple places, it will be serialized multiple times and these copies will not be consolidated during deserialization.

Protobuf is equally comfortable serializing fields and properties, and it can serialize both public and private fields and properties (well, maybe private things won't serialize in Silverlight or partial-trust? I suspect there's a security prohibition in Silverlight). Fields and properties that do not have a [ProtoMember(N)] attribute are normally left unserialized, unless the class has the ImplicitFields option, [ProtoContract(ImplicitFields = ImplicitFields.AllPublic)] or [ProtoContract(ImplicitFields = ImplicitFields.AllFields)], which causes numbers to be assigned automatically.

I can't find documentation for how the numbers are auto-assigned, but it appears to assign numbers to your fields/properties in alphabetical order starting at 1. It will start numbering at 1 even if your class uses ProtoMember(N) explicitly on some fields. The ProtoMember attribute is not ignored, it just doesn't affect the numbering of the fields that don't use ProtoMember, and tag numbers may end up in conflict (e.g. if you use ProtoMember(1), the first auto-assigned number will still be 1, causing a conflict.)

When using ImplicitFields, protobuf-net ignores fields/properties marked [ProtoIgnore]. Just to be clear, if you don't use ImplicitFields, fields and properties are ignored by default and must be explicitly serialized with [ProtoMember(N)].

Protocol buffers don't have an inheritance concept per se, but protobuf-net supports inheritance if you specify [ProtoInclude(N, typeof(DerivedClass))] on the base class. There are multiple ways that inheritance could work. Protobuf-net's approach is to define an optional field in the base class for each possible derived class; the field number N in ProtoInclude refers to one of these optional fields. For example, the following classes:

[ProtoContract(ImplicitFields = ImplicitFields.AllFields)]
[ProtoInclude(100, typeof(Derived))]
[ProtoInclude(101, typeof(Derive2))]
class Base { int Old; }

[ProtoContract(ImplicitFields = ImplicitFields.AllFields)]
class Derived : Base { int New; }
[ProtoContract(ImplicitFields = ImplicitFields.AllFields)]
class Derive2 : Base { int Eew; }

have the following schema:

message Base {
   optional int32 Old = 1 [default = 0];
   // the following represent sub-types; at most 1 should have a value
   optional Derived Derived = 100;
   optional Derived Derive2 = 101;
}
message Derived {
   optional int32 New = 1 [default = 0];
}
message Derive2 {
   optional int32 Eew = 1 [default = 0];
}

Forms of type serialization in protobuf-net

I would say there are five fundamental kinds of [de]serialization that protobuf-net supports on a type-by-type basis (not including primitive types):
  1. Normal serialization. In this mode, a standard protocol buffer is written, with one field in the protocol buffer for each field or property that you have marked with ProtoMember, or that has been auto-selected by ImplicitFields. During deserialization, the default constructor is called by default, but this can be disabled. I heard protobuf-net lets you deserialize into readonly fields (?!?), which should allow you to handle many cases of immutable objects.
  2. Collection serialization. If protobuf-net identifies a particular data type as a collection, it is serialized using this mode. Thankfully, collection types do not need any ProtoContract or ProtoMember attributes, which means you can serialize types such as List<T> and T[] easily. (I don't know how protobuf-net will react if any such attributes are present on your "collection" class). I heard that dictionaries are supported, too.
  3. Auto-tuple serialization. Under certain conditions, protobuf-net can deserialize an immutable type that has no ProtoContract attribute by calling its non-default constructor (constructor with more than zero arguments). Luckily this is fully documented at the link. This feature automatically applies to System.Tuple<...> and KeyValuePair.
  4. String ("parsable") serialization. Protobuf-net can serialize a class that has a static Parse() method. It calls ToString() to serialize, and then Parse(string) to deserialize the string. However, this feature is disabled by default. Enable it by setting RuntimeTypeModel.Default.AllowParseableTypes = true. I don't know whether [ProtoContract] disables the feature.
  5. "Surrogate" serialization, which is very useful for "closed" types that you are not allowed to modify, such as BCL (standard library) types. Instead of serializing a type directly, you can designate a user-defined type as a surrogate using RuntimeTypeModel.Default.Add(typeof(ClosedType), false) .SetSurrogate(typeof(SurrogateType)). To convert between the original type and the surrogate type, protobuf-net looks for conversion operators on the surrogate type (public static implicit operator ClosedType, public static explicit operator SurrogateType); implicit and explicit both work. The conversion operator is invoked even if the "object" to be converted is null.
Let's talk a little about deserialization. Because in order to do that, it needs an object. There are three ways it can get an object:
  1. "Like XML serialization". In this mode, which is the default, your class must have a default constructor. Before starting to deserialize, your default (meaning zero-argument) constructor is called. However, unlike XML serialization, protobuf-net can call a private constructor.
  2. "Like standard serialization". The option [ProtoContract(SkipConstructor = true)] will deserialize without calling the constructor, like in standard serialization. Magic! If you add ImplicitFields = ImplicitFields.AllFields, protobuf-net behaves even closer to standard serialization.
  3. Apparently you can deserialize into an object that you created yourself, at least at the root level. Call the (non-static) RuntimeTypeModel.Deserialize(Stream, object, Type) method (e.g. RuntimeTypeModel.Default.Deseralize()). I don't know why it needs both an object and a Type. And perhaps it can do the same trick for sub-objects, I'm not sure. Anyway this is handy if you want to deserialize, examine, and discard several objects in a row; you can avoid creating unnecessary work for the garbage collector.
It should be noted that some techniques that protobuf-net supports are only available when using the full .NET framework in full-trust mode. As mentioned here, for example, if you're using the precompiler then you won't be able to deserialize into private readonly fields. Personally I am using the full .NET framework, and I don't know what gotchas lie in wait in other environments.

Protobuf-net collection handling

  • Protobuf-net serializes a collection using a "repeated" field (in protocol buffer lingo). Therefore, you should be able to safely change collection types between versions. For example, you can serialize a Foo[] and later deserialize it into a List.
  • I don't know exactly how protobuf-net decides whether a given type is a collection. Wild guess: it might look for the IEnumerable<T> interface and an Add method?
  • If you serialize a class that contains an empty but non-null collection, protobuf-net does not seem to distinguish an empty collection from null. When you deserialize the object, protobuf-net will leave the collection equal to null (or if your constructor created a collection, it will remain created).
  • By default, protobuf-net "appends" to a collection rather than replacing one. So if you write a default constructor (without suppressing it) that initializes an int[] array to 10 items, and then deserialize a buffer with 10 items, you'll end up with an array of 20 items. Oops. Use the [ProtoMember(N, OverwriteList=true)] option to replace the existing list (if any) instead (or use SkipConstructor).

Random facts and gotchas

  • Protobuf.net's precompiler let's you use protobuf-net on platforms where run-time code generation or reflection are not available. It can also be used on the full .NET framework to avoid some run-time work.
  • By default, protobuf-net will accept [XmlType] and [XmlElement(Order = N)] in place of [ProtoContract] and [ProtoMember(N)], which is nice if you're already using XML serialization or if you want to avoid explicitly depending on protobuf-net. Similarly, it accepts the WCF attributes [DataContract] and [DataMember(Order = N)]. The Order option is required for protobuf support.
  • [XmlInclude] and [KnownType] cannot be used in place of [ProtoInclude] because they do not have an integer parameter to use as the tag number.
  • Tag values must be positive. [ProtoMember(0)] is illegal.
  • Very useful: print the result of RuntimeTypeModel.GetSchema(typeof(Root)) to find out what protocol buffers will be used to represent your data (e.g. RuntimeTypeModel.Default.GetSchema). Note: I assume this will be the actual schema used for serialization, but the documentation says merely that it will "Suggest a .proto definition".
  • A constructor with a default argument like Constr(int x = 0) is not recognized as a parameterless constructor.
  • The SkipConstructor and ImplicitFields options are not inherited, and probably other options are not inherited either. So, for example, if you use SkipConstructor on a base class, the constructor of the derived class is still called (and, by implication, the base class constructor).
  • You may have noticed the "Visual Studio 2008 / 2010 support" download package on the download page, but what the heck is it for? I haven't tried it myself, but based on this blog entry, I suspect it's a tool for generating C# code from a .proto file automatically.
  • Sometimes protobuf-net can serialize something but not deserialize it; be sure to test both directions.
  • RuntimeTypeModel.DeepClone() is a handy way to test whether serialization and deserialization both work. This method will typically serialize the object and immediately deserialize it again.
  • I suspect you can use the standard [NonSerialized] attribute on a field or property, as an alternative to [ProtoIgnore].
  • When serializing a sub-object, protobuf-net can either write a length-prefixed buffer, or if the field that contains the sub-object has the [ProtoMember(N, DataFormat = DataFormat.Group)] option, it can use what Marc Gravell calls a "group-delimited" record, which avoids the overhead of measuring the record size in advance. Google calls this feature "groups", but it has deprecated the feature and AFAICT, removed any documentation about it that might have existed in the past.
  • In [ProtoMember], the default value of IsRequired is false. I can only assume it means the same as required in a .proto file. I'm guessing that a required field is always written and that there will be some sort of exception if a required field is missing during deserialization.
  • Protobuf-net supports Nullable<T>. I heard that a value of type int? or any other nullable type will simply not be written to the stream if it is null (therefore, I have no idea what happens if you use IsRequired=true on a nullable field--and that includes a reference-typed field.)
  • Protobuf-net will (reasonably) refuse to serialize a property that does not have a setter, saying "Unable to apply changes to property". It will, however, serialize a property whose setter is private, and call that setter during deserialization.
  • If you download the source code of protobuf-net via subversion, you'll have access to the Examples project, which demonstrates various ways to use the library.

Serializing without attributes

If you can't change an existing class to add [ProtoContract] and [ProtoMember] attributes, you can still use protobuf-net with that class. But before I tell you how, it should be mentioned that protobuf-net's configuration is stored in a class called RuntimeTypeModel (in the Protobuf.Meta namespace). There is one global model, RuntimeTypeModel.Default, and you can create additional models with the static method TypeModel.Create(). This makes it possible to serialize the same class in different ways, using different protocol buffers in the same program.

Suppose you have a RuntimeTypeModel object called model. Then model.Add(typeof(C), true) creates a configuration for type C, represented by a MetaType object. You can also call model[typeof(C)] to get or create a MetaType, although I'm not sure what the relationship is between model[type] and model.Add(type, flag) even after decompiling both of them.
  • Call model.Add(typeof(C), false).SetSurrogate(typeof(S)) to establish S as a substitute for C during serialization. If a surrogate is used, all other options for C are ignored (the options for S are used instead). If not using a surrogate, you should probably use model.Add(typeof(C), true) instead although I'm uncertain what the true flag actually does, whether it's equivalent to [ProtoContract] or does something else.
  • model[type].Add(7, "Foo").Add(5, "Bar") is equivalent to the attributes [ProtoMember(7)] on the field/property Foo, and [ProtoMember(5)] on the field/property Bar.
  • model[type].Add("Fizz", "Buzz", ...) assigns tag numbers sequentially starting from 1 or, if some tag numbers exist already, from the highest existing tag number plus one. So if the type has no fields defined yet, Fizz will be #1 and Buzz will be #2.
  • model[type].AddSubType(100, typeof(Derived)) is equivalent to [ProtoInclude(100, typeof(Derived))]. Example here.
Finally, call model.Serialize(Stream, object) or model.Deserialize(Stream, object, Type) (or another overload).

Data types

The following types illustrate how protobuf-net maps primitive types to protocol buffers:

C#
[ProtoContract]
class DefaultRepresentations
{
  [ProtoMember(1)] int Int;
  [ProtoMember(2)] uint Uint;
  [ProtoMember(3)] byte Byte;
  [ProtoMember(4)] sbyte Sbyte;
  [ProtoMember(5)] ushort Ushort;
  [ProtoMember(6)] short Short;
  [ProtoMember(7)] long Long;
  [ProtoMember(8)] ulong Ulong;
  [ProtoMember(9)] float Float;
  [ProtoMember(10)] double Double;
  [ProtoMember(11)] decimal Decimal;
  [ProtoMember(12)] bool Bool;
  [ProtoMember(13)] string String;
  [ProtoMember(14)] DayOfWeek Enum;
  [ProtoMember(15)] byte[] Bytes;
  [ProtoMember(16)] string[] Strings;
  [ProtoMember(17)] char Char;
}
.proto
message DefaultRepresentations {
  optional int32 Int = 1 [default = 0];
  optional uint32 Uint = 2 [default = 0];
  optional uint32 Byte = 3 [default = 0];
  optional int32 Sbyte = 4 [default = 0];
  optional uint32 Ushort = 5 [default = 0];
  optional int32 Short = 6 [default = 0];
  optional int64 Long = 7 [default = 0];
  optional uint64 Ulong = 8 [default = 0];
  optional float Float = 9 [default = 0];
  optional double Double = 10 [default = 0];
  optional bcl.Decimal Decimal = 11 [default=0];
  optional bool Bool = 12 [default = false];
  optional string String = 13;
  optional DayOfWeek Enum = 14 [default=Sunday];
  optional bytes Bytes = 15;
  repeated string Strings = 16;
  optional uint32 Char = 17 [default = 

(there's a bug in GetSchema; the output is truncated after a field of type char.)

You can look at the protocol buffer documentation, particularly Encoding, for more information. All of the integer types use the varint wire format. Because protobuf-net uses int32/int64 rather than sint32/sint64, negative numbers are stored inefficiently, but you can use the DataFormat option to choose a different representation, as shown below. sint32/sint64 (called ZigZag in protobuf-net) are better for fields that are often negative; sfixed32/sfixed64 (FixedSize in protobuf-net) are better for numbers have large magnitude most of the time (or if you just prefer a simpler storage representation).

By the way, string and byte[] ("bytes" in the .proto) both use length-prefixed notation. Sub-objects (aka "messages") also use length-prefixed notation (is this documented anywhere?) unless you use the "group" format.

C#
[ProtoContract]
class ExplicitRepresentations
{
  [ProtoMember(1, DataFormat = DataFormat.TwosComplement)] int defaultInt;
  [ProtoMember(2, DataFormat = DataFormat.TwosComplement)] int defaultLong;
  [ProtoMember(3, DataFormat = DataFormat.FixedSize)] int fixedSizeInt;
  [ProtoMember(4, DataFormat = DataFormat.FixedSize)] long fixedSizeLong;
  [ProtoMember(5, DataFormat = DataFormat.ZigZag)] int zigZagInt;
  [ProtoMember(6, DataFormat = DataFormat.ZigZag)] long zigZagLong;
  [ProtoMember(7, DataFormat = DataFormat.Default)] SubObject lengthPrefixedObject;
  [ProtoMember(8, DataFormat = DataFormat.Group)] SubObject groupObject;
}
[ProtoContract(ImplicitFields=ImplicitFields.AllFields)]
class SubObject { string x; }

.proto
message ExplicitRepresentations {
   optional int32 defaultInt = 1 [default = 0];
   optional int32 defaultLong = 2 [default = 0];
   optional sfixed32 fixedSizeInt = 3 [default = 0];
   optional sfixed64 fixedSizeLong = 4 [default = 0];
   optional sint32 zigZagInt = 5 [default = 0];
   optional sint64 zigZagLong = 6 [default = 0];
   optional SubObject lengthPrefixedObject = 7;
   optional group SubObject groupObject = 8;
}
message SubObject {
   optional string x = 1 [default = 0];
}

By the way, since Google doesn't seem to document the "group" format, I'll show you how the two sub-object formats look in binary:

[ProtoContract]
class SubMessageRepresentations
{
  [ProtoMember(5, DataFormat = DataFormat.Default)] 
  public SubObject lengthPrefixedObject;
  [ProtoMember(6, DataFormat = DataFormat.Group)]
  public SubObject groupObject;
}
[ProtoContract(ImplicitFields=ImplicitFields.AllFields)]
class SubObject { public int x; }
/*
message SubMessageRepresentations {
   optional SubObject lengthPrefixedObject = 5;
   optional group SubObject groupObject = 6;
}
message SubObject {
   optional int32 x = 1 [default = 0];
}
*/

using (var stream = new MemoryStream()) {
  _pbModel.Serialize(
    stream, new SubMessageRepresentations {
      lengthPrefixedObject = new SubObject { x = 0x22 },
      groupObject = new SubObject { x = 0x44 }
    });
  byte[] buf = stream.GetBuffer();
  for (int i = 0; i < stream.Length; i++)
    Console.Write("{0:X2} ", buf[i]);
}
// Output: 2A 02 08 22 33 08 44 34 

Vesioning

  • You can safely remove a serialized field (or a derived class) between versions of your software. Protobuf-net simply silently drops fields that are in the data stream but not in the class. Just be careful not to use the removed tag number again.
  • You must not change the value of AsReference (or AsReferenceDefault) between versions.
  • It's handy to save the results of GetSchema (mentioned above) to keep track of your old versions. You can then "diff" two schema versions to spot potential incompatibilities.
  • Don't change the storage representation of an integer between versions; your numbers will be silently messed up if you migrate between TwosComplement and ZigZag. In theory I think protobuf-net could handle a change from FixedSize to TwosComplement or vice-versa, but I don't know if it actually can. Similarly, a change from FixedSize int to FixedSize long could work in theory, but I don't know about practice.
  • On the other hand, you can increase the size of a non-FixedSize integer, e.g. byte to short or int to long, since the wire format is unchanged.
  • The Google docs have more information relevant to versioning.

How references work

When serializing a class Foo to which AsReference or AsReferenceDefault applies, the type of the field in the protocol buffer changes from Foo to bcl.NetObjectProxy, which is defined as follows in the source code of protobuf-net (Tools/bcl.proto):
message NetObjectProxy {
  // for a tracked object, the key of the **first** 
  // time this object was seen
  optional int32 existingObjectKey = 1; 
  // for a tracked object, a **new** key, the first 
  // time this object is seen
  optional int32 newObjectKey = 2;
  // for dynamic typing, the key of the **first** time 
  // this type was seen
  optional int32 existingTypeKey = 3; 
  // for dynamic typing, a **new** key, the first time 
  // this type is seen
  optional int32 newTypeKey = 4;
  // for dynamic typing, the name of the type (only 
  // present along with newTypeKey) 
  optional string typeName = 8;
  // the new string/value (only present along with 
  // newObjectKey)
  optional bytes payload = 10;
}
So it appears that

  • The first time an object is encountered, the newObjectKey and a payload fields are written; presumably, the payload is stored as if its type is Foo.
  • When the object is encountered again, just the existingObjectKey is written.
I don't know what that "dynamic typing" business is about.

Strings, of course, are a reference type. Read here about how protobuf-net handles strings.

More stuff I don't know yet

  • I don't know if protobuf-net is capable of deserializing a field of type object. By default, it can't.
  • I don't know how protobuf-net deserializes fields of an interface type, e.g. IEnumerable<T> or IFoo. (Serializing is easy, of course, it just uses GetType() to learn the type.)
  • I don't know whether the root object is allowed to be a collection.
  • I don't know how to run "prep" code before serialization of a particular type, or validation/cleanup code after an object is deserialized, but I do know that some sort of "callback" mechanism exists for this purpose.

Other sources of information: