
Introducing CODEF/CML

Overload Journal #74 - Aug 2006 + Programming Topics   Author: Fernando Cacciola
This article introduces a C# serialization facility that proposes a novel separation between object models, representing the serialized data, and class descriptors, representing the binding to the receiving design space.

I seldom find myself completely comfortable with a development framework. I can't help seeing a weakness here and there, and not even the .Net framework escapes this criticism. On the other hand, I have used and sweated over many different frameworks over the years, from Win32 in Windows 3.1 to DCOM and CORBA, and from that experience I have to admit that .Net gets its job done remarkably well. Being a long-term C++ programmer, I do find many areas that I would have liked to be different, but then I recall that there are ASP.Net, VB, Java and C# programmers out there who wouldn't agree with me on pretty much any of those points. Admittedly, the .Net framework allows all of us (programmers with different backgrounds, experiences and mindsets) to work together in an incredibly productive way; something I had never before had the opportunity to enjoy.

The lingua franca that .Net represents in a heterogeneous team got me to learn and even appreciate C#, which I've been using for the last two years.

I've also come to learn and appreciate many of the .Net subsystems, like Reflection, GDI+ (which is a royal pleasure for those with Win32 GDI experience) and particularly .Net serialization (from framework version 1.1).

.Net 1.1 serialization just worked for us out of the box without trouble until we got a complaint from the boss because files were unacceptably large. We started to work around that problem from within the framework but in the process we discovered how .Net Reflection was not being used to the extent it could, so we ended up writing our own replacement framework instead, which has been used in production code for more than a year now.

.Net serialization 101:

The .Net framework provides two serialization facilities. They are significantly different and serve different purposes.

One is called XML Serialization and it's mainly targeted at representing XML Documents as .Net objects. In this form of serialization only public fields are serialized, and the type information included in the XML file, if any, is driven by the target specification, such as SOAP, instead of the actual types of the objects being serialized.

The other is called just Serialization and it's mainly targeted at object persistence, that is, saving and loading an object graph out of core memory.

XML Serialization is not suited for persistence of application objects unless the types of these objects are supported in the target schema. Therefore, in this article, I will always refer to the second form of serialization and I will use the term to refer to the complete round trip process; that is, including deserialization.

.Net serialization is controlled at the highest level by a Formatter object. Via a Formatter object you can Serialize and Deserialize an object graph into and from a Stream. The framework provides two formatters: BinaryFormatter and SoapFormatter. The first stores binary data in the Stream and the second stores SOAP-encoded XML Elements. The SoapFormatter is similar in effect to XML Serialization but is not exactly the same.

Only those classes marked as [Serializable] are actually serialized when passed to a formatter.

If all you do is mark a class like that, all of its fields are automatically serialized without any further intervention. This is extremely programmer friendly, but like most magical things, when it doesn't work it really doesn't. In our case, the problem was the size of the files: they were just way too big for us.

The logical solution was obvious: in theory, not all of the data members need to be saved, and in our case that was particularly true. Our objects are geometric figures and they cache a lot of information like bounding boxes, polygonal approximations, lengths, areas, etc. None of these need to be saved since they can be recomputed on load.

Well, it turns out that you can mark fields with the [NonSerialized] attribute which prevents them from being saved and loaded. However, the deserialization process simply ignores them so this attribute alone is not enough if those fields are dependent, that is, their values must be computed after the other fields have been loaded. In that case, you must also implement the IDeserializationCallback interface which defines the method OnDeserialization called for each deserialized object after the complete object graph has been deserialized:

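using System;
using System.Runtime.Serialization;

[Serializable] class Circle : IDeserializationCallback
{
  // Called by the Formatter once the complete object graph has been deserialized
  public void OnDeserialization( object aSender )
  {
    m_area = Math.PI * m_radius * m_radius;
  }
  double m_center_x, m_center_y, m_radius;
  [NonSerialized] double m_area; // cached, recomputed on load
}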
    

If you make any change to a [Serializable] class and the Formatter finds a mismatch between the current class fields and the saved fields it will throw a SerializationException, even if the mismatch is just a field removed in the current class. Unfortunately, classes often change after they start being serialized. When that happens, you just need to ask .Net to hand you total control of the process.

When a [Serializable] class implements the ISerializable interface it makes itself completely responsible for the serialization/deserialization process. It is totally up to you to match the data saved and loaded with the object's state. The interface's single method, GetObjectData, allows you (and requires you) to fill in a dictionary called SerializationInfo, which is what the Formatter actually stores in the Stream as a representation of your object. You still need to mark the class as [Serializable], though. Also, because interfaces cannot define constructors, ISerializable only defines the method used on save; the deserialization counterpart, a constructor that takes a SerializationInfo dictionary and restores the object state from the loaded data, is required by convention. See Listing 1: Implementing the ISerializable interface.

using System;
using System.Runtime.Serialization;

[Serializable] class Circle : ISerializable
{
  // Deserialization constructor: found by the Formatter via reflection, so it can be private
  private Circle( SerializationInfo aInfo, StreamingContext aContext )
  {
    m_center_x = aInfo.GetDouble( "m_center_x" );
    m_center_y = aInfo.GetDouble( "m_center_y" );
    m_radius   = aInfo.GetDouble( "m_radius" );
    m_area = Math.PI * m_radius * m_radius; // recomputed, never saved
  }
  
  // Called on save: only the entries added here end up in the Stream
  public void GetObjectData( SerializationInfo aInfo, StreamingContext aContext )
  {
     aInfo.AddValue( "m_center_x", m_center_x );
     aInfo.AddValue( "m_center_y", m_center_y );
     aInfo.AddValue( "m_radius"  , m_radius   );
  }
  
  double m_center_x, m_center_y, m_radius, m_area;
}
  
Listing 1

There are other low-level facilities in .Net serialization that won't be discussed in this article, like SerializationBinder objects that allow you to instruct the Formatter to map a saved type to its current counterpart, or serialization surrogates (ISerializationSurrogate objects) that you can use to serialize closed third-party types.

.Net serialization version 1.1 almost worked for us, but its weakness was that it offered two opposing extremes: total automation with no control at all with [Serializable] alone, or complete control with no automation at all with the interfaces and the helper objects. We felt like Reflection could be used to provide something in between that mixes automation and control.

After we invented our own framework, .Net 2.0 was released and .Net serialization was extended precisely to better use Reflection to give you some control without losing automation. The additions in .Net 2.0 are:

The data member attribute [OptionalField] which instructs the Formatter not to throw if this member is missing in a saved file.

And the method attributes [OnDeserialized], [OnDeserializing], [OnSerialized] and [OnSerializing], which let you hook into the four stages of the process and change your object's state if necessary.
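As a rough illustration of the two additions listed above, the earlier Circle could be written along these lines. The member m_label is a hypothetical field assumed to have been added after the first files were saved; it is not part of the original example.

```csharp
using System;
using System.Runtime.Serialization;

[Serializable]
public class Circle
{
    public double m_center_x, m_center_y, m_radius;

    // Hypothetical member added in a later version of the class:
    // older files lack it, and [OptionalField] tells the Formatter not to throw.
    [OptionalField] public string m_label;

    // Cached value, recomputed after load instead of being saved.
    [NonSerialized] public double m_area;

    // Hook invoked by the Formatter after this object has been deserialized.
    [OnDeserialized]
    internal void OnDeserialized(StreamingContext aContext)
    {
        if (m_label == null) m_label = "unnamed"; // default for old files
        m_area = Math.PI * m_radius * m_radius;
    }
}
```

The hook keeps the automatic field handling of [Serializable] while still giving the class a chance to repair its own state.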

However, we believe that the concepts and mechanisms developed in our framework are worth describing even with the .Net 2.0 Serialization extensions available.

The main reason why we needed to implement ISerializable was to control which data members to serialize in order to reduce file size. Eventually we realized that the .Net serializer was using Reflection to detect [Serializable] classes and to automatically read the object fields, save them, then read and set them back to a newly created object. But reflection can be used even further to mark, in the code, which data members to serialize, so we created CODEF/CML as a replacement for .Net serialization.

All objects have a value (or state if you like), and two objects which are not equal are equivalent if they have the same value (or state).

Serialization can be viewed as the process of transferring the value of an object into another, where transferring here necessarily involves getting the value of an object, storing it into a medium, then extracting the value out of the medium, and setting the value into the receiving object (via initialization or assignment).

Under this view, copy-construction and assignment are not forms of serialization because the value is transferred directly and not indirectly through an external medium. On the other hand, saving/loading objects to a file, transmitting them across a boundary, and even cloning an object indirectly by stepping through an external medium are all forms of serialization.

From that characterization, serialization can be considered as the composition of two layered processes. On the bottom layer there is the process of getting the value out of one object and setting the value into another object. On the top layer there is the process of storing the value of an object into an external medium and extracting that value back out of the medium. Such a decomposition is useful because it decouples the get/set step (bottom layer) from the store/extract step (top layer), allowing the top layer to be provided by different agents, like one storing/extracting values to and from a file and another transmitting/receiving values over a channel.

This decomposition implies that values are themselves objects, so the bottom layer can be seen as a metadata codec, as it encodes and decodes the value of an object into metadata about it. The top layer can itself be seen as another codec, as it encodes and decodes the metadata about the value of an object into some arbitrary specific code (a domain-specific XML for example).

You can see that these layers are implicitly present in many existing serialization frameworks. For example, in .Net the SerializationInfo object is that metadata, and it is used by different "top layers" such as the BinaryFormatter or the SoapFormatter.

When designing our framework I decided to formalize and even name these two layers:

The bottom layer is called CODEF, which stands for Compact Object Description Framework.

CODEF uses two separate objects as metadata: descriptors and models. In conjunction, they codify the value of an object. Thus, CODEF encodes such a value as a pair (descriptor+model) and decodes a (descriptor+model) as a value set into a new object which CODEF instantiates.

A descriptor describes the type of the object in a generalized format. It is basically a list of data member fields along with some flags and some method fields. CODEF uses reflection to create descriptors automatically.

A model describes the value of an object in a generalized format. It is basically a list of named values (it is equivalent to SerializationInfo).

CODEF uses reflection to create models automatically.

The intersection between models and descriptors is the name of the data member field. That name matches each entry in a descriptor with the corresponding entry in a model.

The top layer is called CML, which stands for Compact Markup Language.

CML encodes CODEF models as XML files, and decodes appropriate XML files back as CODEF models. We used XML as the final encoding not to interoperate with open standards, like SOAP, but to allow us to inspect saved documents in a text editor in case of versioning problems. This turned out to be very useful, as I was able, many times, to find out in a snap why some old file couldn't be loaded back with the current code. Our application compresses the CML text file using ZLib to produce small files (3 times smaller, on average, than what we had when we started). CML is similar to the XML files produced by .Net's SoapFormatter but is more compact because it doesn't follow all of the SOAP protocol (that was not our goal).

Consider the following types:

class Point { int x, y; }
class Bbox
{
  Point bottom_left;
  Point top_right;
}
class Figure
{
  Bbox bbox;
}
    

In CML this will look similar to this:

<Figure bbox.bottom_left="0,0" bbox.top_right="5,5"/>
    

Instead of this:

<Figure>
   <Bbox host="bbox">
     <Point host="bottom_left">0,0</Point>
     <Point host="top_right">5,5</Point>
   </Bbox>
</Figure>
    

If you have ever seen serialization-based XML files you are likely to be familiar with the second verbose form but not with the first compact form.

The first thing to notice here is that CML can use XML attributes instead of XML elements, even for data members (the XML attributes are those name="value" tags right after the Figure markup).

If you look closely enough, you'll notice that the CML attributes, bbox.bottom_left and bbox.top_right, placed in the context of the encoding for the value of a Figure object, refer to a data member of a data member. That is, CML can encode the value of a data member nested any level deep directly from the root object as an XML attribute of the form:

"data_member.sub_data_member.sub_sub_data_member.····=value"
    

Listing 2 shows some sample illustrative user code.

[Described] public struct Point
{
  public Point() {}
  public Point( float x_, float y_ ) { x = x_; y = y_; }
  
  [DField] float x,y;
}

[Described] public class Circle
{
  [Fixup] object OnLoad()  
  {
    perimeter = 2 * Math.PI * radius;
    area      =     Math.PI * radius * radius;
    return this;
  }
  
  [DField] [InPlace] Pen   pen;
  [DField] [InPlace] Point center;
  [DField]           float radius;
  
  double perimeter;
  double area;
  
  Circle() {} // CODEF needs a default ctor,
                     // but it can be private
}
  
Listing 2

CODEF/CML is based on Reflection, but in order to keep it simple, it doesn't attempt to analyze each and every type in the system (though it could as Reflection permits that). Instead, you need to explicitly tell the framework which types it should cover. That is the purpose of the [Described] attribute prepended to the definition of the Point and Circle types.

As I've already mentioned, the main goal of CODEF/CML is to allow you to decide which data members must be saved. Hence, only those data members explicitly marked with the attribute [DField] are modeled (thus serialized by CML).

A CODEF descriptor object contains a list of DField objects, each one in turn encapsulating a .Net reflection's FieldInfo object which essentially contains all the needed information about a data member (including methods to set and get its value).

A CODEF model object contains a list of MField objects, each one in turn encapsulating a string with the field's name and an object with the field's value. There is no type information in a model's MField, but each MField is implicitly associated with the corresponding descriptor's DField by the field name. Together, a descriptor and a model completely codify an object's value.

Each data member you mark as [DField] contributes a DField+MField pair in the encoding. Those data members which are not marked as [DField] are simply left out completely.
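A minimal sketch of how reflection can produce a descriptor and a model from marked fields. The names DFieldAttribute, CodefSketch, Describe and Encode are assumptions made for illustration; they are not the framework's actual API, and a plain dictionary stands in for the list of MField objects.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical stand-in for the framework's [DField] attribute.
[AttributeUsage(AttributeTargets.Field)]
public sealed class DFieldAttribute : Attribute { }

public static class CodefSketch
{
    const BindingFlags cInstanceFields =
        BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;

    // Descriptor: the marked fields of the type, as seen by reflection.
    public static List<FieldInfo> Describe(Type aType)
    {
        List<FieldInfo> lDescriptor = new List<FieldInfo>();
        foreach (FieldInfo lField in aType.GetFields(cInstanceFields))
            if (lField.IsDefined(typeof(DFieldAttribute), false))
                lDescriptor.Add(lField);
        return lDescriptor;
    }

    // Model: one (name, value) entry per described field, with no type information.
    public static Dictionary<string, object> Encode(object aObject)
    {
        Dictionary<string, object> lModel = new Dictionary<string, object>();
        foreach (FieldInfo lField in Describe(aObject.GetType()))
            lModel[lField.Name] = lField.GetValue(aObject);
        return lModel;
    }
}

public class Circle
{
    [DField] public float radius;
    public double area; // cached, not marked, so never modeled
}
```

Encoding a Circle this way yields a single entry named radius; the unmarked area field never reaches the model.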

When CODEF needs to set the decoded value into a new object it uses the default constructor to instantiate the object and then it sets each DField automatically via reflection. Since the unsaved fields may depend on the saved fields (they usually do), CODEF calls the method marked with the attribute [Fixup], if any, whose job is to recompute the unsaved dependent data members. This method can be private and can be named any way you like (because CODEF detects the method by its attribute, not by name).

The attribute [InPlace] is parsed by CODEF and merely becomes a flag in the corresponding DField and MField. CML then interprets that flag as indicating that the value must be encoded as an XML Attribute of the parent Element.

Serialization, as a process, can be considered as the transfer of values from a sender object to a receiving object, with a dimension of time or space in between. Requiring the class definitions of the sender and receiver objects to match exactly is a desirable but largely unfeasible goal: I've never had the luxury of working on a system for which serialization facilities were added after the object model for the system was completely finished. In practice, you start serializing objects whose structure keeps changing after the first files are saved. Even if initially you simply do not support old files, sooner or later early end users, such as testers, begin to save files with objects still under development.

Versioning is the term used to refer to all the synchronization required to match the design subspace of the sender with the design subspace of the receiver. I speak of design subspace instead of class definitions because in some extreme scenarios the two subspaces might contain totally unrelated classes.

To my knowledge, the only systems that are capable of totally matching completely unconnected design subspaces are those which communicate via a high level generalized code. The best example that comes to mind is HTML and all its derivatives, from XML to SVG.

In classical versioning, problems appear when you start changing serializable classes but you need to read files saved with the old definitions. The simplistic solution is to populate the design space with the history of changes: that is, you never really change class A, instead, you create a new class B to replace it. Although this makes versioning a complete non-issue, it is, like most simplistic solutions, totally useless: imagine the design space after years of changes in a system with 200 (active) classes.

Class definitions continuing to change after 2, 5 or 10 years is not at all uncommon. In fact, it's called refactoring and is the best thing that can happen to old source code.

In versioning there are 3 archetypal scenarios of increasing complexity:

The first scenario is when you delete or add serializable data members to a class. This can be handled easily in most serialization frameworks: Deleted data members are extracted back but just left unset (because they are no longer in the class), and new data members are simply not extracted at all and you need to explicitly give them a sensible default.

In the .Net framework, if you use the automatic approach (simply marking a class as [Serializable]) you'll get an exception whenever the current class definition contains data members that were not saved, but in the low level approach you can simply set the new data members to a default value in the serialization constructor (the one taking a SerializationInfo as a parameter).

Traditionally, a version number is saved along with every object so you can know, on load, which class definition was used when that data was saved. Unfortunately, version numbers are extremely error prone since it is totally up to you to relate a number to a particular historical class definition. Using version numbers successfully requires an uncommon discipline as you need to keep proper record of the definitions for each number, which in practice means a lot of side work whenever you change a class.

Using the low-level .Net serialization approach you can add the version number to SerializationInfo even if that is not really part of the object.

Alternatively, using .Net serialization, you could also enumerate each entry in the SerializationInfo and match that, programmatically via reflection, with the actual data members in the class, setting only matching members.

The second scenario is when the type of a data member changes, or the name of a type changes. This typically breaks most serialization frameworks, like .Net serialization, because the type of each value is saved so that the loader can read the value back (even if the static type of a value is generic, like "object", its dynamic type must be concrete and the loader needs to know which it is).

The third scenario is when the design space changes radically (entire class hierarchies are replaced with new ones). The best and possibly only solution here is to keep the old classes around, read them, and make all the necessary conversions.

There is a fourth scenario that is actually outside the domain of any serialization framework but which is related to serialization nevertheless: when serialized objects hold non-serializable objects. A non-serializable object could be a Bitmap, or a Font, or some opaque third-party type whose state is hidden from the application. In these cases you cannot, or would not, save the actual object's state, so instead you save something, like a file name, or a string concatenating a Font Family name and Style, that, in your application, refers to the object. On load, typically as a global postprocessing stage after all the objects have been read back, you set the actual object within its parent, locating it using the saved reference.

I've described how CODEF uses both models and descriptors, and you might have asked why two separate objects with an implicit correspondence and not just one, like SerializationInfo?

Simple: to simplify some versioning issues. How? Because descriptors are always current. That is, when you load a class, both CML and CODEF use the descriptor of the current definition of the class. Unlike any other serialization framework I've ever seen, the loader is not tied to the potentially outdated description of a class that is stored in a saved file.

Since descriptors are always current, CODEF knows when a saved MField (from the saved model) no longer matches a current DField (because a data member was removed) so it just ignores it. It also knows when there are unmatched DFields, that is, new data members. In this case though our current implementation simply assumes that the default constructor gives ALL fields a sensible default, so it also just ignores unsaved new data members.
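The matching rule described above can be sketched as follows. This is only an illustration under stated assumptions: DFieldAttribute, FixupAttribute and DecodeSketch are hypothetical stand-ins, and a dictionary plays the role of the saved model.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical stand-ins for the framework's attributes.
[AttributeUsage(AttributeTargets.Field)]
public sealed class DFieldAttribute : Attribute { }
[AttributeUsage(AttributeTargets.Method)]
public sealed class FixupAttribute : Attribute { }

public static class DecodeSketch
{
    const BindingFlags cMembers =
        BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;

    public static object Decode(Type aType, IDictionary<string, object> aSavedModel)
    {
        // The default constructor (possibly private) supplies defaults for every field.
        object lObject = Activator.CreateInstance(aType, true);

        foreach (FieldInfo lField in aType.GetFields(cMembers))
        {
            if (!lField.IsDefined(typeof(DFieldAttribute), false)) continue;
            object lValue;
            // New data member with no saved entry: keep the constructor's default.
            if (!aSavedModel.TryGetValue(lField.Name, out lValue)) continue;
            lField.SetValue(lObject, lValue);
        }
        // Saved entries naming fields that no longer exist are never looked up,
        // so they are simply ignored.

        // Recompute the unsaved dependent members; the returned object is ignored here.
        foreach (MethodInfo lMethod in aType.GetMethods(cMembers))
            if (lMethod.IsDefined(typeof(FixupAttribute), false))
                lMethod.Invoke(lObject, null);
        return lObject;
    }
}

public class Circle
{
    [DField] public float radius;
    [DField] public string label = "unnamed"; // added after old files were saved
    public double area;

    [Fixup] object OnLoad() { area = Math.PI * radius * radius; return this; }
    Circle() { }
}
```

Decoding an old model that holds radius plus a removed entry leaves label at its constructor default and drops the stale entry without any version number being involved.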

CODEF always calls the default constructor to instantiate a new receiving object before the saved fields are set. This is suboptimal, yes, because saved fields are first initialized with a default value and then assigned their actual values. We just didn't consider this issue critical enough to complicate the design, especially considering that managed objects, unlike unmanaged C++ objects, use a memory model in which all data members are initialized, either to zero or to the default value given in the member definition, before any constructor is called (thus you just cannot use a special constructor that does nothing, as you could in unmanaged (pure) C++). [I do not know if the compiler optimizes away the default initialization of data members which are explicitly assigned in the constructor, but I guess not.]

Recall the figure example:

<Figure bbox.bottom_left="0,0" bbox.top_right="5,5"/>
    

If you look even closer than before you'll notice that there is only one type there: Figure.

When CML parses that line back it knows it has to produce a model for an object of type Figure, so it uses CODEF to get a descriptor of Figure. This descriptor is always up-to-date with the current definition of Figure. That Figure descriptor tells CML that, currently, a Figure has a field named bbox of type Bbox. Similarly, CML gets a current descriptor for Bbox so it knows that a Bbox has two fields named bottom_left and top_right of type Point. As you can see, it doesn't matter at all which type bbox and its own members had when this was saved.
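The lookup described above can be sketched with plain reflection. This is an illustration only: ResolveDeclaredType is a hypothetical helper, whereas the framework itself goes through its descriptors (which wrap FieldInfo objects).

```csharp
using System;
using System.Reflection;

public class Point  { public int x, y; }
public class Bbox   { public Point bottom_left; public Point top_right; }
public class Figure { public Bbox bbox; }

public static class PathSketch
{
    // Walks a dotted attribute name through the *current* class definitions
    // and returns the declared type of the last field, or null if some name
    // no longer matches any field.
    public static Type ResolveDeclaredType(Type aRoot, string aPath)
    {
        Type lType = aRoot;
        foreach (string lName in aPath.Split('.'))
        {
            FieldInfo lField = lType.GetField(lName,
                BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
            if (lField == null) return null;
            lType = lField.FieldType;
        }
        return lType;
    }
}
```

Resolving "bbox.bottom_left" from Figure yields Point without any type name having been stored in the file.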

Normally, as in .Net serialization and every serialization framework I've ever seen, the saved data explicitly encodes the concrete type of the value being saved to allow the loader to regenerate the object that corresponds to that value. This introduces what I call an early type binding: by the time you get the value from the loader it is already of a concrete type that is defined by the saved data instead of the variable that is receiving it. However, since the type of the saved value must be, necessarily, constrained by the declared type of the variable that will receive the value on load, such early type binding can be worked around, in some cases, using descriptors, as the loader knows the current declared type of the receiving variable and can use it to regenerate the object.

Suppose you have the following struct:

[Described] struct Point
{
  public Point() {}
  public Point( float x_, float y_)
              { x = x_; y = y_; }
  [DField] float x,y;
} 
    

The data member fields x and y are not polymorphic so the objects reconstructed by the loader must be of type float, and CODEF knows that because the current descriptor says so. Consequently, there is no need at all to include the type in the serialized data. This not only saves space, which is significant by itself, but it also allows you to change the type of x,y provided that the encoded values of x,y (strings in the case of CML) can be decoded back into the new type. For example, you can change float to double and it just works, without any extra work on your part.
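A small sketch of why the float to double change needs no extra work when the declared type drives decoding. The helper SetFromText and the two Point versions are hypothetical, and Convert.ChangeType is used here as an assumption in place of whatever conversion the framework really performs.

```csharp
using System;
using System.Globalization;
using System.Reflection;

public static class DeclaredTypeSketch
{
    // Decodes a saved string into whatever type the field is declared as today.
    public static void SetFromText(object aTarget, string aFieldName, string aText)
    {
        FieldInfo lField = aTarget.GetType().GetField(aFieldName,
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
        object lValue = Convert.ChangeType(aText, lField.FieldType,
                                           CultureInfo.InvariantCulture);
        lField.SetValue(aTarget, lValue);
    }
}

// Classes rather than structs, so SetValue acts on the object itself
// and not on a boxed copy.
public class PointV1 { public float  x, y; }   // definition when the file was saved
public class PointV2 { public double x, y; }   // definition after the change
```

The same saved text "1.5" loads into either version, because the type is taken from the receiving field rather than from the file.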

You might be thinking that you can also change float to double using the .Net serializer since you can simply convert the float read back to a double at the point where the value is assigned to the data member. That's correct and you can always use a conversion to handle type changes, but using the currently declared type of the receiving variable might skip the conversion altogether (as in the float->double case above).

Unfortunately, the declared type of the receiving variable cannot always be used to reconstruct the saved object so CODEF cannot always omit the type in the saved data. One case is when the declared type is explicitly or implicitly just object (implicitly is the case of a container like ArrayList). Another case is when the declared type is polymorphic: when the declared type of the variable is Base, but the concrete type of the object held by the variable is Derived.

CODEF/CML fully understands only [Described] structs/classes (but this is by design and not an inherent impossibility since via Reflection you can create a descriptor/model for any type in the system). Non-described types are classified in 3 groups: containers, primitive types and everything else. CODEF/CML needs to detect containers because it has to encode the values of the contained objects differently than it does for data-members (there is no field "name" for instance).

Primitive types are detected as such because if the type is declared in a data member field (that is, the primitive value is not stored in a container or boxed in an object) CML does not need to encode this type (it is left implicit in the CML file). For everything else, for which CODEF has nothing to say, CML has no choice but to encode the concrete type, even if it is not polymorphic.

Values of a non-described type are atomic from the CML point of view (CML cannot access their structure without a descriptor for them). In a CML file, atomic values are rendered as a single XML text.

The process of converting an arbitrary value to and from a string is far from trivial. In fact, the whole serialization framework can be seen as doing just that. I call that process textualization: a value can be textualized, that is, encoded as a string; and can be detextualized, that is, parsed back from a string. Textualization is not exactly the same as conversion to/from string. The difference is that textualization requires the conversion to be round trip: that is, detextualize(textualize(val))==val must hold for any value val. This requirement is often not fulfilled by string conversion functions.
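To make the round-trip requirement concrete, here is a small sketch for a double. The class and method names are hypothetical; the "R" format specifier and the invariant culture are real .Net features, chosen here as one possible way to satisfy the requirement.

```csharp
using System;
using System.Globalization;

public static class RoundTripSketch
{
    // Textualization: "R" asks for a string that parses back to the same value.
    public static string Textualize(double aValue)
    {
        return aValue.ToString("R", CultureInfo.InvariantCulture);
    }

    public static double Detextualize(string aText)
    {
        return double.Parse(aText, CultureInfo.InvariantCulture);
    }

    // An ordinary string conversion: fine for display, but it drops digits.
    public static string ToDisplayString(double aValue)
    {
        return aValue.ToString("F2", CultureInfo.InvariantCulture);
    }
}
```

For a value such as 0.1 + 0.2, parsing the display string gives back a different double, while the Textualize/Detextualize pair returns the original value.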

The fundamental problem of textualization in CML is that it needs to textualize values of arbitrary types, including those it knows nothing about (though it could using reflection). For that reason, CML doesn't handle that at all. Instead, it uses a special singleton object called Textualizer, which can be seen as a side-product of the framework.

The textualizer knows how to textualize values of primitive type (it uses .Net XmlConvert for that). For other types you can either implement the interface ITextualized (Listing 3) or register, non-intrusively, an ITextualizedSurrogate.

interface ITextualized { string Textualize(); }
public struct Pen : ITextualized
{
  public Pen( uint color_, int width_ )
  {
    color = color_;
    width = width_;
  }
 
  public string Textualize() 
  { 
    return color.ToString() + "," + width.ToString();
  }
 
  // Static counterpart of Textualize(): parses "color,width" back into a Pen
  static public object Detextualize( string aTextual )
  {
    string[] lTokens = aTextual.Split(',');
    uint lColor = uint.Parse(lTokens[0]);
    int  lWidth = int.Parse(lTokens[1]);
    return new Pen(lColor,lWidth);
  }
  uint color;
  int  width;
}
  
Listing 3

In CML, values of type Pen are rendered as a single string, which is even more compact than using [Described] (that's why this option exists).

If the type is third-party you must implement the textualization agent as a separate class and register it with the Textualizer (Listing 4):

public interface ITextualizedSurrogate
{
  string Textualize( object aO );
  object Detextualize ( string aTextual );
}
public class Color_TextualizedSurrogate :
 ITextualizedSurrogate
{
  public string Textualize( object aO ) 
  { 
    return TextualizeColor((Color)aO); 
  }
  public object Detextualize ( string aTextual ) 
  {
    return DetextualizeColor(aTextual); 
  }
  static public string TextualizeColor(
     Color aColor ) 
  {
    uint lColorValue =  ( (uint)aColor.A << 24 )
                       +( (uint)aColor.B << 16 )
                       +( (uint)aColor.G << 8  )
                       +( (uint)aColor.R );
    return XmlConvert.ToString(lColorValue);
  }
  static public Color DetextualizeColor (
     string aTextual )
  {
    uint lColorValue =
       XmlConvert.ToUInt32(aTextual);
    byte lA = 
      (byte)(( lColorValue & 0xFF000000 ) >> 24 );
    byte lB =
      (byte)(( lColorValue & 0x00FF0000 ) >> 16 );
    byte lG = 
      (byte)(( lColorValue & 0x0000FF00 ) >> 8  );
    byte lR = 
      (byte)(( lColorValue & 0x000000FF ) );
    return Color.FromArgb(lA,lR,lG,lB);
  }
}
Textualizer.RegisterSurrogate(typeof(Color), 
   new Color_TextualizedSurrogate() );
  
Listing 4

Except in the case of implicit typing of unboxed primitive types, CML needs to encode a Type as a string and get a Type back from its string ID. This is similar to the textualization problem except that the object that needs to be recreated from a string is a Type.

Given an object t of type Type, t.FullName is a string encoding that is guaranteed to fulfill the round-trip requirement when used as an argument to its counterpart method: Type.GetType(string).

This should be enough; but it isn't, because Type.GetType() returns null if the Type is in a different Assembly (DLL) than the one calling that Type.GetType().

To get back a type from its FullName encoding you need to search for it in all the Assemblies of your application.

Again, CML itself doesn't handle this but instead it relies on a TypeMap to do that.

A TypeMap is anything implementing the following interface:

public interface ITypeMap
{
  Type GetType( string aID );
}
    

When you call CML.Read() to load a file you must pass some ITypeMap to it.

TypeMaps are chained and the GetType() request is passed down the chain until one of them returns a non-null Type. The current framework implementation comes with an ExplicitTypeMap, a SystemTypeMap which merely returns Type.GetType(aID), and an AssemblyTypeMap which searches for the type in the entire system in case none of the other maps find it.

The ExplicitTypeMap, normally the first in the chain, is there to help with a sort of versioning issue which is typically a huge problem when it shouldn't be: type renaming. If the saved data speaks of type "animal" but the current class name is "Animal", you're in trouble even if that's the only thing that changed. But what if you could tell CML that "animal" is now called "Animal"? Well, you can, by registering with an ExplicitTypeMap the Type that corresponds to a given string ID.
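As a minimal sketch of how such a chain might be put together (not the actual CODEF/CML source): only ITypeMap and the three map names come from the text above. The ChainedTypeMap class, the Register method and all constructor shapes are assumptions made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

public interface ITypeMap
{
  Type GetType( string aID );
}

// Assumed shape: maps explicitly registered IDs (old names, say) to Types.
public class ExplicitTypeMap : ITypeMap
{
  private readonly Dictionary<string, Type> mTypes =
    new Dictionary<string, Type>();

  // Hypothetical registration method.
  public void Register( string aID, Type aType ) { mTypes[aID] = aType; }

  public Type GetType( string aID )
  {
    Type lType;
    return mTypes.TryGetValue(aID, out lType) ? lType : null;
  }
}

// Only sees the calling assembly and the core library.
public class SystemTypeMap : ITypeMap
{
  public Type GetType( string aID ) { return Type.GetType(aID); }
}

// Last resort: asks every assembly currently loaded in the AppDomain.
public class AssemblyTypeMap : ITypeMap
{
  public Type GetType( string aID )
  {
    foreach ( Assembly lAssembly in AppDomain.CurrentDomain.GetAssemblies() )
    {
      Type lType = lAssembly.GetType(aID);
      if ( lType != null )
        return lType;
    }
    return null;
  }
}

// Hypothetical chain: the first map returning a non-null Type wins.
public class ChainedTypeMap : ITypeMap
{
  private readonly ITypeMap[] mMaps;

  public ChainedTypeMap( params ITypeMap[] aMaps ) { mMaps = aMaps; }

  public Type GetType( string aID )
  {
    foreach ( ITypeMap lMap in mMaps )
    {
      Type lType = lMap.GetType(aID);
      if ( lType != null )
        return lType;
    }
    return null;
  }
}
```

With the ExplicitTypeMap placed first, a renamed type is resolved before the other maps are even asked, so old files that still say "animal" keep loading.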

CML Encoding

The job of CODEF is to encode and decode types and values in a generalized form: descriptors and models. But that's just half the story. The job of CML is to encode and decode types and objects, using CODEF descriptors and models when it can, into an XML-like text file.

CML encodes objects based on the following rules which apply recursively to each distinguishable subobject.

If the object was already rendered as XML, then it has an Instance ID (administered by CML) and is encoded as an XML Element: <HostField href="#instanceID" /> or an XML Attribute: HostField="#instanceID". HostField is the name of the corresponding data member field on the parent object (if any; items in a container, for instance, have no host field).

The choice between an XML Element or Attribute is given by the InPlace flag of the field (controlled via the [InPlace] attribute).

If the object is modeled, which means that its dynamic type is described and CML can create a model for it via CODEF, it is encoded as an XML Element: <ConcreteTypeName id="#instanceID" host="HostName"> along with XML Attributes or XML Elements corresponding to each MField (each data member).

If the object is unmodeled but is a container, it is encoded as an XML Element: <ConcreteTypeName id="#instanceID" host="HostName"> along with XML Elements for each item in the container.

If the object is unmodeled and its declared type is primitive (and not object), it is encoded as an XML Attribute: HostField="textualized-value"

If the object is unmodeled but its declared type is object or non-primitive, it is encoded as an XML Element: <ConcreteTypeName id="#instanceID" host="HostField">textualized-value</ConcreteTypeName>.

CML Decoding

Upon decoding, CML must regenerate objects, and for each one it needs to get to its type first. If the CML encoding includes the type, as is always the case except for primitive unboxed fields, CML uses the TypeMap it receives to get the Type of the saved value. If the type is implicit, CML uses the HostField to look up the corresponding DField in the descriptor of the parent object to get to the needed type.

If the object to be regenerated is modeled (that is, its type is described), CML decodes the XML Element creating a model object out of it and passes that to CODEF to complete the regeneration. If the object is not modeled, CML decodes the XML Element or Attribute using the singleton Textualizer to detextualize the string which is encoding the value into the resulting object.

Each regenerated object of reference-type (instead of value-type), which always has an instance ID, is kept in a dictionary with its ID as key. Thus, if the XML Element or Attribute is a reference to an instance ID, the object is just taken from the dictionary.
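The bookkeeping described here can be sketched as follows. InstanceTable and its members are hypothetical names, and I'm assuming the instance ID is used verbatim as the dictionary key.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers every regenerated reference-type object
// under its instance ID, so later references resolve to the same object.
public class InstanceTable
{
  private readonly Dictionary<string, object> mInstances =
    new Dictionary<string, object>();

  // Called once, when the object is first regenerated.
  public void Add( string aInstanceID, object aObject )
  {
    mInstances.Add(aInstanceID, aObject);
  }

  // Called when an Element or Attribute merely refers to an instance ID.
  public object Resolve( string aInstanceID )
  {
    object lObject;
    if ( !mInstances.TryGetValue(aInstanceID, out lObject) )
      throw new KeyNotFoundException("Unknown instance ID: " + aInstanceID);
    return lObject;
  }
}
```

Because Resolve hands back the stored reference rather than a copy, two fields that pointed to the same object when saved point to the same object again after loading.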

CML uses the HostName to look up a matching DField in the descriptor of a parent class. If there is no such DField the value is just ignored (as it normally corresponds to a data member deleted from the class), unless the struct/class contains the following special method:

[SetField] void SetField ( Type aType, string aName,
                              object aValue );
    

which is then called for any mismatching value.

If there is a DField for a particular MField (that is, the data member still exists) but the concrete type of the data member object (as encoded in the CML) is not a subtype of the declared type of the DField, CODEF throws a TypeMismatchException. The exception is when the class has a SetField method: in that case CODEF just calls it, passing the saved type as the aType parameter and letting that method take care of the conversion.
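To make this concrete, here is a small sketch of a class that uses SetField to pick up a value saved under an old field name and type. The Stroke class and its fields are invented for the example, and SetFieldAttribute is a stand-in declared locally so that the snippet compiles on its own. The real attribute lives in CODEF.

```csharp
using System;

// Stand-in for CODEF's attribute (assumption: it is a plain marker).
[AttributeUsage(AttributeTargets.Method)]
public class SetFieldAttribute : Attribute {}

// Hypothetical class: older files saved an int field named "Width",
// which has since been replaced by the double field "Thickness".
public class Stroke
{
  public double Thickness = 1.0;

  [SetField] void SetField( Type aType, string aName, object aValue )
  {
    // Called for saved values that match no current data member.
    if ( aName == "Width" )
      Thickness = Convert.ToDouble(aValue);
    // Anything else is silently dropped, as CML would do by itself.
  }
}
```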

You can tell CODEF to call SetField directly, without testing whether the regenerated data member object is of the right type, by marking the data member as [ManualSet]. By itself this isn't very useful, but it is when you also mark the data member as [ManualGet]. ManualGet tells CODEF to step aside and not encode the data member value in any way (as a model, for instance). Instead, CODEF calls the following special method:

[GetField] object GetField ( string aName,
                                object aValue );
    

and lets you encode the object that CML will see and recreate.

The attributes [ManualSet] and [ManualGet], when used together, can be abbreviated to simply [Manual]. Manual fields are useful for objects that just can't be serialized via their own data members, or for third-party objects which can't be serialized by textualization (CML will just textualize unmodeled objects).
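A sketch of a [Manual] field might look like the following. The three attribute classes are local stand-ins for the CODEF ones, Connection is an invented class, and System.Net.IPAddress merely plays the role of a third-party type. How exactly CODEF invokes the two methods is an assumption based on the signatures shown above.

```csharp
using System;
using System.Net;

// Stand-ins for the CODEF attributes so the sketch compiles by itself.
[AttributeUsage(AttributeTargets.Field)]
public class ManualAttribute : Attribute {}
[AttributeUsage(AttributeTargets.Method)]
public class GetFieldAttribute : Attribute {}
[AttributeUsage(AttributeTargets.Method)]
public class SetFieldAttribute : Attribute {}

// Hypothetical class holding a type we neither own nor want modeled.
public class Connection
{
  [Manual] public IPAddress Host = IPAddress.Loopback;

  // Saving: hand CML a plain string instead of the IPAddress object.
  [GetField] object GetField( string aName, object aValue )
  {
    if ( aName == "Host" )
      return aValue.ToString();
    return aValue;
  }

  // Loading: rebuild the IPAddress from the string CML recreated.
  [SetField] void SetField( Type aType, string aName, object aValue )
  {
    if ( aName == "Host" )
      Host = IPAddress.Parse((string)aValue);
  }
}
```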

A last but still interesting CODEF feature is the fact that the Fixup method returns an object. This is necessary because CODEF/CML automatically sets data members unless they are marked as ManualSet. The object returned by the Fixup method allows you to keep a data member automatic even if its type changed critically.

Consider the following scenario:

At some point in time you have a Collection class, with some complex structure, and lots of saved files containing it. But then, later on, you refactor the design and the Collection class is replaced by a Group class which is totally different. You have to keep the Collection class around (a stripped-down version, actually) so that CML can regenerate objects of that type when they are found in CML files. But that's not sufficient by itself: data members that used to be of type Collection are now of type Group, so you need a way to convert a Collection read from an old file into a Group before assigning it to the data member. We already saw a case of type change that was handled by implicit typing, but implicit typing applies to declared primitive types only; here you need an explicit conversion.

Using .Net serialization you would solve this by explicitly converting a Collection object extracted from a SerializationInfo to a Group object right before the assignment. In CODEF, all you need to do is to add a Fixup method to the deprecated Collection class that converts itself to a Group. That's it.

The advantage of this is that the conversion is in the Collection class itself, and is always called by CODEF right before setting any field that used to be a Collection but now is a Group. This way, you just can't forget to add the conversion in a particular parent class, as you could using the .Net framework.
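As a rough sketch of the scenario, assuming a parameterless Fixup method marked with a [Fixup] attribute: the exact signature CODEF expects may differ, FixupAttribute is a local stand-in, and the members of both classes are invented for the example.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for whatever marker CODEF uses to find the method (assumption).
[AttributeUsage(AttributeTargets.Method)]
public class FixupAttribute : Attribute {}

// The new design.
public class Group
{
  public List<string> Members = new List<string>();
}

// Stripped-down legacy class, kept only so old files can still be read.
public class Collection
{
  public string[] Items = new string[0];

  // Returns the object that should actually be assigned to the field.
  [Fixup] object Fixup()
  {
    Group lGroup = new Group();
    lGroup.Members.AddRange(Items);
    return lGroup;
  }
}
```

The conversion is written once, next to the deprecated class, and no parent class holding a Group field needs to know that Collection ever existed.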

Future directions

CODEF/CML was developed to solve a specific problem during the lifecycle of a real application. It had concrete goals and was constrained by fixed resources (time). There are a number of improvements and open issues that become evident when you look back at the whole thing.

One of them is the fact that CODEF uses reflection, but only on Described types. The idea was to avoid overloading the system with too much reflection. But since descriptors and models are generated on demand, only for the types that are actually requested to be saved, I now wonder whether it really would be a big overhead to simply reflect on every type, so that everything becomes described and all objects modeled. In our application that is not really a problem because our design space is almost completely proprietary. We use at most 3 or 4 simple third-party structs, period; everything else comes from our own code.

But in most applications that's a very unlikely case.

The [Manual] attribute and the GetField/SetField methods used with it are intended to give you some support for third-party types that can't simply be textualized (encoded as a string). Again, that worked well for us because we use only a few third-party objects, and those are so simple that they can be textualized without trouble. A better approach would be to let you register some form of CODEF agent that allows you to manually create CODEF models (and maybe even descriptors, although these contain special Reflection types that can only be obtained via reflection).

CML needs textualization to get to and from the ultimate text representation, but it uses it for other things too. One of them is to handle types unknown to CODEF; if CODEF is extended as proposed above, this use of textualization won't be needed anymore.

Another use of textualization in CML is to force compactness. If you go back to the textualizer example, you'll notice that the Pen class could have been Described instead of Textualizable. True, but if you want some type, possibly with 2 or 3 data members, to end up in CML as a single string, textualizing it is currently the only way, and this is really an abuse of the current design. A much better approach would be to let you register with CML your own codec for a given type. This would be similar in essence to the CODEF agents proposed above, but the codec would be responsible for creating the XML elements that end up in the CML file (or part of them, since some XML parts are mandatory).
