pinShadow Data Types

Overload Journal #94 - December 2009 + Programming Topics   Author: Jon Jagger
Hiding implementation details is a good idea. Jon Jagger shows a technique in C that avoids some of the common problems.

Suppose we have a type called wibble defined as a concrete data type (that is, a type whose representation is fully visible) as shown in Listing 1.

    #include "grommet.h"  
    #include "flange.h"  
 
    typedef struct  
    {  
        grommet g;  
        flange f;  
    } wibble;  
 
    void wopen(wibble * w, int i);  
    ...  

    void wclose(wibble * w);
  
Listing 1

The definition of wibble exposes the types involved in its representation (grommet and flange in this made up example), and hence requires a #include for those types in its header file. This exposure has a price. One cost is that any change to the grommet or flange header files, or any header files they in turn #include, at any depth, will require a recompilation of any source file that #includes wibble.h (either directly or transitively). Another cost is that client code can easily become reliant on the exposed representation rather than relying solely on the functional api. Note that in C++ you can avoid this latter problem by declaring your data members private.

These costs are sufficiently high that software developers have invented techniques to hide a type's representation: turn it into an Abstract Data Type. This is simply a type that does not reveal its representation; a type that abstracts away its representation. In this article I'll look at two abstract data type implementation techniques: Opaque Data Types, and Shadow Data Types.

Opaque data types

The term Opaque Data Type is a well established term for the technique of exposing only the name of the type in its header file. This is done with a forward type declaration. This certainly has the effect of not exposing any representation!

      typedef struct wibble wibble;  
 
      wibble * wopen(int);  
      ...  
      void wclose(wibble *);  

A definite downside with this approach is that clients cannot create objects themselves.

      #include "wibble.h"  
 
      void eg(void)  
      {  
        wibble * ptr; // ok
        wibble value; // constraint violation
        ptr = malloc(sizeof *ptr);  
                      // constraint violation  
      }  

The wibble type's representation is defined in its source file and so only code in the source file can create wibble objects. Furthermore, these wibble objects have to be returned to users as pointers. These two constraints mean the created objects cannot have auto storage class. This is a great loss since auto storage class alone of the three storage class options allows the clients to decide where the objects live which can greatly improve locality of reference.

So how can wibble's source file actually create them? The first possibility is to use an object with auto storage class:

      ...  
      wibble * wopen(int value)  
      {  
        wibble opened;  
        ...  
        return &opened; // very very bad
      }  
      ...  

A second possibility is to create the objects with static storage class:

      ...  
      static wibble wstorage[42];  
      static size_t windex = 0;  
      ...  
      wibble * wopen(int value)  
      {  
        wibble * opened = wstorage[windex];  
        windex++;  
        ...  
        return opened;  
      }  
      ...  

The final possibility is to create the objects with allocated storage class. That is, to create the objects dynamically on the heap:

      ...  
      wibble * wopen(int value)  
      {  
        wibble * opened = malloc(sizeof *opened);  
        ...  
        return opened;  
      }  
      ...  

The static and the allocated approaches have opposing advantages and disadvantages. Static storage is very fast and doesn't fragment the memory but the type has to decide the maximum number of objects the application will need. That might be a dependency going in the wrong direction. Allocated storage is much slower and can create memory fragmentation issues, but the application decides how many objects it needs.

In short, the classic ADT technique creates an abstraction that is very opaque and pays a hefty price for this 'over-abstraction'. Abstracting away the representation also abstracts away the size details of a type and it is the loss of the size information that creates the storage class restrictions. The Shadow Data Type implementation technique attempts to rebalance these forces of abstraction by separating size abstraction from representation abstraction.

Shadow Data Types

The term Shadow Data Type, in contrast to Opaque Data Type, is not a well established term. The technique has probably been around for a long time, it's just that it doesn't seem to have ever been documented anywhere and so a term for it has never become established. I've chosen the term Shadow Data Type to try and convey the idea that when you shine a light on an object it casts a shadow which reveals something of the size of the object but nothing of the details of the object. In other words, a Shadow Data Type has a 'full' type declaration (rather than a forward type declaration) but one revealing only the size of type.

      typedef struct  
      {  
        unsigned char size_shadow[16];  
      } wibble;  
 
      void wopen(wibble *, int);  
      ...  
      void wclose(wibble *);  

The 'true' definition of the type (together with its accompanying #includes) moves into the source file (Listing 2).

    #include "wibble.h"  
    #include "grommet.h"  
    #include "flange.h"  
    #include <string.h>   
    typedef struct  
    {  
        grommet g;  
        flange f;  
    } wibble_rep;  
    // sizeof(wibble) >= sizeof(wibble_rep)  
    void wopen(wibble * w, int value)  
    {  
        wibble_rep rep;  
        ...  
        memcpy(w, &rep, sizeof rep);  
    }  
    ...
Listing 2

However, there are two problems needing attention.

Synchronized alignment?

Firstly, there is no guarantee the two types (wibble and wibble_rep) are alignment compatible. We can solve this problem. The trick is to create a union containing all the basic types. We don't know which basic types have the strictest alignments but if the union contains them all the union must also have the strictest alignment.

      typedef union  
      {  
        // one of each of all the basic types go here  
        // including data pointers and function pointers  
      } alignment;  

We redefine wibble to be a union with two members; one member to take care of the memory footprint and one member to take care of the alignment:

      #include "alignment.h"  
 
      typedef union  
      {  
        unsigned char size_shadow[16];  
        alignment universal;  
      } wibble;  
      ...  

The main problem with wibble being a union is that unions are rare. Suppose you want to forward declare the wibble type in a header. You're quite likely to forget it's a union.

      typedef struct wibble wibble; // Oooops

We can fix this by simply putting the union inside a struct!

      #include "alignment.h"  
 
      typedef struct  
      {  
        union  
        {  
          unsigned char size[16];  
          alignment universal;  
        } shadow;  
      } wibble;  
      ...  

This is now sufficiently tricky to warrant an abstraction of its own (Listing 3).

    #ifndef SHADOW_TYPE_INCLUDED  
    #define SHADOW_TYPE_INCLUDED  
    #include "alignment.h"  
    #define SHADOW_TYPE(name, size)  \  
      typedef struct                 \  
      {                              \  
        union                        \  
        {                            \  
          unsigned char bytes[size]; \  
          alignment universal;       \  
        } shadow;                    \  
      } name  
    #endif  
    #include "shadow_type.h"  
    SHADOW_TYPE(wibble, 16);  
Listing 3

Synchronized size?

The second problem is hinted at by the comment in wibble.c

      // sizeof(wibble) >= sizeof(wibble_rep)  

This comment, like all comments, has no teeth. Ideally we'd like an assurance that if the sizes lose synchronization we're told about it. This can be done by asserting the relationship inside a unit test of course. The problem with this is the possibility that the runtime check inside a unit-test won't get run. Or, more likely, that the unit-test simply won't get written at all. Fortunately in this case we can check the relationship using a compile time assertion. We start with the fact that you cannot declare an array of negative size:

      extern char wont_compile[-1];   
      extern char will_compile[+1];  

Now we have to select a size of either +1 or -1 if the asserted expression is true or false respectively.

      // may or may not compile  
      extern char compile_time_assert[sizeof(wibble)   
         >= sizeof(wibble_rep) ? +1 : -1];  

Hiding this mechanism behind a macro inside a dedicated header helps to make the code more Intention Revealing (Listing 4).

    #define COMPILE_TIME_ASSERT(description,        \  
       expression) extern char                      \  
       description[ (expression) ? 1 : -1 ]  
    ...  
    #include "compile_time_assert.h"  
    ...  
    COMPILE_TIME_ASSERT(  
    sizeof_wibble_is_not_less_than_sizeof_wibble_rep,  
    sizeof(wibble) >= sizeof(wibble_rep));  
    ...
Listing 4

Note that the assertion uses >= rather than ==. This allows binary compatibility with any alternative smaller representation.

It's worth spending a few moments to think about alignment carefully. The wibble type contains a union to give us the strictest alignment. This means a single wibble_rep and a single wibble can overlay each other in either direction.

If we create an array of wibbles the compiler will ensure the address of each wibble is suitably aligned. To do this it may add trailing padding to the wibble type but this padding will be reflected by sizeof(wibble). Similarly, any padding for the wibble_rep type will also be reflected by sizeof(wibble_rep).

Importantly, since sizeof(wibble_rep) may be strictly less than sizeof(wibble) we cannot overlay an array of either type directly onto an array of the other type.

However, we are only concerned with creating an array of wibbles since that is the type the client uses. There should never be any need to create an array of wibble_reps. Nevertheless, the .c file implementation must always do any array pointer arithmetic in terms of wibbles and never in terms of wibble_reps.

Note also that using >= rather than == allows binary compatibility with any alternative smaller representation.

Casting the shadow

Inside the source file we can create a helper function to overlay the true representation onto the client's memory (the fragment in Listing 5 uses the dot designator syntax introduced in c99).

    static inline void shadow(wibble * dst, wibble_rep * src)  
    {  
        memcpy(dst, src, sizeof *src);  
    }  
 
    bool wopen(wibble * w, const char * name)  
    {  
        wibble_rep rep =   
        {  
            .g = ...,  
            .f = ...,  
        };  
        shadow(w, &rep);  
        ...  
    }
Listing 5

Careful use of memcpy can help to make the wibble functions behave atomically from the users perspective. That is to say, the function can do the work off to the side in a local wibble_rep, and copy back into the shadow only if everything is successful.

An alternative to memcpy is to cast the pointer on each access (Listing 6).

    static inline wibble_rep * light(wibble * w)  
    {  
      return (wibble_rep *)w;  
    }  
    void wclose(wibble * w)  
    {  
      wibble_rep * rep = light(w);  
      rep->g = ...;  
      rep->f = ...;  
      ...  
    }
Listing 6

Constness?

It makes no sense to declare a wibble object with a const modifier unless the object can be initialized.

      void pointless(void)  
      {  
        const wibble w; // :-(
        // ... ?  
      }  

However, this is not an issue since the wibble type is opaque anyway. Nevertheless, a slight redesign can accommodate constwibble objects if desired, at the cost of copying struct objects (Listing 7).

    ...  
    wibble wopen(int value)  
    {  
      wibble_rep rep = { ...value... };  
      wibble w;  
      memcpy(&w, &rep, sizeof rep);  
      return w;  
    }  
    void ok(void)  
    {  
      const wibble w = wopen(42);  
      ...  
    }
Listing 7

Summary

In C it is impossible to expose a type's size without also exposing its representation. It is possible to explicitly specify a concrete type's representation as being 'unpublished' but since C does not offer the C++ private keyword using the representation is always possible and remains a constant temptation.. Once one piece of client code succumbs more are sure to follow and like a dam bursting the client and implementation quickly become tightly coupled and any separation is washed away.

Completely hiding a type's representation behind an opaque pointer/handle removes the temptation and creates a powerful abstraction but at the price of hiding the size of the type and the consequent restriction on the storage class of client memory.

A shadow data type offers a half-way house where a type is effectively split into two, with one part exposing the size and the other part holding the representation. The alignment and sizes of the two parts must correspond. Client code is then able to use all three storage class options. The implementation code takes the full load of the extra complexity mapping/overlaying between the split parts. One interesting observation is that the client code would be unaffected (other than needing recompilation) if the representation was moved back into the client side type (to try and improve performance perhaps).

No mechanism is universally applicable and the shadow data type is no exception! Experience and time alone will tell if and how useful it is. Caveat emptor.

Overload Journal #94 - December 2009 + Programming Topics