ACCU Home page ACCU Conference Page ACCU 2017 Conference Registration Page
Search Contact us ACCU at Flickr ACCU at GitHib ACCU at Google+ ACCU at Facebook ACCU at Linked-in ACCU at Twitter Skip Navigation

pinMini-project to Decode A Mini-language - Part Two

Overload Journal #64 - Dec 2004 + Programming Topics   Author: Thomas Guest

Part 1 of this two-part article [Guest] described the preliminary stages of a mini-project to write a codec for a mini-language, delivering:

  • a rough specification of the codec,

  • a suite of test data,

  • some prototype code,

  • three implementation strategies.

Part 2 continues the project and presents the final implementation.

Motivation

Part 1 of this article drew inspiration from "The Art of UNIX Programming", by Eric Raymond [Raymond2]. Part 2 continues to draw from this same source, which applies as readily to implementation as it did to design.

At this point, I can reveal a second motivating source, "The Tale of a Struggling Template Programmer", Stefan Heinzmann [Heinzmann], which served to remind me how frustrating software development can be: sometimes the tools are to blame, sometimes the languages appear faulty, and sometimes the poor programmer takes a wrong turn. More personally, it reminded me that I ought to experiment with modern C++[1].

Anyone familiar with both sources will appreciate there's a degree of tension between them. In what follows, I document my attempts to resolve this tension. Along the way, we shall revisit the world of MPEG video encoding and get started with the Boost Spirit library.

Project Recap

To briefly recap, then, our goal is to write a tool to convert the binary format used in MPEG-2 digital video broadcasting into a textual form and back again - to write a dvbcodec. For example, we would like to convert a section of the Program Association Table (PAT), whose syntax is as follows:

program_association_section() {
  table_id                   8
  section_syntax_indicator   1
  '0'                        1
  reserved                   2
  section_length            12
  transport_stream_id       16
  reserved                   2
  version_number             5
  current_next_indicator     1
  section_number             8
  last_section_number        8
  for(i=0; i<N; i++) {
    program_number          16
    reserved                 3
    if(program_number == '0') {
      network_PID           13
    }
    else {
      program_map_PID       13
    }
  }
  CRC_32                    32
}

[ISO/IEC 13818-1] Table 2-26 - Program association section

The numerical values here represent field widths in bits: the first byte of the section encodes the table_id, the next bit the section_syntax_indicator, and so on until the final four bytes which encode the cyclic redundancy check.

The PAT is just one of the tables we would like to decode. There are many others, the next three most important being the the Conditional Access Table, the Program Map Table and the Network Information Table (CAT, PMT and NIT).

The textual output format we decided on should reflect the syntax description as follows:

program_association_section() {
  table_id                   8 = 0x0
  section_syntax_indicator   1 = 0x1
  '0'                        1 = 0x0
  ...
  CRC_32             32 = 0xcae52d9f
}

We came up with three possible implementation strategies for our dvbcodec:

  • Implement a pat-codec. Then implement a cat-codec, then a pmt-codec, etc.

  • Implement a general codec which understands the full bitstream syntax and can use it to parse an arbitrary section format. All that then remains is to prime this codec with the required section formats.

  • Devise a code generator which, given a section format, will generate a program to encode/decode that particular format.

Towards a Solution

The first strategy holds little appeal: it risks being a recipe for cut-and-paste code and boring repetition. We reject it.

The second and third strategies look to have more going from them, particularly since we have restricted our scope to a subset of the full bitstream syntax. Although these strategies appear rather different, they both require us to parse syntax descriptions of the general form exemplified by the program_association_section.

So, we need a parser. We need one capable of handling conditionals and loops: one capable, that is, of handling a Turing-complete mini-language. Raymond [Raymond2] can advise. In general terms, he suggests:

  • Where possible, reuse. Look for a proven, documented, supported, portable, parser. (He argues these criteria pretty much imply an open source solution.)

  • Prefer scripting languages such as Python and Perl. These facilitate rapid development and are less prone to memory management bugs. You may not need the raw performance offered by C/C++, and the library support offered by these languages is often superior.

On the more specific subject of parsers, Raymond recommends lex and yacc though, in keeping with the Unix philosophy of documenting weaknesses, he admits these tools are not perfect. He also suggests:

"If you can implement your parser in a higher-level language than C (which we recommend you do ...) then look for equivalent facilities like Python's PLY ..."

I tend to agree with Raymond, but I'm not convinced PLY is the way to go here. Of course, it won't get me very far with my aim of finding out about modern C++, but it's also not part of the standard Python distribution. In fact, a web search reveals several other Python parser frameworks - it's unclear which will prevail.

The C++ standard library doesn't provide a parser either. We might make some progress tokenising our data with std::strtok or even std::sscanf, but this won't suffice. lex and yacc are of course a tried and tested solution, but I'd rather not have to learn two more mini-languages.

The next place to look is in the next best thing to the C++ standard library, namely Boost [Boost]. Three clicks from the front page takes us to the Spirit parser, which claims to be a scalable parser framework written in C++. We trust the source, the documentation is good, the examples compile: let's try some code.

Getting Started With Spirit

The code below is a complete program to recognise lines of the form:

reserved 2 = 0x3

this being the format we arrived at for fields of our text sections.

#include <boost/spirit/core.hpp>
#include <iostream>
#include <string>
using namespace boost;
/**
 * @brief Parse a string representing a field 
 * @returns True if the field matches the
 * format: <field_name> <bitwidth> = <value>,
 * false otherwise.
 */
bool parseField(std::string const & str) {
  return spirit::parse(
      str.begin(), 
      str.end(),
      spirit::lexeme_d[+spirit::graph_p]
      >> spirit::uint_p
      >> '='
      >> spirit::hex_p,
      spirit::space_p).full;
}
int main() {
  std::cout << "Enter text.\n"
        << "Lines will be matched against: \n"
        << "<field_name> <bitwidth> = "
        << <hexvalue>\n"
        << "Type 'q' to quit\n";
  std::string str;
  std::string const quit("q");
  while(std::getline(std::cin, str) &&
        str != quit) {
    std::cout << (parseField(str)
                          ? "hit" : "miss")
              << std::endl;
  }
  return 0;
}

Here, the action is concentrated in the function parseField(), which wraps a call to spirit::parse(). spirit::parse() accepts as arguments:

  • two iterators marking the start and end of the data to be parsed,

  • a parser,

  • a skip parser.

We have used spirit::space_p directly as our skip parser: this primitive parser recognizes whitespace and tells spirit::parse() which characters it should skip past in the input. A more sophisticated skip parser might be used to skip comments.

Operator Overloading

The parser itself is a sequence of sub-parsers which can be read: recognise input consisting of a block one or more printable characters, followed by an unsigned integer, followed by the equals sign, followed by a hexadecimal integer.

Operator overloading is used by Spirit to make such expressions into readable approximations of EBNF syntax descriptions (see also [Alexandrescu] for more on this technique). Here, we see that operator<<() has been overloaded as a sequencing operator, prefix operator+() has been overloaded to mean "one or more of", and operator[]() is overloaded to adapt the behaviour of a sub-parser - in this case using spirit::lexeme_d to turn off whitespace skipping.

Parser Generators

I should also mention that the '=' sub-parser is a shorthand for spirit::ch_p('='), which in turn is a parser generator returning the character literal parser spirit::chlit<char>('=').

Similarly, spirit::hex_p and spirit::uint_p are parser generator functions which return suitable specialisations of the spirit::uint_parser template struct. The full template parameters of this struct are as follows:

template<typename T = unsigned,
         int Radix = 10,
         unsigned MinDigits = 1,
         int MaxDigits = -1>
struct uint_parser { /* */ };

The helper functions hex_p and uint_p are often good enough, but it's also useful to have the full flexibility of the base parser. For example, if we need to match larger hex values, and long long is available, we could create an alternative hex parser:

uint_parser<unsigned long long, 16> const
  long_long_hex_p
    = uint_parser<unsigned long long, 16>();

In fact, the uint_parser should work with any user defined scalar type.

(You've probably noticed I'm now working in the boost::spirit namespace. I continue to do so for the remainder of this article.)

One thing I cannot do with the hex parser, unfortunately, is get it to accept the 0x we've used to prefix hex digits. We can fix the bug in our program by introducing a new parser rule.

with_base_hex_p
  =  lexeme_d
     [
       as_lower_d["0x"]
       >> hex_p
     ];

Note here:

  • the as_lower_d directive, which converts all characters from the input to lowercase, and therefore recognising both 0x and 0X.

  • the rather unusual code layout. I have tried to follow the Spirit style guide [Spirit3] when writing parser grammars. This will become increasingly more important when we develop a more substantial grammar.

  • the string literal "0x", which in this context becomes yet another parser.

Semantic Actions

Simply recognising fields is not enough: we need to act on their contents. That is, we must associate semantic actions with the sub-parsers. This can be done using another overload of operator[](), which enables us to link an action to a parser.

Here, then is an encoder which will convert text versions of sections to binary. I have omitted #include directives etc. for brevity. The full implementation is available with the source distribution for this article [Homepage].

typedef std::string::const_iterator iter;
/**
 * @brief Put the input value to the output
 *  stream using the specified bitwidth
 */
void putBits(std::ostream &, unsigned w,
             unsigned v) { /* */ }
/**
 * @brief Parse a field of the form:
 *  <field_name> <bitwidth> = <value>
 */
bool parseField(iter const & begin,
                iter const & end,
                unsigned & bitwidth,
                unsigned & value) {
  return parse(
      begin, 
      end,
      lexeme_d[+graph_p]
      >> uint_p[assign_a(bitwidth)]
      >> '='
      >> lexeme_d
         [
            ! as_lower_d["0x"]
            >>  hex_p[assign_a(value)]
         ]
      ,
      space_p).full;
int main() {
  std::string str;
  int line = 0;
  try {
    while(std::getline(std::cin, str)) {
      ++line;
      unsigned bitwidth, value;
      if(parseField(str.begin(), str.end(),
                    bitwidth, value)) {
        putBits(std::cout, bitwidth, value);
      }
    }
  }
  catch(std::exception & exc) {
    std::cerr << "Error parsing line " << line 
              << "\n" << str << "\n"
              << exc.what() << std::endl;
    return -1;
  }
  return 0;
}

Note here:

  • I have used typedefs for the iterators passed into the parser. This will ease switching to another forward iterator type, if required.

  • I decided to make the 0x preceding hexadecimal values optional, using Spirit's overload of operator!()

  • The use of the assign_a actor for our semantic action. We could have used any function accepting an unsigned integer or any functor providing operator()(unsigned int). Again, it's simpler to use one of Sprit's off-the-peg actors.

  • The program implements a classic Unix filter. This makes it suitable for use in a Unix pipeline. See [Raymond2] for more on this. Unfortunately, I'm not sure this is a great idea for binary output: I haven't found a portable way to reset std::cout for binary.

Exceptions in Parsers

Another important point to note about our simple encoder is the way it handles failure conditions using C++ exceptions rather than C-style error codes. There are plenty of failure conditions to handle: a value might not fit in the available bitwidth, the output stream might not be in a suitable state, and so on.

In this simple parser we might equally well have passed error codes around, but a more complex parser is likely to involve recursion and/or nested function calls. Exceptions perform well in both the simple and the complex case, offering a scalable solution.

The Spirit parser framework itself uses exceptions internally for similar reasons. To quote the documentation:

"C++'s exception handling mechanism is a perfect match for Spirit due to its highly recursive functional nature. C++ Exceptions are used extensively by this module for handling errors."

Like our program, Spirit should not leak any such exceptions to its users.

Weaknesses

The simple encoder presented above follows Postel's prescription, to a degree [Postel]. It doesn't mind too much about whitespace; it allows any sequence of printable characters as a field name; and it isn't fussy about the presentation of hexadecimal numbers.

Its main flaw is that it does not look at the text format of our sections as a whole: it simply skips the lines which close blocks or start loops, for example. This means the encoding will quietly do the wrong thing given input where a new-line has gone missing, or where the data has been truncated. This is dangerous. It also means the encoder cannot check the integrity of our text data - for example, to confirm the section_length field contains the actual section length, or to validate a CRC.

When we start thinking along these lines, we realise that perhaps the encoder should calculate the values of these fields for us. We'll need a CRC generator anyway - why not embed it in the encoder?

These are important points. However, we never considered data validation when we planned our codec and I'm not going to worry about it just yet - I need to get started on the decoder. Data validation, though valuable, would need to be optional since an encoder must let us generate broken data for test purposes.

Also, Raymond [Raymond2] encourages us to limit options whenever possible: if we can release code earlier then our users can tell us which options they really want. Ideally, he suggests we make the release open-source, and allow users to (submit patches which) implement these options.

Progress Review

We've used Spirit to write a micro-parser to drive the encoder. We're ready to start on the decoder. Spirit's scalability will be tested.

The Decoder

I decided to attempt the second implementation strategy: to develop a codec which understands the bitstream syntax and can use it to parse an arbitrary section format. I had no good reason for preferring this to the code-generator strategy.

As already noted, this is a parsing task. We will use Spirit to define the grammar used by the bitstream syntax. We can then parse our static program data - the section formats we're interested in - which gives us the basis we need to parse the run-time program inputs, that is, actual instances of binary encoded sections.

Grammar Definitions and Parse Trees

I do not propose to dwell on the practical use of Spirit for much longer: we've already seen enough of what it can do, so for full implementation details please refer to [Spirit2] and the codec source distribution [Homepage].

For the decoder, note that simply parsing the data once and associating semantic actions to the various lexical elements is not enough. For instance, to process descriptor loops we need to revisit the same parser node several times. Spirit provides abstract syntax trees for exactly this purpose.

I do think it is worth presenting here a portion of the section grammar. To me, this is a quite remarkable application of C++. For even more remarkable transcriptions of EBNF syntax definitions into Spirit grammars - including a C++ tokenizer and a C parser - I would recommend a visit to the Spirit Applications Repository [Spirit].

/**
 * @brief MPEG-2 Section grammar defined
 * using Boost Spirit.
 * Reference: - ISO/IEC 13818-1, MPEG-2
 *              Transport Stream
 */
struct Section :
  public boost::spirit::grammar<Section> {
  template <typename ScannerT>
  struct definition {
    definition(Section const & /*self*/) {
      section_
        =  section_ref_
           >> section_body_
        ;
      section_ref_
        =  text_id_
           >>  '('
           >>  ')'
        ;
      text_id_
        =  leaf_node_d[
             lexeme_d[
               alpha_p
               >> * (alnum_p | '_')
             ]
           ]
        ;
      quoted_binary_
        =  leaf_node_d[
             lexeme_d[
               '\''
               >>  bin_p
               >> '\''
             ]
           ]
        ;
      section_body_
        =  ch_p('{')
           >> *(    field_
                 |  loop_
                 |  conditional_
                 |  section_ref_
               )
           >> '}'
        ;
      field_
        =  identifier_
           >>   uint_p
        ;
      identifier_
        =  text_id_
           |  quoted_binary_
        ;
      conditional_
        =  str_p("if")
           >> condition_
           >> section_body_
           >>  ! (str_p("else")
                  >>  section_body_)
        ;
      condition_
        =  inner_node_d['('
             >> text_id_
             >> "=="
             >> quoted_binary_
             >> ')'
           ]
        ;
      loop_
        =  loop_control_
           >> section_body_
        ;
      loop_control_
        =  leaf_node_d[str_p("for")
             >>  '('
             >>  * (anychar_p - ')')
             >>  ')'
           ]
        ;

    }
    ...
    boost::spirit::rule<ScannerT> const &
    start() const {
       return section_;
    }
  };
};

Decisions Taken

Many of the decisions taken when writing our naive encoder scale up to the decoder: limited use options; exceptions used in preference to error codes; Spirit style guide for grammar definitions; typedefs for iterators.

Some decisions were ones we haven't yet faced. The main one was: where should we put section format definitions (for the PAT, CAT, PMT and NIT)? There are two obvious alternatives:

  • create a C++ source file containing these definitions - perhaps as an array of string literals,

  • place them in a text file in a known location, and have this file read when the decoder starts up.

The second alternative is perhaps most faithful to our original aims. Program logic and program data are nicely separated, and extending the decoder to handle other sections is a simple matter of editing the text file. No recompilation necessary.

Despite these attractions, I went for the first option - partly because it's easier to implement and partly because I didn't want to work out where to put the text file.

The other corner I cut concerns determining how and when to exit loops. The issue is fully described in the first part of this article (see the subsection headed "Complications"). My resolution was to notice that loops always exit when we've used up the number of bytes encoded in a length field - with the single exception of the outermost loop, which may end four bytes early in order to leave space for a CRC. So, the decoder maintains a stack of length fields, testing the top value after each loop iteration, popping it on loop exit. The first item to be stacked may need adjusting to allow for the four byte CRC. Again, the source distribution should make this clear.

I could find no official statement regarding what could be used as a field name in the section format definitions: inspection suggested that these names were rather similar to C identifiers, with the important addition of quoted binary values for fixed fields (e.g. the '0' which is the third named field in the program_association_section format definition).

A few more parse tree node directives might have resulted in a leaner decoder, but I wanted the syntax grammar to be as simple as possible. I am inexperienced in writing grammars and preferred a small amount of extra code in my application. The application is quite compact anyway.

Ship Happens [Alexandrescu]

Raymond [Raymond2] has lots of practical advice on how to ship a source code distribution, going down to details of file naming conventions. Some of the suggestions I have followed are:

  • choose a suitable license

  • include a README

  • set up a project homepage [Homepage]

  • include a BUGS file, listing known defects and limitations

  • include platform/compiler details

  • include self-test code

The BUGS file is a strangely satisfying thing to write, particularly if you've ever delivered software which doesn't admit to defects, let alone limitations (or indeed if you've ever used such software). In this case it is essential to document both.

Version 0.1 of the dvbcodec features the naive encoder described in the preceding - really only of use for system testing (we can check that decoding then encoding a file recreates the original file). The decoder is rather better - in fact, I've extended it beyond the original specification to handle a few more section formats: dvbcodec -l gives details.

Having done the hard work and shipped our beta release, the rest of this article is dedicated to some closing thoughts.

Is C++ the Right Tool for the Job?

My criteria for language selection were somewhat artificial. If I had been allowed a free hand I almost certainly would have been biased towards Python [Python]. However, having gone the C++ route - the modern C++ route, even - it would seem a good point to step back and review my selection.

Raymond [Raymond2] has reservations about C++, which I summarise here:

  1. By not automating memory management it fails to address C's biggest shortcoming; and backwards compatibility with C has compromised the language's design.

  2. Object oriented software design isn't all it's cracked up to be. All too often it leads to shaky object hierarchies, unnecessary abstractions and code which is hard to maintain,

  3. C++ is so complex that no one programmer can be expected to know it all,

  4. If C++ really were superior to C, it would now dominate it.

(Incidentally, as already mentioned, Raymond is not knocking C++ to promote C. His recommendation is to adopt scripted and mixed language solutions.)

To fully address all these points is outside the scope of this article. Instead I shall look at each in the context of the development of our codec:

  1. By using standard library containers - map, vector, stack, string etc - I have avoided a single direct call to operator new. If I've got my exception handling and my use of Spirit right, there should be no leaks. Regarding C compatibility, I have benefited from the C standard library in a few places. Converting from C string literals to C++ strings is convenient.

  2. The application code (as opposed to the Spirit framework code) uses only two concrete classes. I have resisted the temptation to make these generic, or to make them derive from an abstract class. The RAII class idiom is usefully employed. The Spirit framework itself has moved with the times: what was "implemented with run-time polymorphic classes" is now "a complete rewrite ... using expression templates and static polymorphism".

  3. Agreed! Spirit's fine documentation includes examples which have served as a basis for my own application. When I deviated from these examples too far the result was a barely comprehensible torrent of compiler errors. Typeless programming in a strongly typed language can be tough[2].

  4. C is far more portable. I believe our codec is standards compliant, so maybe in the long term it will be portable. However, at the moment (September 2004) the list of compilers which cope with Spirit is small. So our codec isn't really portable. Not one of the compilers I use at work could cope with this program. This reflects my experience with C++ over the years: to get the good stuff, either you wait, or you work around compiler limitations. Bear in mind too that of two types of compiler bugs - incorrect error messages, no object code; no error messages, incorrect object code - the second is far more insidious and dangerous: and the presence of the first naturally leads a programmer to suspect the existence of the second.

Despite all this, Spirit has proven itself flexible, scalable, capable of expressing grammars clearly and of writing efficient parsers without the need to step outside C++. Indeed, it could never have been done without C++. I am sure I will use the Spirit parser framework again.

Open Source

The future of Unix is Linux is open source. Raymond is passionate about software quality and argues convincingly that open source the best way to deliver the highest quality software. I do not propose to rehearse these arguments here: Raymond's writings are available both in print form and online. (See, for example [Raymond]).

What does interest me is that I cannot see how the full power of Spirit could be realised using anything other than a full source code distribution. It's all done with header files. Maybe with some reworking the implementation could be delivered in pre-built libraries, multiplied up by the various operating system, platform, version permutations - but wouldn't this make it even harder for compilers to build applications based on Spirit?

Of course, open source means more than just access to source: but it's still notable that this style of C++ favours open source distribution[3].

And Finally

I'm still not sure if it would have been better to write a code generator, to generate our codec from the section formats.

Maybe I'll try using Spirit and C++ to generate a C codec.

Credits

This article and the accompanying source code was developed using the GNU emacs integrated development environment (GNU emacs, GNU make, GCC, find, grep, etags, PCL-CVS etc), JASSPA Microemacs (for its superb text mode and binary editor), all running on the - sorry Eric, thanks Cygwin - Microsoft Windows operating system.

References

[Alexandrescu] I borrowed this section header from Andrei Alexandrescu's homepage. http://www.moderncppdesign.com/main.html. It's funny because it's true.

[Antonsen] Frank Antonsen, "Stream-Based Parsing in C++", Overload 56, August 2003

[Boost] Boost: http://www.boost.org

[Colvin] Greg Colvin, In the Spirit of C, http://www.artima.com /cppsource/spiritofc.html

[DVB] Digital Video Broadcasting (DVB); Specification for Service Information (SI) in DVB systems

[Heinzmann] S. Heinzmann, "The Tale of a Struggling Template Programmer", Overload 61, June 2004

[ISO] INFORMATION TECHNOLOGY - GENERIC CODING OF MOVING PICTURES AND ASSOCIATED AUDIO: SYSTEMS Recommendation H.222.0 ISO/IEC 13818-1

[Postel] The canonical form of Postel's prescription, according to the Jargon File (http://www.catb.org/~esr/jargon/) is: "Be liberal in what you accept, and conservative in what you send."

[Python] Python: http://www.python.org

[Raymond2] Eric S. Raymond, The Art of UNIX Programming, Addison-Wesley 0-13-142901-9

[Spirit] The Spirit Applications Repository is available at http://spirit.sourceforge.net

[Guest] Thomas Guest, "A Mini-project to Decode a Mini-language - Part One", Overload 63, October 2004



[1] My job involves writing portable C++ to run on embedded platforms. The compilers supplied for these platforms often do not support "modern" C++ features such as templates.

[2] The "Techniques" section of the Spirit documentation [Spirit2] describes an extraordinary method for obtaining an object's type: "... Try to compile. Then, the compiler will generate an obnoxious error message ... THERE YOU GO! You got its type! I just copy and paste the correct type."

Elsewhere, the Spirit documentation mentions Dave Abrahams' proposal to reuse the auto keyword for type deduced variables.

See also Colvin [Colvin] for a radical take on this issue.

[3] The case for access to source code is even stronger for the libraries built using popular scripting languages such as Python and Perl.

Overload Journal #64 - Dec 2004 + Programming Topics