
With Spirit

Overload Journal #69 - Oct 2005 + Programming Topics   Author: Tim Penhey

Spirit is an object-oriented recursive-descent parser generator framework implemented using template meta-programming techniques. Expression templates allow us to approximate the syntax of Extended Backus-Normal[sic] Form (EBNF) completely in C++. [_1]

EBNF is also known as Extended Backus-Naur Form [_2]. EBNF is a metasyntax used to formally describe a language. In this example the language is the set of possible expressions that are used to restrict SQL select statements.
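As a tiny illustration of the notation (this example is mine, not part of the application's grammar), an EBNF production for an unsigned number might be written:

```ebnf
digit  ::= '0' | '1' | '2' | '3' | '4'
         | '5' | '6' | '7' | '8' | '9'
number ::= digit+
```

Here | separates alternatives and + means "one or more"; the grammar extracted later in this article uses the same conventions.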

The sample code shown is all real code, shown with permission of the owner (a financial institution that wishes to remain anonymous). This piece of code was chosen as a "proof of concept" to show how Spirit works and how it is implemented, both to the management and to the other developers.

The application is a trading system in a bank, and the piece of code is responsible for interpreting what the user enters in a free-text field in the interface used to specify search restrictions. For example, the user may just want to search for certain instruments, or all trades in books starting with the letters B through D. The function query_parse (shown below) is the old C version that takes this free text and produces one or more "tokens" for generating the SQL where clause.

---- some header.h
/***********************************************
 * SQL  Token: consists of :
 * 1. logical operator      : and, or, like
 * 2. mathematical operator : <, >, =, <=,
 *                            >=, <>,
 * 3. value                 : the real value 
 *    - i.e. < 30, 30 is the value     
 * Before any cell string gets built into an SQL 
 * sub-clause, it'll be parsed by query_parse() 
 * into a linked-list of SQLTokens, and 
 * query_doit() will build using such SQLTokens, 
 * instead of cell strings directly.
 **********************************************/

typedef struct _SQLToken
{
  char* logic_op;
  char* math_op;
  char* value;
  struct _SQLToken *next;
} SQLToken;
---- source file.cpp
static SQLToken* query_parse (char *string)
{
  typedef enum { NEUTRAL, LOP, MOP, VALUE } 
     STATE;
  char c;
  int index = 0, blank = 0;
  SQLToken *token, *tmp=0, *head; 
  // fix compiler warning - tmp
  STATE state = NEUTRAL;

  head = sqltoken_alloc();
  token = head;

  while( ( c = string[index] ) &&
     ( c != '\n' ) ) {
    blank = 0;
    switch(state) {
    /*****************************************/
    case NEUTRAL:
      switch(c) {
        case ' ':
        case '\t':
          blank = 1;
          ++index;
          break;
        case '+':
        case '|':
          state = LOP;
          break;
        case '<':
        case '>':
        case '=':
        case '!':
        /* if ( first != 0) return NULL; */ 
        /* only the begin of string may have */
        /* no LOP first = 1; */
          state = MOP;
          break;
            
        default :
          /* return NULL; */
          state = VALUE;
          break;
       }

  /* alloc space for next SQLToken, if needed */
       if ((token == NULL) && (!blank)) {
         token = sqltoken_alloc();
         tmp->next = token;
       } 
       break;
         
/*******************************************/
     case LOP:
       switch(c) {
       case '|':
         while ((c != ' ') && (c!= '\0') && 
            (c != '"') && (c != '>') && 
            (c!= '<') && (c != '=') && 
            ( c != '!')) 
            c = string[++index];
         strcat(token->logic_op, "or");
         break;
       case '+':
         while ((c != ' ') && (c!= '\0') && 
            (c != '"') && (c != '>') &&
            (c!= '<') && (c != '=') &&
            ( c != '!')) 
            c = string[++index];
         strcat(token->logic_op, "and");
         break;
       default:
         return NULL;
       }
         
       state = NEUTRAL;
       if ((c != '"') && (c != '>') && (c!= '<') 
          && (c != '=') && ( c != '!')) 
         index++;
       break;
     /*******************************************/
     case MOP:
       switch(c) {
       case ' ':
       case '\t':
         state = VALUE;
         index++;
         break;
       case '<':
       case '>':
       case '=':
       case '!':
         strncat(token->math_op, &c, 1);
         index++;
         break;
       default:
         /* if (token->math_op == NULL) 
            return NULL; MOP missing */
         state = VALUE;
       } 
       break;

     /*******************************************/
     case VALUE:
       switch(c) {
       case ' ':
         index++;
         break;
       case '"':
          while (((c = string[++index]) != '"')
            && (c != '\0') && (c != '\n'))
           strncat(token->value, &c, 1);
         index++;
         state = NEUTRAL;
         tmp = token;
         token = token->next;
         break;
       default:
         while ((c != ' ') && (c != '\0')&& 
            (c != '\n')&& (c != '"')) 
           {
             strncat(token->value, &c, 1);
             c = string[++index];  
           }
         state = NEUTRAL;
         tmp = token;
         token = token->next;    
       }
       break;
     }
   }
   return head;
}

You can see that this code is not easy to follow, nor is it very descriptive about what it does. Clearly it iterates over the character array, switching on a remembered state to build up each SQLToken instance. However, it is not readily apparent whether the code contains a bug, and should this function need to be extended due to a change in the grammar, much rework might be needed.

A little history: the application was started around 12 years ago and was originally all C. Policy now is that new development should be in C++, updating old code where necessary. So, to bring the interface more into line with C++, the signature was changed to:

std::vector<SQLToken> query_parse(
   char const* input)

The input parameter was not changed to a std::string, as that would not really have gained anything: the calling function holds the data as a char const*, and that is also the parameter type the parser takes. The SQLToken definition, however, was changed to use std::string:

   struct SQLToken
   {
         std::string logic_op;
         std::string comp_op;
         std::string value;
   };

In order to move the legacy function to Spirit, the grammar first had to be defined. By meticulously stepping through the existing function with sample input, the following grammar was extracted:

comp_op ::= '<' | '<=' | '<>' | '>' | '>=' |
   '=' | '!='
logic_op ::= '+' | '|'
value ::= '"' not_quote+ '"' | not_space+
element ::= (logic_op? comp_op? value)+

where not_quote is any character except the quote character ("), and not_space is any character except white space (space, tab, or new line).
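To make the grammar concrete before involving Spirit at all, here is a hand-rolled sketch of just the value production in plain C++. The function name parse_value is mine, not part of the application: it tries the quote-enclosed alternative first, then falls back to a single word.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Illustrative only: a hand-rolled check of
//   value ::= '"' not_quote+ '"' | not_space+
// On success it stores the matched value (quotes stripped) in out,
// advances pos past the match, and returns true.
bool parse_value(const std::string& in, std::size_t& pos,
                 std::string& out)
{
    if (pos >= in.size()) return false;
    if (in[pos] == '"') {                // quoted alternative
        std::size_t end = in.find('"', pos + 1);
        if (end == std::string::npos || end == pos + 1)
            return false;                // need one or more non-quotes
        out = in.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        return true;
    }
    std::size_t start = pos;             // bare-word alternative
    while (pos < in.size()
        && !std::isspace(static_cast<unsigned char>(in[pos])))
        ++pos;
    if (pos == start) return false;
    out = in.substr(start, pos - start);
    return true;
}
```

Note how the two branches mirror the two alternatives of the production; Spirit lets us write the same thing declaratively.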

The Spirit documentation on the Boost website gives a great, easy-to-follow introduction [_3]. The management-summary equivalent goes something like this:

  • a parser is made up from rules

  • rules are place holders for expressions

  • expressions are either primitives or combinations

Spirit provides classes that define rules and parsers. It also provides a fairly complete set of primitives. The main primitives used for this example are spirit::str_p and spirit::ch_p. str_p matches a string, and ch_p matches a single character.

Expressions can be grouped with brackets, alternatives separated by | (the bar character), and sequences combined using the >> operator. The bar operator is overloaded in Spirit so that we need not explicitly wrap every alternative in a parser constructor call; this is a convenience, especially when trying to fit examples into a narrow text column.

The first two grammar components are quite simple. For now just accept that what is being assigned is some form of rule class and the declaration will come later.

comp_op = spirit::str_p("<>") | "<=" | "<" |
   ">=" | ">" | "=" | "!=";
logic_op = spirit::ch_p('+') | '|';

The quirky part of this is that the alternatives are evaluated in a short-circuit manner, so for the comparison operators you need to list the longest first: "<>" needs to come before "<", otherwise "<" would be matched for that expression. The Spirit library does provide a way around the short-circuit behaviour with a directive. Directives can be thought of as modifiers to an expression. Here the longest_d directive would suffice, which would give:

comp_op = spirit::longest_d[ 
   spirit::str_p("<") | "<=" | "<>" | ">" |
   ">=" | "=" | "!=" ];

However the choice was to go with the simpler definition and a comment.
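The short-circuit pitfall is easy to demonstrate without Spirit at all. In this sketch, first_match is a made-up helper, not a Spirit facility: it tries a list of operator strings in order and commits to the first prefix match, just as ordered alternatives do.

```cpp
#include <array>
#include <cassert>
#include <cstring>
#include <string>

// Illustrative only: return the first operator in 'ops' that is a
// prefix of the input, mimicking short-circuit alternation.
std::string first_match(const char* in,
                        const std::array<const char*, 7>& ops)
{
    for (const char* op : ops)
        if (std::strncmp(in, op, std::strlen(op)) == 0)
            return op;
    return "";
}
```

With the operators in the "wrong" order {"<", "<=", "<>", ...}, the input "<>" matches just "<"; with longest-first ordering it matches "<>" as intended.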

Now for the value rule. Some of the predefined character parsers were used for this.

ch_p('"') matches the quote character, ~ch_p('"') matches any character except the quote character, and +(~ch_p('"')) matches one or more non-quote characters. So the first part of the value is

'"' >> (+(~spirit::ch_p('"'))) >> '"' 

The alternative to a quote-enclosed string is a single word, where the content of the word is anything that isn't whitespace. Spirit provides a space_p that matches whitespace characters, so ~space_p will match non-whitespace characters. To make a word, we use

(+(~spirit::space_p))

Most of the time when parsing, whitespace is ignored; in this case, however, whitespace matters. As it stands, this rule would actually match the string "a b c d" as "abcd". To tell the parser that we care about the whitespace, we use the directive lexeme_d. The full rule for value is then:

value = '"' 
     >> (+(~spirit::ch_p('"')))
     >> '"' 
     | spirit::lexeme_d[(+(~spirit::space_p))];
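The effect of lexeme_d can be illustrated with a small sketch in plain C++ (match_word is a made-up name, not Spirit code): with a whitespace skipper active, spaces are silently discarded before each character match, whereas inside a lexeme the first space terminates the match.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Illustrative only: what a skip-whitespace scanner does to
// +(~space_p).  With skipping on, whitespace is thrown away before
// each character match, so every non-space character is collected;
// inside a lexeme, the first space ends the match instead.
std::string match_word(const std::string& in, bool skip_whitespace)
{
    std::string out;
    for (char c : in) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            if (!skip_whitespace) break; // lexeme: stop at whitespace
            continue;                    // skipper: discard it
        }
        out += c;
    }
    return out;
}
```

So "a b c d" comes out as "abcd" with skipping, but as just "a" in lexeme mode, which is the behaviour the value rule needs.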

The element is then an accumulation of the other rules. operator! is used for "zero or one", so the element rule is:

element = +(!logic_op >> !comp_op >> value);

The complete definition for the grammar object is then:

struct query_grammar : public spirit::
   grammar<query_grammar>
{
  template <typename ScannerT>
  struct definition
  {
    definition(query_grammar const& self)
    {
      // short circuit, so do longer 
      // possibilities first
      comp_op = spirit::str_p("<>") | "<=" 
         | "<" | ">=" | ">" | "=" | "!=";
      logic_op = spirit::ch_p('+') | '|';
      value = '"' 
         >> (+(~spirit::ch_p('"'))) 
         >> '"' | spirit::lexeme_d[
         (+(~spirit::space_p)) ];
      element = +(!logic_op >> 
         !comp_op >> value);
      BOOST_SPIRIT_DEBUG_RULE(comp_op);
      BOOST_SPIRIT_DEBUG_RULE(logic_op);
      BOOST_SPIRIT_DEBUG_RULE(value);
      BOOST_SPIRIT_DEBUG_RULE(element);
    }
    spirit::rule<ScannerT> comp_op, 
       logic_op, value, element;
    spirit::rule<ScannerT> const& start() 
       const { return element; }
  };
};

The BOOST_SPIRIT_DEBUG_RULE macro enables some very useful debugging output, which is handy for tracing your grammar when it is going wrong. A quick interactive test program allows us to try out the grammar:

int main()
{
   std::cout << "> ";
   std::string input;
   std::getline(std::cin, input);
   query_grammar parser;
   while (input != "quit") {
      if (spirit::parse(input.c_str(), parser,
         spirit::space_p).full)
         std::cout << "parse succeeded";
      else
         std::cout << "parse failed";
      std::cout << "\n> ";
      std::getline(std::cin, input);
   }
}

This was used to prove that the grammar was correct. The next challenge was how to get the parser to populate the vector of SQLToken objects while parsing. I wanted the SQLToken object to be populated during parsing and, once a complete token had been processed (!logic_op >> !comp_op >> value), pushed on to the vector.

The interesting part of handling assignment is that the definition struct constructor takes a constant reference to the outer grammar structure, so you cannot change its normal member variables. This leaves the choice between mutable members and references, and personally I tend to shy away from mutable where there is another option. So the outer grammar struct holds references to the objects that we want to populate.

struct query_grammar : 
   public spirit::grammar<query_grammar>
{
    // definition structure here... 
    query_grammar(std::vector<SQLToken>&
       tokens, SQLToken& token)
       : tokens_(tokens), token_(token) {}
    std::vector<SQLToken>& tokens_;
    SQLToken& token_;
};

The next step is to add actions to the rules, and this is done through the use of "actors". There are a number of predefined actors; the main one used here is assign_a, which takes one or two parameters. The first parameter is a reference to the string object to populate. If a second parameter is passed in, its value is assigned to the first when the rule matches; if not, the text that the rule matched is assigned.
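The two behaviours of assign_a can be mimicked with an ordinary functor. This sketch (assign_actor is my own illustrative name, not the real Spirit implementation) shows both the matched-text form and the fixed-value form:

```cpp
#include <cassert>
#include <string>

// Illustrative sketch of the two assign_a behaviours (not the real
// Spirit actor).  Spirit calls the actor with the start and end of
// the matched text; a null 'fixed' means "use the matched text",
// otherwise the fixed value is stored instead.
struct assign_actor
{
    std::string& target;
    const char*  fixed;   // null => use the matched text

    void operator()(const char* first, const char* last) const
    {
        target = fixed ? std::string(fixed)
                       : std::string(first, last);
    }
};
```

The second form is exactly what we need below, where matching '+' should store "and" and matching '|' should store "or".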

There is the situation where we want to assign "and" when the parser finds '+', and "or" for '|', so the logic_op rule is changed to look like this:

logic_op = spirit::ch_p('+')[spirit::assign_a(
   self.token_.logic_op, "and")]
         | spirit::ch_p('|')[spirit::assign_a(
   self.token_.logic_op, "or")];

Since the action is being used on the components of the rule, the definition now has to specify ch_p('|') instead of just '|', as there is no operator[] on a char.

For the value, if it was quote enclosed, the value is the contents of the string without the quotes, otherwise the value is the whole single word, so the actor is applied to the parts of the value rule, not on the rule as a whole.

value = '"' 
   >> (+(~spirit::ch_p('"'))) 
   [spirit::assign_a(self.token_.value)]
   >> '"' | spirit::lexeme_d[
   (+(~spirit::space_p))
   [spirit::assign_a(self.token_.value)] ];

The comparison operator can be handled at the whole rule level as the text of the parsed rule is the string value that we want to store for the SQLToken. This is achieved by specifying the action for the comp_op rule in the element.

element = +(!logic_op
   >> !(comp_op[spirit::assign_a(
   self.token_.comp_op)]) >> value);

The last part of the parsing is to add the token to the vector. One way of doing this is through a functor object. Standard Spirit functors need to handle two char const* parameters. These are the start and end of the "match" for the rule. In this case they aren't used at all, but instead the functor operates on the references that it is constructed with.

struct push_token
{
   push_token(std::vector<SQLToken>& tokens,
      SQLToken& token) : tokens_(tokens),
      token_(token) {}
   void operator()(char const*, 
      char const*) const
     {
       tokens_.push_back(token_);
       // reset token_ to blanks
       token_ = SQLToken();
     }
   std::vector<SQLToken>& tokens_;
   SQLToken& token_;
};

To incorporate this functor into our element rule, we specify it as the action and construct it with the same references as the grammar.

element = +(!logic_op
   >> !(comp_op[spirit::assign_a(
   self.token_.comp_op)])
   >> value)[push_token(self.tokens_,
   self.token_)];

Now it's done. After testing, which to my initial surprise worked perfectly, the old function was replaced with this:

namespace {
  using namespace boost;
  struct push_token
  {
    push_token(std::vector<SQLToken>& tokens,
       SQLToken& token) : tokens_(tokens), 
       token_(token) {}
    void operator()(char const*, 
       char const*) const
    {
      tokens_.push_back(token_);
      // reset token_ to blanks
      token_ = SQLToken();
    }
    std::vector<SQLToken>& tokens_;
    SQLToken& token_;
  };
  struct query_grammar : public spirit
     ::grammar<query_grammar>
  {
    template <typename ScannerT>
    struct definition
    {
      definition(query_grammar const& self)
      {
        // short circuit, so do longer
        // possibilities first
        comp_op = spirit::str_p("<>") | "<=" |
           "<" | ">=" | ">" | "=" | "!=";
        // + -> and, | -> or.  Could now 
        // easily add in "and" and "or"
        logic_op = spirit::ch_p('+')[spirit
           ::assign_a(self.token_.logic_op,
           "and")] | spirit::ch_p('|')[spirit
           ::assign_a(self.token_.logic_op,
           "or")];
        // values are single words or 
        // enclosed in quotes.
        value = '"' >> (+(~spirit::ch_p('"'))) 
           [spirit::assign_a(self.token_.value)]
        >> '"' | spirit::lexeme_d[
           (+(~spirit::space_p))
           [spirit::assign_a(self.token_.value)]
           ];
        // EBNF:  (logic_op? comp_op? value)+
        // parsing fails if there are no values.
        element = +(!logic_op
        >> !(comp_op[spirit::assign_a(
           self.token_.comp_op)])
        >> value)[push_token(self.tokens_,
           self.token_)];
      }
      spirit::rule<ScannerT> comp_op, logic_op,
         value, element;
      spirit::rule<ScannerT> const& start() const
         { return element; }
    };
    query_grammar(std::vector<SQLToken>& tokens,
       SQLToken& token)
       : tokens_(tokens), token_(token) {}
    std::vector<SQLToken>& tokens_;
    SQLToken& token_;
  };
  std::vector<SQLToken> query_parse(
     char const* input)
  {
    Logger logger("gds.query.engine.parse");
    GDS_DEBUG_STREAM(logger) 
       << "query_parse input: " << input;
    std::vector<SQLToken> tokens;
    SQLToken token;
    query_grammar parser(tokens, token);
    if (spirit::parse(input, parser, 
       spirit::space_p).full)
    {
      if (logger.isDebugEnabled()) {
        for (unsigned i = 0; i < tokens.size();
           ++i)
          GDS_DEBUG_STREAM(logger) << tokens[i];
      }
    }
    else
    {
      GDS_DEBUG(logger, "parse failed");
      tokens.clear();
    }
    return tokens;
  }
} // anon namespace

An anonymous namespace is used instead of the old C-style static function, and some logging was added using our logging classes, but apart from that the code went in without other modifications.

In total, I achieved a reduction of about 40 lines of code, which in itself is completely meaningless. The general complexity of the code increased, but at least in my opinion, it is now more maintainable and extensible. Should the client want to make modifications to the grammar it is now a relatively simple operation compared to the nightmare of altering the original embedded switch statements.

Special thanks to Phil Bass and David Carter-Hitchin for reviewing this article.
