String Tokenization - A Programmer’s Odyssey

This is an article that I have been writing and rewriting over a considerable period of time. While debating how best to present it, I realised that it was as much an article about my development as a C++ programmer as about the tokenization of strings.

One of the common idioms required in programming is the extraction of tokens from textual information. Over time I have used various methods of tokenizing strings, from strtok in C (and C++) through to the tokenizer class presented in this article. This article discusses the evolution of this class and how it tracks my development and understanding of the C++ language and the associated STL.

Early C++ Years

After my first introduction to OO concepts in Object Pascal, my first major application in C++ was developed with an early version of Visual C++. It used the familiar and comfortable strtok function to tokenize strings. The first step on the path to the current form of the tokenizer class was the desire to move away from such C hangovers towards something more suited to the brave new world of OO development. The known problems with the C strtok function, such as its lack of re-entrancy, reinforced this desire. This first incarnation used a simple iterator-style interface that offered the following tokenization loop:

for (Tokenizer iter(my_string); !iter.IsDone(); iter.Next()) {
  // do something with token
  cout << iter.Token();
}

The class provided the following constructors:

Tokenizer(const CString& string, CString separator = _T(","), BOOL removeSpaces = TRUE);
Tokenizer(const CString& string, CString separator, TCHAR delimiter,
          BOOL removeSpaces = TRUE);
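
As a rough illustration of this loop style, here is a minimal sketch written against std::string rather than MFC's CString. It is not the original class: the name SimpleTokenizer and all of its internals are assumptions made for this example, and the removeSpaces and delimiter options are left out. What it is meant to show is that all tokenization state lives inside the object, which avoids the hidden static state that makes strtok non-re-entrant.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of an IsDone/Next/Token style tokenizer.
// All state is held in the object and the source text is never modified,
// so several tokenizers can be active at the same time.
class SimpleTokenizer {
public:
    explicit SimpleTokenizer(const std::string& text,
                             const std::string& separators = ",")
        : text_(text), separators_(separators), start_(0), end_(0), done_(false) {
        FindEnd();
    }

    // True once Next() has been called past the last token.
    bool IsDone() const { return done_; }

    // The current token; only meaningful while !IsDone().
    std::string Token() const {
        assert(!done_);
        return text_.substr(start_, end_ - start_);
    }

    // Move to the next token, or mark the sequence as finished.
    void Next() {
        if (end_ == text_.size()) {
            done_ = true;
            return;
        }
        start_ = end_ + 1;  // step over the separator character
        FindEnd();
    }

private:
    // Locate the end of the token that begins at start_.
    void FindEnd() {
        end_ = text_.find_first_of(separators_, start_);
        if (end_ == std::string::npos) end_ = text_.size();
    }

    std::string text_;        // private copy, so temporaries are safe to pass
    std::string separators_;
    std::string::size_type start_;
    std::string::size_type end_;
    bool done_;
};
```

In this sketch empty fields are kept, so a doubled separator yields an empty token, and an empty input yields a single empty token. Both are design choices of the example rather than documented behaviour of the original class.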

One of the defects in this first attempt was the lack of a pointer-style access interface, but this was easily rectified by providing the appropriate operator* and operator-> methods. Methods for operator++ (both prefix and postfix) followed in quick succession to further refine the interface. Further refinements to the class included the ability to handle set tokenization, where each token set is separated by a different separator from the one that separates the tokens within a set (e.g. 1, red, car: 2, yellow, lorry).
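
Set tokenization can be pictured as two passes of ordinary tokenization, one per separator. The sketch below is only an illustration using std::string. The function names split_trimmed and split_sets are hypothetical, and stripping the spaces around each token is an assumption about what a flag such as removeSpaces would do.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split on one separator character and strip the
// spaces around each token.
std::vector<std::string> split_trimmed(const std::string& text, char separator) {
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type end = text.find(separator, start);
        if (end == std::string::npos) end = text.size();

        std::string token = text.substr(start, end - start);
        std::string::size_type first = token.find_first_not_of(' ');
        if (first == std::string::npos) {
            token.clear();  // the token held nothing but spaces
        } else {
            std::string::size_type last = token.find_last_not_of(' ');
            token = token.substr(first, last - first + 1);
        }
        tokens.push_back(token);

        if (end == text.size()) return tokens;
        start = end + 1;  // step over the separator
    }
}

typedef std::vector<std::vector<std::string> > token_sets;

// Outer pass splits the text into sets, inner pass splits each set into tokens.
token_sets split_sets(const std::string& text,
                      char set_separator = ':',
                      char token_separator = ',') {
    assert(set_separator != token_separator);  // the two levels must differ
    token_sets sets;
    const std::vector<std::string> raw = split_trimmed(text, set_separator);
    for (std::vector<std::string>::size_type i = 0; i < raw.size(); ++i) {
        sets.push_back(split_trimmed(raw[i], token_separator));
    }
    return sets;
}
```

Applied to the example text above, split_sets produces two sets of three tokens each.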

STL Intervention

Striving to keep up with C++ developments, it was with some relief that I was finally able to start using a version of Visual C++ that shipped with the STL. One of my major aims is to write code that is as generic as possible; as a result I started using basic_string more and more in preference to the MFC CString class. To support this I simply migrated the original tokenizer class from the MFC CString to basic_string.

As my understanding of the logic behind the STL and C++ templates improved (I had previously used generics in Ada, so already understood the basic concepts), I became convinced of the benefits of ensuring that STL extension classes use a syntax similar to that of the STL itself. This required two steps: first, converting the class to a template so as to match the underlying basic_string definition; and second, matching the STL iterator-style interface. After these changes a loop looked like this:

basic_tokenizer<basic_string<char> > tokenizer(my_string);
typedef basic_tokenizer<basic_string<char> >::iterator iterator;
for (iterator iter(tokenizer.begin()); iter != tokenizer.end(); ++iter) {
  // do something with token
  cout << *iter;
}

The class template declared the following constructors:

template <class T>
class basic_tokenizer{
  basic_tokenizer(const T& string, T separator = _T(","),  bool removeSpaces = true);
  basic_tokenizer(const T& string, T separator, TCHAR delimiter, 
                        bool removeSpaces = true);
    :
    :

Using templates had the immediate advantage that the class now supported any string-type class that conformed to the STL basic_string interface, such as the SGI rope class.
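
The benefit of templating on the string type can be seen in a small sketch that touches only the nested types and member functions of its template parameter. The function name split is hypothetical and this is not the tokenizer class itself. It is written with std::string and std::wstring in mind, and other string-like types would need to provide the same members.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: splits any basic_string-like type without assuming
// a particular character type. Only size_type, npos, find_first_of, substr
// and empty are required of String.
template <class String>
std::vector<String> split(const String& text, const String& separators) {
    assert(!separators.empty());  // an empty separator set could never split
    typedef typename String::size_type size_type;

    std::vector<String> tokens;
    size_type start = 0;
    for (;;) {
        size_type end = text.find_first_of(separators, start);
        if (end == String::npos) {
            tokens.push_back(text.substr(start));  // final token
            return tokens;
        }
        tokens.push_back(text.substr(start, end - start));
        start = end + 1;  // step over the separator
    }
}
```

The same function body serves both narrow and wide strings; the character type never appears in it.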

STL Conformance

A remaining concern with this interface was that, since the tokenizer was effectively an iterator, it should be possible to use it with the standard STL algorithms such as copy.

While considering the issues I happened across two articles describing alternative implementations of tokenizer classes [1] and at this point I considered abandoning my class and using one of these in preference. On reflection I believed that these both had problems of their own with their STL iterator syntax use, so I examined ways to incorporate the desired improvements in the next iteration of my own class.

In summary the design goals for the next version of the class were:

Support any basic_string conformant type
Use only standard methods for implementation
Support STL iterator syntax
Support STL algorithms (only as input iterator)
Use functors for the tokenization function to allow for replacement tokenization methods

The first goal is achieved simply by ensuring that the template definition uses no prior knowledge of the string type, requiring only that the template parameter conforms to the standard STL basic_string interface. The class can then use the standard informational types provided by the string class internally. Additionally, the class requires a token finder 'functor' class, which is used to extract the tokens from the string being tokenized (the details are beyond the scope of this article and can be found in the source code).

template <class T, class F = basic_finder<typename T::value_type> >
class basic_tokenizer : public std::iterator<std::forward_iterator_tag, T>...
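
Since the interface of basic_finder is not shown in this article, the following is only a guess at how a replaceable finder functor could be arranged. Every name in it (separator_finder, blank_finder, tokenize) is hypothetical. The assumption is that a finder takes an iterator range and returns the position at which the current token ends, so that swapping the functor changes the tokenization rule without touching the loop that uses it.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical finder: a token ends at the first character that appears
// in the separator set.
template <class Char>
class separator_finder {
public:
    explicit separator_finder(const std::basic_string<Char>& separators)
        : separators_(separators) {}

    template <class Iter>
    Iter operator()(Iter first, Iter last) const {
        return std::find_first_of(first, last,
                                  separators_.begin(), separators_.end());
    }

private:
    std::basic_string<Char> separators_;
};

// Hypothetical replacement finder: a token ends at the first space or tab.
struct blank_finder {
    template <class Iter>
    Iter operator()(Iter first, Iter last) const {
        for (; first != last; ++first) {
            if (*first == ' ' || *first == '\t') break;
        }
        return first;
    }
};

// The tokenization loop knows nothing about separators; it only asks the
// finder where the current token stops.
template <class String, class Finder>
std::vector<String> tokenize(const String& text, Finder find_end) {
    typedef typename String::const_iterator iter;
    std::vector<String> tokens;
    iter first = text.begin();
    const iter last = text.end();
    for (;;) {
        iter stop = find_end(first, last);
        assert(first <= stop && stop <= last);  // finder must stay in range
        tokens.push_back(String(first, stop));
        if (stop == last) return tokens;
        first = stop + 1;  // step over the separator
    }
}
```

Passing separator_finder<char> or blank_finder to tokenize selects the rule at compile time, in the same way that a default template argument for the finder selects it for the class above.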

The class is derived from the base STL iterator class so that it can be used in much the same way as other STL iterators, and so a tokenization loop looks as follows (note that the tokenizer is designed to allow the test for the end of the loop to be made directly against the string's end function).

typedef sfx::util::basic_tokeniser<std::string> tokeniser;
for (tokeniser tok = test1.begin(); tok != test1.end(); ++tok) {
  cout << *tok << endl;
}

In supporting use of the tokenizer in STL algorithms it needs to be understood that the tokenizer class provides an iterator adapter. The tokenizer is then used to adapt both the begin and end iterators from the underlying string class for use in an STL algorithm, like this:

copy(tokenizer(my_str.begin()), tokenizer(my_str.end()), output_iterator);

Note that the tokenizer is specifically derived from a const_iterator to ensure that it can only be used in algorithms where input iterators are allowed, as this is the only form that makes sense for a tokenization sequence.
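
To make the adapter idea concrete, here is a minimal sketch of an input iterator that wraps std::string::const_iterator. It is not the class described in this article. The names token_iterator and collect_tokens are hypothetical, the separator is a single character, and the begin-style constructor takes the end of the string explicitly, which differs from the one-argument form used above. Because std::iterator was deprecated in C++17, the sketch declares the iterator typedefs as members instead of deriving from it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical input-iterator adapter over std::string::const_iterator.
class token_iterator {
public:
    // Member typedefs that std::iterator_traits and the algorithms rely on.
    typedef std::input_iterator_tag iterator_category;
    typedef std::string value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const std::string* pointer;
    typedef const std::string& reference;

    typedef std::string::const_iterator base_iterator;

    // Begin-style adapter: tokenizes [first, last) on 'separator'.
    token_iterator(base_iterator first, base_iterator last, char separator = ',')
        : start_(first), stop_(first), last_(last), separator_(separator) {
        load();
    }

    // End-style adapter. Deliberately implicit, so that a tokenizer can be
    // compared directly with the end() of a const string.
    token_iterator(base_iterator last)
        : start_(last), stop_(last), last_(last), separator_(',') {}

    reference operator*() const { assert(start_ != last_); return token_; }
    pointer operator->() const { assert(start_ != last_); return &token_; }

    token_iterator& operator++() {
        assert(start_ != last_);
        start_ = stop_;
        if (start_ != last_) ++start_;  // step over the separator
        load();
        return *this;
    }

    token_iterator operator++(int) {
        token_iterator old(*this);
        ++*this;
        return old;
    }

    // Equality looks only at the position in the underlying string.
    friend bool operator==(const token_iterator& a, const token_iterator& b) {
        return a.start_ == b.start_;
    }
    friend bool operator!=(const token_iterator& a, const token_iterator& b) {
        return !(a == b);
    }

private:
    // Cache the token that begins at start_.
    void load() {
        stop_ = std::find(start_, last_, separator_);
        token_.assign(start_, stop_);
    }

    base_iterator start_;  // start of the current token
    base_iterator stop_;   // one past the end of the current token
    base_iterator last_;   // end of the underlying string
    char separator_;
    std::string token_;
};

// Use with a standard algorithm, mirroring the copy call shown above.
std::vector<std::string> collect_tokens(const std::string& text) {
    std::vector<std::string> out;
    std::copy(token_iterator(text.begin(), text.end()),
              token_iterator(text.end()),
              std::back_inserter(out));
    return out;
}
```

Comparing positions against the end of the string has a consequence worth noting: an empty token in the middle of the text is kept, but an empty token after a trailing separator is dropped, because the iterator already equals the end at that point. The direct comparison with end() relies on a single implicit conversion, so in this sketch it works for a const string (or with cend()) but not with the non-const end() of a mutable string.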

Final Words

After completing this article I came across updates to one of the tokenizer classes I had previously investigated, made within the C++ Boost community. I have not yet had the chance to study these in detail, but decided that the article was still worthwhile in light of this development.

During the long gestation of this class I have learnt an enormous amount about template concepts and their effective use, and about how the STL fits together and can be extended naturally. It is critical when designing STL-compatible extension classes to ensure that both class syntax and use are familiar to STL users. If you have comments on the design or other aspects of the class then please drop me an email at the address below. Source code for the class can (shortly) be found on my website http://www.wilsonsonline.org .

[1] A Generic Iterator for Strings, David Lorde, C/C++ Users Journal, April 1999. The Token Iterator, John R. Bandela, http://www.codeproject.com/cpp/tokeniterator.asp
