pinFine Tuning for lexical_cast

Overload Journal #74 - Aug 2006 + Programming Topics   Author: Alexander Nasonov
Alexander Nasonov takes a look at Boost's lexical_cast and addresses a common user request: "make it go faster".

A few weeks ago I created a patch that reduces a size of executable when arrays of different sizes are passed to lexical_cast. I sent the patch to Kevlin Henney and to a maintainer of lexical_cast Terje Slettebo. To my surprise, nobody is actively maintaining lexical_cast anymore. Terje kindly asked me to take a maintainership and I did.

To those of you who have never used boost::lexical_cast before, this is a function that transforms a source value of one type into another type by converting the source value to a string and reading the result from that string. For example, if the first argument of a program is an integer, it can be obtained with int value = lexical_cast<int>(argv[1]) inside the main function. For more information, refer to [Boost].

There were several requests from users and I, as the maintainer, should consider them. The first request that I recalled was to improve the poor performance of the lexical_cast. Inspection of the boost/lexical_cast.hpp file showed that an argument of lexical_cast is passed to an internal std::stringstream object and a return value is constructed by reading from that std::stringstream object. It is well known among C++ programmers that a construction of std::stringstream is slow. This affects performance significantly because the object is created to format one value every time the lexical_cast function is called.

In the general case, not much can be improved but specialized conversions can be optimized. One example is where this can be done is lexical_cast<char>(boolean_value), which could be expanded to '0' + boolean_value if this combination of types was supported.

Another example is a conversion from an integral type to std::string. It could be as fast as

std::string s = itoa(n); // n is int

I wish I could add a specialization that handled this case but itoa is not a standard function. The sprintf function might be an option but it formats differently compared to std::ostringstream and it also might be painful to mix C and C++ locales.

Sounds like I should implement a function similar to itoa. No problem. Though it's always good to study the subject before coding.

Some might argue that coding activity should be driven by tests. Well, that's a great approach but even having an excellent test suite doesn't guarantee that your code is bug-free. One goal of code analysis is to explore new test cases. You will easily recognize them during the course of the article.

Actually, the lexical_cast is shipped with a test (see libs/conversion/lexical_cast_test.cpp in Boost tree). It will help not to make a silly mistake but is not enough for checking hand coded conversion from an integral type to std::string because the test relies on correctness of std::stringstream code.

Analysis of several itoa implementation

There are several open-source UNIX distributions available ([FreeBSD], [NetBSD] and [OpenSolaris]) that include a code of itoa-like functions. I should make it clear that I'm not going to copy code between projects. Even if both projects are open-source, there are might be some incompatibilities between licences that don't let you redistribute mixed code under one licence. I only analyzed algorithms and pitfalls of implementations.

Some implementations of itoa look similar to this code:

char* itoa(int n)
{
    static char buf[12]; // assume 32-bit int
    buf[11] = '\0';
    char *p = buf + 11;
    int negative = n < 0;
    if(negative)
        n = -n;
    do {
        *--p = '0' + n % 10;
        n /= 10;
    } while (n);
    if(negative)
        *--p = '-';
    return p;
}
    

This code is plainly wrong. More precisely, -n has an implementation-defined behaviour. On 2s-complement systems, a negation of INT_MIN overflows. On systems where integer overflow doesn't interrupt a program execution, a result of INT_MIN negation is INT_MIN.

This fact has an interesting consequence for the algorithm. The first argument of the n % 10 expression is negative, therefore, a result of this expression is implementation-defined as per section 5.6 bullet 4 of the standard [1]:

If both operands are nonnegative then the remainder is nonnegative; if not, the sign of the remainder is implementation-defined

On my FreeBSD-powered Pentium-M notebook, this function returns:"-./,),(-*,(". This is definitely not a number!

So, I threw this version away and went to the next implementation:

char* itoa(int n)
{
    static char buf[12]; // assume 32-bit int
    buf[11] = '\0';
    char *p = buf + 11;
    unsigned int un = n < 0 ? -n : n;
    do {
        *--p = '0' + un % 10;
        un /= 10;
    } while (un);
    if(n < 0)
        *--p = '-';
    return p;
}
    

It still applies unary minus to int type but the result is immediately transformed into unsigned int. This makes a big difference compared to the previous code. According to [1] 4.7/2, a conversion of some value v to unsigned type is the least congruent to v modulo 2^N, where N is a number of bits used to represent values of the target type.

In other words, v belongs to one equivalence class with v + 2^N, v + 2^N + 2^N, v + 2^N + 2^N + 2^N and so on. By common convention, values within [0, 2^N) are used to represent classes of equivalent values.

Negative numbers are outside of this range and therefore, 2^N should be added to convert them to unsigned type1. For example, n=INT_MIN in 2s-complement representation is -2^(N-1). This number is equivalent to -2^(N-1) + 2^N = 2^(N-1). Therefore, un is equal to an absolute value of n.

To get rid of implementation-defined behaviour of the -n expression, the line:

unsigned int un = n < 0 ? -n : n;
    

can be replaced with:

unsigned int un = n;
if(n < 0)
    un = -un;
    

It has already been discussed that a conversion in the first line has well-defined behaviour. Negative number n is converted to unsigned value n + 2^N, other values become unsigned by trivial conversion.

The third line operates on unsigned type and therefore, obeys modulo 2^N arithmetic as per 3.9.1/4 of [1]. This means that a value of un after assignment is equal to -(n + 2^N) = -(n + 2^N) + 2^N = -n. Since n is negative, -n is positive and we can conclude that un is equal to an absolute value of n.

A reader might wonder why we don't just write unsigned int un = abs(n)? The answer is the same. When an absolute value of n cannot be represented by int, a result of abs function is implementation-defined.

Let's move on to the next problems. Illustrated functions define static char buf[12].

First problem is that it is assumed that int's representation is not bigger than 32 bits. A size of buf should be correctly calculated for all possible representations:

static char buf[3 +  std::numeric_limits<int>::digits10];
    

The number 3 is a sum of 1 byte for minus sign, 1 byte for trailing '\0' and 1 byte for a digit not counted by digits10 constant.

A more generic formula for arbitrary integral type T is:

2 + std::numeric_limits<T>::is_signed + std::numeric_limits<T>::digits10
    

The second problem with buf is that it is shared between invocations of this function. You cannot call itoa from two threads simultaneously because both calls would try to write at the same location - to the buf array.

There are several solutions. One can pass a pointer to a buffer of sufficient length to itoa:

void itoa(int n, char* buf);
    

It introduces one inconvenience, though. Original itoa is very handy for code like s += itoa(n). This modification would definitely break it because prior to making a call, a buffer variable should be defined. For the very same reason calling lexical_cast in-place is better then defining std::ostringstream variable and performing formatting.

The idea is to return an array from itoa. Builtin C/C++ array cannot be copied, so it should be wrapped into struct:

struct itoa_result
{
    char elems[12]; // assume 32-bit int
};
itoa_result itoa(int n);
    

The technique of wrapping an array into a struct can be generalized in C++. We don't have to do this though because boost::array2 is (quoting Boost documentation) STL compliant container wrapper for arrays of constant size:

boost::array<char, 3 +
    std::numeric_limits<int>::digits10> itoa(int n);
    

A typical call of this function would look like this:

s += itoa(n).elems;
    

or, if you'd like to save a result in a variable that has a wider scope than that of the temporary return value of itoa, take this route:

char buf[sizeof itoa(n).elems];
std::strcpy(buf, itoa(n).elems);
    

It's interesting to note how buf is defined in the last example. There is no magic formula to calculate a capacity to hold the int value, just the sizeof applied to an array returned by itoa. It is much easier to remember this trick then to recall the formula.

Unless you count every byte in buf, an access to elems in sizeof expression can be omitted:

char buf[sizeof itoa(n)];
    

This might increase the size of buf slightly but this is not dangerous, while it saves typing 6 characters.

Other improvements that can be applied to itoa are consistency with C++ streams and support for other integral types.

The first improvement is printing digits in groups delimited by a thousands separator if the current locale has grouping. It is worth noting that this increases the size of the buffer. Since the thousands separator is a char, this change doesn't add more then std::numeric_limits<int> ::digits10 characters to the buffer.

Integral types other then int are not different from an implementation point of view. Although it's tempting to turn itoa into a function template, there is a good reason to leave it and add overloads for other integral types instead. The fact is that some types cannot be used in templates. Such things as bit fields and values of unnamed or local enumeration types are incompatible with a template machinery.

That's it. Definitely too much for the patch but we created a good ground for another library.

Tuning for lexical_cast

As shown in a previous section, string representations of some source types have a limited length that can be calculated at compile-time. For every such type, we define a specialization of the lexical_cast_src_length metafunction that returns that limit. For example:

template<> struct lexical_cast_src_length<bool>    
: boost::mpl::size_t<1> {};
 template<> struct lexical_cast_src_length<int>
     : boost::mpl::size_t<2 +
         std::numeric_limits<int>::digits10> {};
    

A source value can be placed in a local buffer buf of appropriate length without the overhead of std::stringstream class, e.g. a lexical conversion of bool value could be a simple buf[0] = '0' + value expression, likewise itoa could format an int value and so on.

Fine, the output phase is completed without calling to C++ iostream facilities. Next is the input phase. Unlike the output phase, where a set of source types is limited to those that have a specialization of lexical_cast_src_length, a target type can be any InputStreamable3 type. This means that we have to use C++ streams. The implementation is straightforward. First of all, a streambuf object is constructed, then its get area (setg member function) is set to point to a location of formatted source value as shown in Listing 1.

struct lexical_streambuf : std::streambuf
{
    using std::basic_streambuf<CharT>::setg;
};

// Adapted version of lexical_cast
template<class Target, class Source>
Target lexical_cast(Source const& arg)
{

  // Pretend that Source has a limited string
  // representation
  char buf[ lexical_cast_src_length<Source>::value ];
  char* start = buf;

  // Output phase ...
  // char* finish points to past-the-end of
  // formatted value

  // Input phase
  lexical_streambuf sb;
  std::istream in(&sb);
  sb.setg(start, start, finish);
  Target result;
  if(!( in >> result) || in.get() !=
        std::char_traits<char>::eof())
    throw bad_lexical_cast(typeid(Source),
                           typeid(Target));
  return result;
}
Listing 1

To solve a performance degradation of the generic solution, specialized versions can be added. For example, if Target is std::string, the input phase can be implemented in a single line of code:

return std::string(start, finish);
    

Testing

It's great that the library already has a test. We can safely assume that combinations of Source and Target types that are dispatched to old implementations are tested thoroughly. Some combinations of types that trigger an execution of the new code are tested as well but we shouldn't rely on it.

The plan is to write down a list of Source and Target types that have an optimized implementation. Then all combinations should be tested one by one. Implementation details should be taken into account so that there are no surprises like INT_MIN being formatted incorrectly or weird behavior for some locales.

Completeness of the test can be verified by running the test under control of a coverage tool. If some line is not hit, it's a good hint for a new testcase. There is one small problem with specializations of lexical_cast_src_length metafunction, though. They don't have executable code at all. How do you check that some particular specialization is in use? That's easy, just define an empty check_coverage function in each specialization and call it from a point of use of the metafunction.

Benchmark

A conversion from int digit to char has been measured on FreeBSD 6.1. The test has been compiled with gcc 3.4.4 with -O2 optimization flag turned on.

#include <boost/lexical_cast.hpp>
int main()
{
  int volatile result = 0;
  for(int count = 1000000; count > 0; --count)
    result += boost::lexical_cast<char>(count % 10);
  return result;
}
    

The next table shows execution times of the test compiled with Boost 1.33.1 and with the patched lexical_cast. The last column is the performance influence of not constructing std::locale and std::numpunct<char> objects and not calling std::numpunct<char>::grouping function. This change is not included into the patch because it is not consistent with the standard output operator for int.

Boost 1.33.1 Patch Ratio Patch that ignores locales
2.012 sec 0.322 sec 6.24 0.083 sec

More common is a conversion to std::string. To measure performance of this conversion, the for body in the test program has been replaced with

result +=
  boost::lexical_cast<std::string>(count).size();
    

The results are:

Boost 1.33.1 Patch Ratio Patch that ignores locales
2.516 sec 0.844 sec 2.98 0.626 sec

Conclusion

After careful tuning, the lexical_cast can run several times faster. I hope that future versions of Boost will have an improved version of . It will likely happen in version 1.35.

References

[Abrahams] C++ Template Metaprogramming by David Abrahams and Aleksey Gurtovoy, ISBN 0-321-22725-5

[Boost] The Boost Library, http://www.boost.org

[FreeBSD] The FreeBSD Project, http://www.freebsd.org

[NetBSD] The NetBSD Project, http://www.netbsd.org

[OpenSolaris] The OpenSolaris Project, http://www.opensolaris.org

[Standard] ISO/IEC 14882, Programming Languages - C++, Second Edition 2003-10-15

1. This statement is not correct if the source type has more bits then N. For such conversions, 2^N should be added repeatedly until the result is within [0, 2^N) range.

2. The class is under discussion for inclusion into the next version of the standard; see N1836, Draft Technical Report on C++ Library Extensions.

3. A more correct list of requirements is InputStreamable, CopyConstructible and DefaultConstructible

Overload Journal #74 - Aug 2006 + Programming Topics