A Case Against Blind Use of C++ Parallel Algorithms

C++17 introduced parallel algorithms. Lucian Radu Teodorescu reminds us we need to think when we use them.

We live in a multicore world. The hardware free lunch is over for about 15 years [Sutter05]. We cannot rely on hardware vendors to improve the single-core performance anymore. Thus, to gain performance with hardware evolution we need to make sure that our software runs well on multicore machines. The software industry started on a trend of incorporating more and more concurrency in the applications.

As one would expect, the C++ standard has also started to provide higher level abstractions for expressing parallelism, moving beyond simple threads and synchronisation primitives. Just for the record, I don’t count std::future as a high-level concurrency primitive; it tends to encourage a non-concurrent thinking, and, moreover, its main use case almost implies thread blocking. In the 2017 version of the standard, C++ introduced the so-called parallel algorithms. In essence, this feature offers parallel versions of the existing STL algorithms.

This article tries to cast a critical perspective on the C++ parallel algorithms, as they were introduced in C++17, and as they are currently present in C++20. While adding parallel versions to some STL algorithms is a good thing, I argue that this is not such a big advancement as one might think. Comparing the threading implications of parallel algorithms with the claims I’ve made in [Teodorescu20a] and [Teodorescu20b], it seems that the C++ additions only move us half-way through.

A minimal introduction into C++ parallel algorithms

To form some context for the rest of the article without spending too much time on this, let’s provide an example on how to use a parallel STL algorithm.

Let’s assume that we have a transform algorithm, and we want to parallelise it. For that, one should write something like the code in Listing 1. The only difference to a classic invocation of transform is the first parameter, which, in this case, tells the algorithm to use parallelisation and vectorisation.

std::transform(std::execution::par_unseq,
                         // parallel policy
  in.begin(), in.end(),  // input sequence
  out.begin(),           // output sequence
  ftor);                 // transform fun

Listing 1

This parameter is called execution policy. It tells the algorithm the type of execution that can be used for the algorithm. In the current C++20 standard there are four of these parallel policies, as explained below:

seq: it will use the serial version of the algorithm, as if the argument was missing
par: the algorithm can be parallelised, but not vectorised
par_unseq: the algorithm can be parallelised and vectorised
unseq: the algorithm can be vectorised but not parallelised (introduced only in C++20)

So, to transform an existing algorithm into a parallel (or vectorised) version, one just needs to add an argument to specify the parallel policy; the effort is minimal.

Please note that the library is allowed to completely ignore the execution policy and fall back to the serial execution. Thus, this execution policy provides just a hint for the library, or the maximum parallelisation/vectorisation level allowed.

Most STL algorithms take the execution policy parameter and can be instructed to run in parallel. There are also some new algorithms that were added to overcome the fact that existing algorithms have constraints that forbid parallelising them, or that there are better ways to express some parallel algorithms: reduce, exclusive_scan, inclusive_scan, transform_reduce, transform_exclusive_scan, transform_inclusive_scane.

For a better introduction and explanation of C++ parallel algorithms, the reader should consult [Lelbach16] [Filipek17a] [Filipek17b] [ONeal18].

Most of our discussion will be focusing on the parallel execution (par policy), the one that aims to utilise all the cores available to increase efficiency. Vectorisation (unseq policy) will be briefly touched towards the end of the article.

Problem 1: no concurrency concerns

The first thing to notice is that it’s straightforward to adapt existing algorithms and make them parallel. This probably partially explains the success of parallel algorithms (at least at the perception level).

But this ease of use also has a negative side effect. The astute reader might have noticed that we are talking about parallelism, and not about concurrency. See [Pike13] and [Teodorescu20c] for the distinction. Very short, concurrency is a design concern, while parallelism is a run-time efficiency concern.

It is ok, for limited domains, to focus more on efficiency than design, but that’s typically not the case with concurrency. Unless one pays attention to concurrent design, one will get suboptimal efficiency. In other words, multicore efficiency is a global optimisation problem, not a local one.

C++ parallel algorithms don’t allow a global concurrency design. It allows only local optimisations, by making some algorithm calls parallelisable.

Considering this, things are awful from a didactical point of view: the C++ standard might teach people that one needs not to pay attention to concurrency issues; these would be magically solved by using STL. I hope that it’s clear by now that this is not the case.

If we generalise a bit, we may come to the conclusion that this is the same problem that led us to bad concurrency design in the first place. Instead of recognising that concurrency needs a completely new type of design, we tried to ‘patch’ the old imperative and serial thinking by adding the ability to run on multiple threads; see also Kevlin Henney’s discussion on synchronisation quadrant [Henney17]. To do proper concurrency, we need to drop the old ways of writing software, and adopt a new paradigm.

But maybe the C++ committee never intended to solve the concurrency problem, they only wanted to solve some local efficiency problems, and it’s just a misunderstanding of their goal. Let’s tackle efficiency concerns.

Problem 2: applications have more than algorithms

Remember Amdahl’s law? To have speed improvements from parallelism, one needs to have a significant part of the code that is parallelisable. In the case of Parallel STL¹, one needs to have significant parts of the application using such STL algorithms.

Let me repeat this: in order to have relevant performance benefits, the vast majority of the time needs to be spent in STL algorithms.

This is somehow hard to achieve for a lot of the applications. Not every program consists just in STL algorithm calls. Actually, many programs out there have flows that it are very hard to reduce to STL algorithms, if it’s even possible. Lots of these applications can only expose control flow concurrency, making them unsuitable for using parallel algorithms.

One example that I have in mind is applications that do a lot of graph processing. For most of the cases, there aren’t any STL algorithms ready to be used.

Problem 3: multiple algorithms introduce serial behaviour when combined

Let’s assume that to solve a particular problem one needs to call multiple STL algorithms. If the algorithms were to interact, then they need to be called in a serial fashion; i.e., call one algorithm, then call another, etc. The scenario is illustrated in Figure 1.

Figure 1

As one can see in the picture, there are three parts of the flow that are bound to execute serially: before the first algorithm, between the algorithm calls and at the end of the second algorithm. And, no matter how good the implementation is, there is some time spent in synchronising and starting tasks; thus, according to Amdahl’s law this will put an upper bound of the performance improvement.

But this is not even the worse part. Whenever the algorithm ends, it will have to wait for all the threads to finish executing. This is a consequence of the fact that the amount of work cannot be perfectly distributed between threads, and some threads will process more than others. Typically, this will make the threads stop executing real work, reducing the parallelism capacity of the machine.

To counter this behaviour, one must do one or more of the following:

start algorithms from multiple threads
ensure that algorithms have continuations, avoiding serially calling algorithms
tune the parallel algorithms to ensure that the work split if even.

The first item can be doable outside parallel STL but with no direct support. However, the current design of the standard library does not allow one to implement any of the following two items.

Problem 4: small datasets are not good for parallelisation

Considering Amdahl’s law and the overhead one gets from synchronising the exit of an algorithm, let us investigate when it makes sense to parallelise an algorithm. To make it worthwhile for an algorithm to be parallelised, the execution time of the algorithm needs to be somehow large. See my analysis in [Teodorescu20a]. On the type of applications I typically work on, I would say as a rule of thumb that the algorithm needs to take more than 100 milliseconds for being worthwhile to be parallelised.

Please excuse my over-generalisation, but it doesn’t seem too rewarding to try to parallelise a 100 millisecond algorithm. Obtaining a linear speedup on a 6-core machine would save 83.3 milliseconds on the completion time, and will fully utilise all the cores. Probably there are better optimisation opportunities.

For an algorithm to take a long time to execute, one of the following conditions must be true:

the number of elements in the collection that the algorithm operates on needs to be sufficiently large
the operation given to the algorithm needs to take a long time.

While the second condition can be true in many applications, the first one (numerous elements) is typically not true.

Before we move on, I would beg the reader to consider most of the applications that one worked on. How many of them had algorithms operating on too many elements (e.g, millions of elements), and how many of them had algorithms that are given operations that are long to execute (more than 100 milliseconds per functor call). I bet that there aren’t that many applications, especially in the first case.

Now, I’ve been saying numerous elements, but I haven’t been too precise. How many elements count as numerous?

To give an order of magnitude for this, let’s take the examples from [Filipek17b]. The first example considers a simple transform operation, with the functor just doubling the value received as a parameter; the algorithm is run over a vector of double numbers. For 100k elements, on his more powerful machine (i7 8700, 6 cores), the execution times goes down from 0.524 to 0.359 seconds. That’s just a 1.46x improvement for 6 cores. For this type of operation, it’s not worth parallelising it.

After this test, Bartlomiej writes [Filipek17b]:

As you see on the faster machine, you need like 1 million elements to start seeing some performance gains. On the other hand, on my notebook, all parallel implementations were slower.

Let’s consider his real world example: computing the Fresnel transform on pairs of position and normal vectors. On this example, it takes 100k elements to have the transformation go from 1.697 to 0.283 (that’s a 5.996x improvement; this is good).

Here, we need 100k elements for the performance improvement to be in the order of hundreds of milliseconds.

But, maybe Bartlomiej was having some odd and biased tests. Let’s also look at the benchmark that Billy O’Neal made [ONeal18]. After all, Billy implemented parallel algorithms in Microsoft Visual Studio’s standard library, so he must know better. His article presents multiple benchmarks on a sort algorithm. And in all the cases he uses 1 million (with all times reported for release configuration less than 100 ms).

So, as a rule-of-thumb, to see significant performance improvements from parallelising STL algorithms, one needs to have containers with a rough order of magnitude of 1 million elements. From my experience, this doesn’t happen too often.

Problem 5: cannot tune the algorithms

Although this problem applies to all the algorithms, let’s take the sort algorithm as a running example.

Sorting 100 integers is faster with a linear algorithm than with a parallel sorting algorithm. So, the algorithms typically have a cutoff point; if the number of elements is below this cutoff point, then the algorithm simply calls the serial version.

In the parallel sort implementation of Intel TBB, this cutoff point is 500 elements. In GCC implementation of parallel algorithms the cutoff point is still 500 elements. My own concore library [concore] also has a cutoff of 500 elements for the sort implementation. This is a common pattern.

But it’s not the same thing to sort integers, to sort strings or to sort some complex objects. The cutoff points need to be different. But the C++ standard doesn’t allow any tuning of the algorithms.

That is, either the cutoff point is too small for sorting integers, or too large to sort complex objects in parallel. One size does not fit all.

Similarly, let’s assume a simple for_each operation. For some cases, if the algorithm needs to call a heavy function, it’s ok to have each element mapped into a task. On the other hand, if the transformation is simple (i.e., an arithmetic operation) then, creating one task per element will be bad for performance. Thus, algorithms may need tuning to accommodate different input patterns.

Listing 2 presents a small snippet using concore library [concore] showing how one can set hints to parallel algorithms.

concore::partition_hints hints;
hints.granularity_ = 100;
  // cannot process less than 100 elements in 
  //a single task
concore::conc_for(itBegin, itEnd, operFtor,
  hints);

Listing 2

As mentioned above, the current C++ standard doesn’t allow any tuning of the algorithms. This leads to suboptimal performance.

Other notes

Compile time regressions

All the algorithms in the standard library are heavily templatised, and thus so are the parallel versions of the algorithms. They all need to be implemented in C++ headers. But, implementing these parallel versions is far more complicated than implementing the serial versions. Thus, the parallel versions of the algorithms add significant compile-time overhead. See [Rodgers18] for some GCC implementation notes, mentioning this problem.

This is not a problem to be taken lightly. It seems that it takes more and more to compile C++ programs. In a lot of code-bases that I’ve seen the compilation cycles are extremely demoralising; not to mention how bad it is to apply Test-Driven Development in projects that take forever to compile.

Luckily for us, there are a few tricks that library implementers can employ to save part of this problem. For example, GCC will not compile the parallel versions of the library if <execution> is not included [Rodgers18].

More work to finish faster

This might surprise some readers, but the parallel versions of some algorithms need more work than their serial counterpart to achieve the same results. That is, we do more work in the hope that parallelisation will make the algorithms finish faster. It’s a tradeoff between throughput and latency. exclusive_scan and inclusive_scan are prime examples of such algorithms.

This is not a weakness of the C++ parallel algorithms, but a fundamental limitation of these parallel algorithms.

Quick change of policy is an advantage

The C++ algorithms are easily turned on by changing (or adding) a parameter to the function call. So, an easy switch can improve the performance of some applications. This is a good thing.

Vector level parallelism is OK

All the discussion so far was focused on parallel algorithms that perform work on multiple cores. I.e., the par execution policy. Let’s briefly turn our attention to the vectorisation (unseq policy).

If one cannot achieve good parallelism without having a more global perspective, to efficiently use vectorisation, one typically must focus on the local computations. This local focus makes it perfect for vectorisation to be applied at the STL algorithms level.

This can potentially unlock a larger portion of the computation power available on modern computers [Parent16].

Conclusions

C++ parallel algorithms bring vectorisation to common algorithms, without requiring a lot of effort from the user. This is great! C++ parallel algorithms also bring parallel versions of the algorithms without requiring a lot of effort from the user. However, this is not that great. It is a tool that can be useful in some cases, but it’s not great.

First, there is a central design issue. Moving to multicore programming requires a somehow-global concurrency design. C++ parallel algorithms don’t offer any support for this. Moreover, the standard sends a bad signal to all the C++ programmers that somehow parallelism can be achieved without concurrent thinking. I can’t stress enough how bad is this.

As applications have more than just algorithms, we argue that using C++ parallel algorithms is not enough to achieve good speedups to most of the applications.

Then, we walk over some other problems, related to lower-level performance issues: multiple algorithms are serialised, typical small datasets make performance gains small, impossibility of tuning the algorithms. So, besides the high-level concurrency design issue, we have efficiency hurdles at low-level too.

It is true that it’s easy for a user to quickly change the policy of the STL algorithms and maybe get some performance benefits. But my focus in this article is on the empty half of the glass. I’m arguing that the benefits are not as big as one could obtain with a proper concurrency design. In some limited cases (i.e., many elements, or functors that are too complex) one might get some speedups for one’s algorithms. But even in these cases, the costs of spiking into using multiple cores may have an overall negative performance costs.

All these make C++ parallel algorithms not such a great addition in the concurrency toolkit. They are ok, they are needed, but it’s still not enough. I would argue that basic executors support would have been a better addition to the standard. But the executors didn’t make it into the 2020 standard. So, let’s wait to see what the future will reserve us.

References

[concore] Lucian Radu Teodorescu, Concore library, https://github.com/lucteo/concore

[Filipek17a] Bartlomiej Filipek, C++17 in details: Parallel Algorithms, 2017, https://www.bfilipek.com/2017/08/cpp17-details-parallel.html

[Filipek17b] Bartlomiej Filipek, The Amazing Performance of C++17 Parallel Algorithms, is it Possible?, 2017, https://www.bfilipek.com/2017/08/cpp17-details-parallel.html

[Henney17] Kevlin Henney, Thinking Outside the Synchronisation Quadrant, ACCU 2017 conference, 2017, https://www.youtube.com/watch?v=UJrmee7o68A

[Lelbach16] Bryce Adelstein Lelbach, C++ Parallel Algorithms and Beyond, CppCon 2016, 2016, https://www.youtube.com/watch?v=Vck6kzWjY88

[ONeal18] Billy O’Neal, Using C++17 Parallel Algorithms for Better Performance, Microsoft C++ Team Blog, 2018, https://devblogs.microsoft.com/cppblog/using-c17-parallel-algorithms-for-better-performance/

[Parent16] Sean Parent, Better Code: Concurrency, code::dive 2016 conference, 2016, https://www.youtube.com/watch?v=QIHy8pXbneI

[Pike13] Rob Pike, Concurrency Is Not Parallelism, https://www.youtube.com/watch?v=cN_DpYBzKso

[Rodgers18] Thomas Rodgers, Bringing C++ 17 Parallel Algorithms to a Standard Library Near You, CppCon 2018, 2018, https://www.youtube.com/watch?v=-KT8gaojHUU

[Sutter05] Herb Sutter, ‘The free lunch is over: A fundamental turn toward concurrency in software’, Dr. Dobb’s journal, 2005

[Teodorescu20a] Lucian Radu Teodorescu, ‘Refocusing Amdahl’s Law’, Overload 157, June 2020

[Teodorescu20b] Lucian Radu Teodorescu, ‘The Global Lockdown of Locks’, Overload 158, August 2020

[Teodorescu20c] Lucian Radu Teodorescu, ‘Concurrency Design Patterns’, Overload 159, October 2020

Footnotes

I’m using Parallel STL interchangeably with C++ parallel algorithms.

Lucian Radu Teodorescu has a PhD in programming languages and is a Software Architect at Garmin. He likes challenges; and understanding the essence of things (if there is one) constitutes the biggest challenge of all.