Orchestrating concurrent tasks using mutexes is seldom efficient. Lucian Radu Teodorescu investigates design patterns that help unlock concurrent performance.
If you are a reader of Overload, then you probably know by now that mutexes should be avoided and tasks can be a viable alternative to them. If you are not an Overload reader, you are missing out ☺.
In the last two articles [Teodorescu20a] [Teodorescu20b], I tried to show that using tasks instead of mutexes is more performant and safer, and that tasks can be employed in all the places that mutexes can. Tasks are not the only alternative to mutexes, but they seem to be the most general one; to a large degree, one can change all programs that use mutexes to use tasks. In general, using tasks shifts the focus from the details of implementing multithreaded applications to designing concurrent applications. And, whenever the focus is on design, we can be much better at the task at hand – design is central to the software engineering discipline.
But, as tasks are not very widespread, people may not have sufficient examples to start working with tasks instead of mutexes. This article tries to help with this by providing a series of design patterns that can help ease the adoption of task systems, and that may, at the same time, improve general concurrency design skills. Even more fundamentally, it tries to show how applications can be designed for concurrency.
Concurrency vs. parallelism
There is widespread confusion in the software industry between concurrency and parallelism. Before moving forward with the article, it’s worth clarifying the distinction.
Parallelism is about running multiple things in parallel; concurrency is about ensuring that multiple things can run in parallel (or, more correctly, at the same time) – it is about the composition of independent processes¹. For parallelism, one needs at least two CPU cores; concurrency, on the other hand, can be expressed on a single core too. Having two threads doesn’t imply parallelism, but it does imply concurrency. So, parallelism implies concurrency, but not the other way around. To benefit from parallelism, one needs to design for concurrency.
From a performance point of view, one wants parallelism, but needs to design the code for concurrency. From a design point of view, we should mostly be concerned only with concurrency: at design time, it’s not clear whether, at run-time, the hardware will be there to run things in parallel. The goal of concurrency is to structure programs, not necessarily to make them run with more parallelism – in the best case, it is an enabler for parallelism.
See [Pike13] for a better-articulated distinction between the two concepts.
We are focusing here on the design aspects, on how to express concurrent processes, and, therefore, we will ignore parallelism for the most part.
Approach
Drawing inspiration from Christopher Alexander [Alexander77], Gamma et al. [Gamma94] popularized the idea that the design of software systems can be greatly improved by using patterns. Instead of working out all the details of a software system from scratch, one can draw on various design patterns to speed up the design process. In some sense, patterns are a formalization of collective experience; using this experience can greatly help the design process.
Here, we aim at leveraging patterns as a compact way of transmitting a body of experience in designing concurrent systems. Mixing and matching these patterns can help solve a large variety of concurrency problems. Moreover, as we present some fundamental patterns, I would argue that they can help solve any concurrency problem – maybe, in some cases, not with the best solution, but still a solution. The reader is strongly encouraged to see these patterns as building blocks, and to start playing with them to build more and more complex concurrency systems.
Because of space constraints, we will present a compact version of each pattern. We will say a few words about what each pattern is about, when to use it and, when appropriate, some key points. Diagrams help a lot when reasoning about the design of a system, so we’ll include a diagram for each pattern we discuss. Also, for each pattern we’ll provide a short example in C++ code, using my Concore library [concore]². One of the benefits of these examples is to show that, with an appropriate library, one can easily express concurrency in C++. After all, concurrency doesn’t need to be one of the hardest areas in computer science.
Basic concurrency patterns
Create concurrent work
Description. Allows the creation of concurrent work; increases concurrency. Creates new tasks (typically two or more) from the existing task.
Representation. See Figure 1. We represent this pattern by multiple arrows originating from the single task, leading to new tasks. There can be more than two tasks created from one task, so there can be more than two arrows.
Figure 1
Example. See Listing 1. In the body of the first task (the start() function), we do some work, then we spawn two new tasks; the two tasks can be executed concurrently. For spawning the two tasks we use lambdas, as they are perfect for the job. Tasks are implemented as functors that take no arguments and return nothing. (We can build tasks that pass values around by binding values to the lambdas used to create the tasks; a small sketch of this follows Listing 1.)
void start() {
  initComponents();
  concore::spawn([]{ loadAssets(); });
  concore::spawn([]{ initialiseComputationEngine(); });
}

Listing 1
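As a small illustration of the parenthetical note above, here is a hypothetical sketch of passing a value into a task by capturing it in the lambda (assetId and loadAsset() are assumed names, not part of Listing 1):

  // pass data into a task by value-capturing it in the lambda
  int assetId = 42;
  concore::spawn([assetId] { loadAsset(assetId); });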
Discussion. There are multiple ways in which a task can be given to the task system; this example uses the spawn function. Another way to do it is to use a global_executor or some other type of executor. Executors, in Concore, can be used to customize how tasks are executed (they are somewhat similar to the executors proposed for the C++ Standard but, at least for the moment, they are not the same).
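To make this concrete, here is a rough sketch of submitting work through an executor. The exact invocation syntax below is an assumption on my part – check the Concore documentation for the real API:

  // assumed API: a default-constructible executor with an execute() member
  auto ex = concore::global_executor{};
  ex.execute([]{ loadAssets(); });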
Key point. It’s important to ensure safety for the concurrent tasks that are created; i.e., there should be no race condition bugs between the set of tasks that are created. See [Teodorescu20b] for more details. In our example, loadAssets() should not interfere with initialiseComputationEngine().
When to use. To increase concurrency in an application, this should be used as often as correctness allows (and performance does not degrade). With respect to performance, it is often better to over-split the work into multiple concurrent processes than to split it less than needed – it’s much easier to combine work later on than to split it further. But, if we have enough tasks in the system and we want to maximize performance, considering the results from [Teodorescu20a], we should split the work so that the sizes of the tasks are several orders of magnitude higher than the overhead generated by the task framework.
Continuation
Description. Allows ‘do X after finishing Y’ workloads to be encoded. Allows decoupling between the first task and its continuation (see [Wikipedia]). Can also be used to split longer tasks into smaller ones, for performance reasons.
Representation. See Figure 2.
Figure 2
Example. Listing 2 shows an example of how this pattern can be used; we have an HTTP engine that makes async requests, and whenever the response comes back, it executes a continuation; the example shows just how the continuation is started, and how the top-level API can be used. One can see that the HTTP implementation is decoupled from the response handling logic; the latter is passed as a continuation to the HTTP engine.
void handleResponse(HttpResponseData respData, HandlerType handler) {
  // the work for this task: process the response
  HttpResponse resp = respData.toResponse();
  // create a continuation to handle the response
  concore::task cont{[resp = std::move(resp), handler] {
    handler(resp);
  }};
  concore::spawn(std::move(cont));
}

void httpAsyncCall(const char* url, HandlerType handler) {
  // does HTTP logic, and eventually async-calls handleResponse()
}

void useHttpCode() {
  // the work to be executed as a continuation
  HandlerType handler = [](HttpResponse resp) {
    printResponse(resp);
  };
  // call the HTTP code asynchronously, passing the continuation work
  httpAsyncCall("www.google.com", handler);
  // whenever the response comes back, the above handler is called
}

Listing 2
Discussion. This is similar to the creation of concurrent work, but we just create one follow-up task. In this pattern, the follow-up task is always started at the end of the first task.
When to use. Mostly when we need to decouple two actions that need to be executed serially. Sometimes also when we just need to break a long task into smaller tasks (which can improve performance by giving more flexibility to the task scheduler); a minimal sketch of this follows.
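Assuming hypothetical processPart1()/processPart2() helpers, splitting a long task into two smaller ones might look like this:

  concore::spawn([] {
    processPart1();
    // schedule the second half as a continuation instead of running
    // it inline; the scheduler is free to interleave other tasks
    concore::spawn([] { processPart2(); });
  });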
See also. Create concurrent work, serializers.
Join
Description. Allows the execution of work whenever multiple concurrent tasks have finished. The inverse of the create concurrent work pattern.
Representation. See Figure 3.
Figure 3
Example. Listing 3 shows how the example from Listing 1 can continue; when both tasks are complete, a finish_task is started. At the end of each task, they have to notify an event object to ensure that the finish_task starts at the right time.
concore::finish_task doneTask([]{
  listenForRequests();
}, 2); // waits on 2 tasks

// spawn 2 tasks
auto event = doneTask.event();
concore::spawn([event] {
  loadAssets();
  event.notify_done();
});
concore::spawn([event] {
  initialiseComputationEngine();
  event.notify_done();
});
// when they complete, the done task is triggered

Listing 3
Discussion. The way this pattern is expressed, we create a task whenever the previous tasks are done, as opposed to somebody waiting for the tasks to be complete. In task-based programming, people are encouraged to minimize the number of waits in favour of creating new tasks.
When to use. Whenever several tasks need to complete before starting another task (e.g., they compute something that is needed for the successor task).
Fork-join
Description. This pattern is essentially a combination of the create concurrent work and join patterns. But we present it here separately for its peculiar way of handling the stack and continuing the work: the thread that created the work also waits for the work to be completed, and its stack remains intact. This wait is actually a busy-wait. See [McCool12], [Robison14] for more details.
Representation. See Figure 4.
Figure 4
Example. Listing 4 shows a recursive generic algorithm that applies a functor over a range of integers. If the interval is large enough, it divides it and recursively applies the functor over the two halves. It’s important to notice that, even though this works with tasks, the stacks for all the recursive calls are kept alive by the wait() calls.
template <typename F>
void conc_apply(int start, int end, int granularity, F f) {
  if (end - start <= granularity)
    for (int i = start; i < end; i++)
      f(i);
  else {
    int mid = start + (end - start) / 2;
    auto grp = concore::task_group::create();
    concore::spawn([=] {
      conc_apply(start, mid, granularity, f);
    }, grp);
    concore::spawn([=] {
      conc_apply(mid, end, granularity, f);
    }, grp);
    concore::wait(grp);
  }
}

Listing 4
Discussion. As tasks are functions that take no parameters and return nothing, the way to pass information between tasks is via variables captured in the given functors/lambdas. Typically, if the stack is not available, the data passed between tasks needs to be allocated on the heap. By keeping the stack around, this pattern allows the user to avoid allocating data on the heap. It also simplifies the handling of the data (i.e., there is no need to pack the data into additional structures). This can save a lot of development time when one wants to improve concurrency for a piece of code found at the bottom of a somewhat larger callstack.
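As a usage sketch for Listing 4: because the wait() inside conc_apply keeps the caller’s stack alive, the lambda can safely capture locals by reference (results and computeValue() are assumed names):

  std::vector<int> results(1000);
  // each iteration writes a distinct element, so there is no race;
  // 'results' outlives all spawned tasks, thanks to the wait
  conc_apply(0, 1000, 50, [&](int i) { results[i] = computeValue(i); });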
Key point. This pattern waits on the caller thread/task. But it’s important to realize that this is a busy-wait: if the caller cannot execute the tasks that were just forked, it will attempt to execute other tasks from the system, in the hope that the forked tasks finish as soon as possible. While this keeps throughput constant, it may slightly hurt the latency of certain operations.
When to use. Whenever the fork and the join need to happen in the same area of code, whenever we want to take advantage of the stack, or whenever it’s too complex to refactor the code to use continuations and regular join patterns.
Designing with basic patterns
After describing these basic patterns, we should pause and reflect on their usage. They can be combined in many ways to describe any problem that can be expressed as a directed acyclic graph. Moreover, with a little creativity (i.e., creating some helper control tasks), we can also handle arbitrary restrictions [Teodorescu20b]. This means that these four basic patterns can be used to implement any concurrency problem. This is a powerful design tool.
Derived patterns
Task graph
Description. Allows the expression of directed acyclic graphs of tasks directly.
Representation. See Figure 5 for an example of a task graph.
Figure 5
Example. Listing 5 shows an example of how one can code the graph from Figure 5. After constructing the tasks, one can set the dependencies between the tasks to match the desired graph. To start executing, one has to schedule the first task from the graph.
std::shared_ptr<RequestData> data = CreateRequestData();
// create the tasks
concore::chained_task t1{[data] { ReadRequest(data); }};
concore::chained_task t2{[data] { Parse(data); }};
concore::chained_task t3{[data] { Authenticate(data); }};
concore::chained_task t4{[data] { StoreBeginAction(data); }};
concore::chained_task t5{[data] { AllocResources(data); }};
concore::chained_task t6{[data] { ComputeResult(data); }};
concore::chained_task t7{[data] { StoreEndAction(data); }};
concore::chained_task t8{[data] { UpdateStats(data); }};
concore::chained_task t9{[data] { SendResponse(data); }};
// set up dependencies
concore::add_dependencies(t1, {t2, t3});
concore::add_dependencies(t2, {t4, t5});
concore::add_dependency(t4, t7);
concore::add_dependencies({t3, t5}, t6);
concore::add_dependencies(t6, {t7, t8, t9});
// start the graph
concore::spawn(t1);

Listing 5
Discussion. The defined graph must be acyclic; otherwise, the tasks will not run. It’s also worth noting that a task graph doesn’t necessarily need to start with just one task; one can have graphs with multiple starting points, which allows the modelling of much more complex flows, as sketched below.
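A minimal sketch of a graph with two starting points, both feeding a common successor (the task and function names here are hypothetical):

  concore::chained_task tA{[]{ readConfig(); }};
  concore::chained_task tB{[]{ openDatabase(); }};
  concore::chained_task tC{[]{ startServing(); }};
  // tC runs only after both tA and tB complete
  concore::add_dependencies({tA, tB}, tC);
  // start both entry points of the graph
  concore::spawn(tA);
  concore::spawn(tB);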
When to use. Whenever the execution flow is clear upfront and/or the graph is somewhat more complex.
Pipeline
Description. Allows the expression of data pipelines that can process items concurrently.
Representation. See Figure 6 for an example of a pipeline with 4 stages, 2 in order and 2 concurrent.
Figure 6
Example. Listing 6 shows a classic pipeline with Fetch, Decode, Execute and Write stages. The Fetch and Write stages need to process the items in the order in which they are pushed to the pipeline, but the other two stages can execute concurrently. For any item pushed through the pipeline, all the stage functions will be executed in order; they all receive a shared pointer to the same data. We gain concurrency by allowing multiple items to be in the Decode and Execute stages at the same time. The execution of the Fetch and Write stages is serialized, and the items there are processed in the order in which they are pushed.
using LinePtr = std::shared_ptr<LineData>;
auto my_pipeline = concore::pipeline_builder<LinePtr>()
  | concore::stage_ordering::in_order
  | [](LinePtr line) { Fetch(std::move(line)); }
  | concore::stage_ordering::concurrent
  | [](LinePtr line) { Decode(std::move(line)); }
  | [](LinePtr line) { Execute(std::move(line)); }
  | concore::stage_ordering::in_order
  | [](LinePtr line) { Write(std::move(line)); }
  | concore::pipeline_end;
for (int i = 0; i < num_lines; i++)
  my_pipeline.push(get_line(i));

Listing 6
Discussion. Each item that goes through a pipeline must follow a certain number of stages, in sequence. But, in some cases, several items can go through the pipeline concurrently. A pipeline can typically limit the maximum number of items that are processed concurrently. In a classical pipeline, processing items at any stage is ordered, but one may want to relax these restrictions. In the above example, we enforce the Fetch and the Write stages to be ordered, but we didn’t impose any limit on the middle stages; the middle stages are allowed to be fully concurrent. Between an ordered restriction (first and last stages) and no restrictions at all, there is another type of restriction that one may want to use: out-of-order serial. In this mode, the system is allowed to execute at most one task per stage, but it doesn’t need to be in order.
Key point. This pattern is a great example of how to take an apparently sequential system and add concurrency to it. Improving concurrency is directly related to relaxing some of the constraints of the original model. The first constraint we drop is that at most one item can be in flight at a time; it turns out that, if we keep the input and the output stages ordered, most of the time we don’t need this constraint. Also, if one can move most of the work into the fully concurrent stages of a pipeline, concurrency can improve a lot; for example, if more than half of the total time is spent in the concurrent stages, then, given enough items flowing through the pipeline, concurrency will steadily grow.
When to use. Whenever we have a process that needs to execute a sequence of steps over a series of items, and some of the steps can run concurrently.
Serializers
Serializers are presented in detail in the previous article [Teodorescu20b], so we won’t cover them here. The main idea we want to stress is that they are frequently-used design patterns in the concurrency world. In my experience, a serializer is one of the most common first choices when expressing constraints between tasks; maybe a bit too often.
Data transformation patterns
Concurrent for
Description. Allows the concurrent execution of a body of work over a collection of items, or transforming a collection of elements with a mapping function.
Representation. See Figure 7 for a representation of a data transformation. Yellow/light circles and diamonds represent data.
Figure 7
Example. Listing 7 shows how one can apply a transformation to a collection of elements.
std::vector<int> ids = getAssetIds();
int n = ids.size();
std::vector<Asset> assets(n);
concore::conc_for(0, n, [&](int i) {
  assets[i] = prepareAsset(ids[i]);
});

Listing 7
Discussion. This looks very much like a for structure in which all the iterations can be executed concurrently. Because of that, it’s probably the easiest form of concurrency.
When to use. Whenever the iterations of a for loop are independent of each other.
Concurrent reduce
Description. A concurrent version of std::accumulate, allowing reduction over a collection of items.
Representation. Figure 8 shows the inner tasks involved in reducing a collection of 4 elements.
Figure 8
Example. Listing 8 shows how one can use concurrent reduce operations to compute the total memory consumption for a vector of resources. It’s assumed that getting the memory consumption of one resource is independent of getting the memory consumption for another resource.
std::vector<Resource> res = getResources();
auto oper = [&](int prevMem, const Resource& res) -> int {
  return prevMem + getMemoryConsumption(res);
};
auto reduce = [](int lhs, int rhs) -> int {
  return lhs + rhs;
};
int totalMem = concore::conc_reduce(res.begin(), res.end(), 0, oper, reduce);

Listing 8
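For comparison, here is a sketch of the serial counterpart using std::accumulate (requires <numeric>; same assumed helper names as Listing 8). It needs only the first operation; the extra reduce operation is what lets the concurrent version combine partial results computed on different threads:

  int totalMemSerial = std::accumulate(res.begin(), res.end(), 0,
      [](int prevMem, const Resource& r) {
        return prevMem + getMemoryConsumption(r);
      });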
Discussion. This is similar to a concurrent for, but it also reduces the results obtained from all the iterations into one value. The reduction is not performed linearly, as it would be in a traditional for loop; partial results are combined in a tree-like fashion – e.g., (x1 ⊕ x2) ⊕ (x3 ⊕ x4) instead of ((x1 ⊕ x2) ⊕ x3) ⊕ x4 – and therefore the reduction operation needs to be associative.
When to use. Whenever one needs a reduction over a collection, and the operations involved can run concurrently.
Concurrent scan
Description. Allows the concurrent execution of loops in which each iteration depends on results computed in the previous iteration; implements a concurrent version of std::partial_sum.
Representation. Figure 9 shows the processing needed to apply a concurrent scan over 8 elements.
Figure 9
Example. Listing 9 shows an example in which we have a vector of feature vectors, and we want to successively combine these, and keep all the intermediate results.
std::vector<FeatureVector> in = getInputData();
std::vector<FeatureVector> out(in.size());
auto op = [](FeatureVector lhs, FeatureVector rhs) -> FeatureVector {
  return combineFeatures(lhs, rhs);
};
concore::conc_scan(in.begin(), in.end(), out.begin(), FeatureVector(), op);

Listing 9
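For comparison, a rough serial counterpart can be written with std::partial_sum (requires <numeric>; same assumed helper names as Listing 9, and modulo the handling of the initial value):

  std::partial_sum(in.begin(), in.end(), out.begin(),
      [](const FeatureVector& lhs, const FeatureVector& rhs) {
        return combineFeatures(lhs, rhs);
      });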
Discussion. Naively, as one needs the result of the previous iteration in the current iteration (i.e., out[i] = out[i-1] ⊕ in[i]), one may think that this cannot be done concurrently. But, if the operation we apply is associative, this too can be transformed into a concurrent algorithm.
Key point. The implementation of this concurrent algorithm does more work than its serial counterpart. Adding more work is sometimes used to add concurrency to an algorithm; the hope is that even with the added work, the added concurrency will make the algorithm faster.
When to use. Whenever one needs a prefix-sum type of algorithm, and its speed can be improved by running it in parallel.
Other patterns
There are a lot of other patterns used in concurrent design. [McCool12] provides a much more extensive collection of patterns to be used in the concurrent world. Stencil, geometric decomposition, pack, expand, gather and scatter are some of the patterns described in the book that we haven’t touched on at all here.
Although not expressed in terms of tasks, [Buschmann07] also provides a series of patterns that can be used to build concurrency. Some of the patterns there encourage people to use synchronisation primitives but, as previously shown [Teodorescu20b], one can always model them with tasks.
I believe that, by now, the reader should have a good intuition of how serial programming patterns can be transformed into concurrent patterns, and of how expressing problems in terms of tasks can ease the design of concurrent applications.
Final words
This article provides a short catalogue of design patterns applicable to building concurrent applications. It can serve both as a quick introduction to designing concurrent applications for those who have never designed with tasks, and as a refresher for those who are experienced with task-based programming.
One important point the article tries to make is that there is no need to use synchronisation primitives when designing for concurrency; a task-based system is enough for the job. Moreover, the patterns exposed here try to highlight that, in some cases, designing with these patterns is much easier than doing multithreading with explicit threads and synchronisation primitives.
From a design perspective, it’s much easier to reason in terms of these patterns than to reason about systems built with synchronisation primitives. If the preconditions are met (tasks are independent, as required by the patterns), then one can fully reason about the implications of using such a pattern. There are no hidden dependencies with other parts of the system (as opposed to a lock-based system; see [Lee06]). One can say that the concurrency of such a pattern is fully encapsulated within the pattern itself. This is a huge step forward for designing concurrent applications.
References
[Alexander77] Christopher Alexander, A Pattern Language: Towns, Buildings, Construction, Oxford University Press, 1977
[Buschmann07] Frank Buschmann, Kevlin Henney, Douglas C. Schmidt, Pattern-Oriented Software Architecture: A Pattern Language for Distributed Computing (Volume 4), Wiley, 2007
[concore] Lucian Radu Teodorescu, Concore library, https://github.com/lucteo/concore
[Gamma94] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley Professional, 1994
[Hoare85] C.A.R. Hoare, Communicating Sequential Processes, Prentice-Hall; http://www.usingcsp.com/cspbook.pdf
[Lee06] Edward A. Lee, The Problem with Threads, Technical Report UCB/EECS-2006-1, 2006, https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf
[McCool12] Michael McCool, Arch D. Robison, James Reinders, Structured Parallel Programming: Patterns for Efficient Computation, Morgan Kaufmann, 2012
[Pike13] Rob Pike, ‘Concurrency Is Not Parallelism’, https://www.youtube.com/watch?v=cN_DpYBzKso
[Robison14] Arch Robison, A Primer on Scheduling Fork-Join Parallelism with Work Stealing, Technical Report N3872, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3872.pdf
[Teodorescu20a] Lucian Radu Teodorescu, ‘Refocusing Amdahl’s Law’, Overload 157, June 2020, available at https://accu.org/journals/overload/28/157/teodorescu_2795/
[Teodorescu20b] Lucian Radu Teodorescu, ‘The Global Lockdown of Locks’, Overload 158, August 2020, available at https://accu.org/journals/overload/28/158/teodorescu/
[Wikipedia] Wikipedia, ‘Continuation-Passing Style’, https://en.wikipedia.org/wiki/Continuation-passing_style
Footnotes
1. We use the term processes here in the same way that Hoare uses it in his seminal book [Hoare85]: it means any body of work; not to be confused with OS processes.
2. Concore is not yet a production-ready library; things may change in the future, both in terms of API and of features.
Lucian Radu Teodorescu has a PhD in programming languages and is a Software Architect at Garmin. In his spare time, he is working on his own programming language and improving his Chuck Norris debugging skills: staring at the code until all the bugs flee in horror.