Concurrency suggests you can’t be sure what order instructions happen in. Lucian Radu Teodorescu shows how concurrency actually gives a strict partial ordering.
I’ve long been attracted to a quote often attributed to Einstein (whether or not he said it, it’s good advice):
Everything should be made as simple as possible, but no simpler.
I take it as a guiding principle for intellectual work on subjects with high inherent complexity. Software is one such subject; Brooks famously called it essential complexity [Brooks95]. Moreover, concurrency is one of the sharpest edges of that complexity.
I have been writing about concurrency in Overload for years, but those articles were mostly practical: tutorials, patterns, pitfalls, and implementation techniques. Here I want to step back and talk about the essence of concurrency.
I will build on the work of Tony Hoare and collaborators on the laws of programming with concurrency [Hoare09, Hoare11, Hoare13, Hoare15] (the talks are particularly enjoyable). I’ll also borrow Leslie Lamport’s viewpoint that concurrency is about the ordering of events as presented in the ‘Time, clocks, and the ordering of events in a distributed system’ article [Lamport78].
The central thesis of this article is simple: Concurrency is strict partial ordering.
Once you start seeing programs as collections of work items related by a strict partial order <, much of the apparent complexity becomes a matter of making ordering constraints explicit, manipulating them algebraically, and choosing implementations that respect them.
Some parts are a bit formal, but the payoff is a shift in perspective. Rather than treating concurrency as a bag of mechanisms (threads, locks, async, executors), we treat those mechanisms as different ways of expressing and enforcing the same ordering structure.
TL;DR
- Concurrency = strict partial ordering between work items.
- Sequential (;) and concurrent (||) composition are just ways of adding ordering constraints.
- The exchange law is the key bridge between sequential and parallel structure; it enables modular reasoning.
- Many familiar constructs (fork–join, joins, mutexes, serialisers, schedulers) can be understood as mechanisms that add, remove, or delay ordering constraints.
- Non-local scheduling amounts to repeatedly selecting the set of currently-ready work items.
- Concurrency can be encapsulated by aligning scheduling boundaries with dependency boundaries.
What is concurrency?
A program is non-concurrent if its work items are totally ordered: for any two distinct work items a and b, either a < b or b < a.
A program is concurrent if the execution of work items is governed by a strict partial ordering relation. In this case, we have the following three execution possibilities:
- a < b
- b < a
- neither a < b nor b < a
The third option is specific to concurrent execution. In this case, the execution of a may proceed concurrently with the execution of b.
This is the essence of concurrency. Much of what we call concurrency either follows from this ordering view, or belongs to the choice of implementation.
By the definition of strict partial ordering, we have the following properties:
- irreflexivity: ¬(a < a)
- asymmetry: if a < b then ¬(b < a)
- transitivity: if a < b and b < c then a < c
These properties rule out cycles such as a < b < a, which would be an impossible execution constraint.
When we say a < b, we mean only that a is ordered before b. We do not assume a particular memory-visibility or synchronisation model (unlike C++’s happens-before [C++Ref]).
Hoare’s laws of concurrency
Hoare’s laws of programming with concurrency [Hoare09, Hoare11, Hoare13, Hoare15] provide an algebraic vocabulary for talking about concurrent structure without committing to a particular programming language, runtime, or hardware model. The aim is not to model every low-level effect (such as memory visibility), but to capture the high-level ordering constraints that make a program more or less concurrent (where more concurrent means fewer ordering constraints).
We will work with a small set of operators (;, ||, 1) and a refinement relation => that lets us compare designs, programs, and executions. The central idea is that many familiar concurrency patterns (fork–join, barriers, mutexes, serialisers, schedulers) can be understood as ways of adding or removing ordering constraints.
This approach complements the definition above: from the structure of a term built with ; and || we can read off a dependency graph (a strict partial order) between work items.
Formalism
Let variables p, q, r, … stand for specifications, designs, programs, parts of programs or work items. They describe behaviours of a computer that are desired, planned, or actually executed when the program is executed.
We define sequential composition, p ; q, as the operation resulting from executing operation q immediately after operation p. From the definition it follows that p < q. A simple example of sequential composition is calling two functions, one after another, in a C-like language: f(); g();.
We also define concurrent composition, p || q, as the operation that executes both p and q at the same time (concurrently). That is, we express the relation ¬(p < q) ∧ ¬(q < p) (no ordering constraint between p and q). As an example, one can imagine starting two threads at the same time, one executing p and the other one executing q.
Besides these two operations, we also define 1 as the unit operation, the operation that does nothing if executed.
With these defined, we have the following axioms:
- (p ; q) ; r = p ; (q ; r) – associativity of ;
- (p || q) || r = p || (q || r) – associativity of ||
- p ; 1 = 1 ; p = p – 1 is the unit with respect to ;
- p || 1 = 1 || p = p – 1 is the unit with respect to ||
- p || q = q || p – || commutes
The astute reader might have noticed that our exposition (which follows Hoare’s exposition) mixes specifications with programs, and expected behaviour with actual behaviour. This is intentional. The laws described here are very powerful and work at different levels, and thus cover a lot of ground. In practice, the intended reading will be clear from context.
Having our variables stand for multiple kinds of entities allows us to make use of refinement. We say that p refines q, noted as p => q, if every execution described by p is also described by q. Equivalently, p allows fewer behaviours than q. This means that p is more determinate, while q is more abstract. Another way to view this relation is that p adds more constraints than q has.
The following are examples of p => q refinement:
- p is a program that meets specification q;
- p is a program execution of the program q;
- design p is more constrained than the design q.
Refinement is a weak partial order relation, thus it obeys the following axioms:
- reflexivity – p => p;
- anti-symmetry – if p => q and q => p, then p = q;
- transitivity – if p => q and q => r, then p => r.
An operator • is called monotonic if p => q implies p • r => q • r and r • p => r • q. Monotonicity is closely related to composability; if a system has two components combined with a monotonic operator, and we improve one component, then the entire system is better.
In our case, both ; and || are monotonic. This means that:
- p => q implies p ; r => q ; r and r ; p => r ; q
- p => q implies p || r => q || r and r || p => r || q
And now, we finally reach the exchange axiom, which is (in Hoare’s own words) a “remarkably powerful law”. This axiom states:
(p || q) ; (p' || q') => (p ; p') || (q ; q')
Intuitively, exchange explains when global phase ordering can be replaced by independent componentwise progress.
Putting this axiom into a graphical form, we get Figure 1 and Figure 2 (corresponding to the left and right sides of the axiom). Figure 1 refines Figure 2: it imposes all constraints of Figure 2, plus additional cross-component constraints. In our case, the ordering constraints shown in Figure 2 are p < p' and q < q'. In addition to those, Figure 1 also adds the following constraints: p < q' and q < p'.
| Figure 1 |
| Figure 2 |
We’ve arrived at an important intuition: there is a correspondence between laws of concurrency (as expressed in algebra) and what can be expressed by the strict partial ordering. We are not going to prove this here.
Derived laws
Substituting 1 into the terms of the exchange law allows us to directly obtain the following laws:
- (p || q) ; q' => p || (q ; q')
- (p || q) ; p' => (p ; p') || q
- p ; q => p || q
Remember: p => q means “p is more constrained than q”. So p ; q => p || q says that sequential execution is a special case of concurrent execution with an extra ordering constraint.
For example, to show the law in the first bullet, we can substitute p' with 1 in the exchange law, and thus obtain: (p || q) ; (1 || q') => (p ; 1) || (q ; q'). After simplifying 1, we obtain: (p || q) ; q' => p || (q ; q'), which is exactly the law in the first bullet.
The first two bullets express that sequential progress of one component distributes through a concurrent context. The third bullet point states that purely sequential composition is more constrained than unconstrained concurrency.
Hoare also shows that the modularity rule is equivalent to the exchange law. The modularity rule says that if p ; q => r and p' ; q' => r', then (p || p') ; (q || q') => r || r'. The modularity rule allows separate reasoning about two parts of a concurrent program, and then combining the results.
From p ; q => r and p' ; q' => r', monotonicity of || gives (p ; q) || (p' ; q') => r || r'. Exchange gives (p || p') ; (q || q') => (p ; q) || (p' ; q'). By transitivity, we get (p || p') ; (q || q') => r || r', which is the conclusion of the modularity rule.
Benefits of the exchange law
The exchange axiom is the fundamental law relating sequential and concurrent composition. Without it, the operators ; and || would be algebraically independent, and no general reasoning principle could connect sequential and concurrent structure. The axiom therefore provides the essential connective tissue of the algebra.
The exchange law enables compositional reasoning about concurrent systems. We may reason about the evolution of each concurrent component independently and then recombine the results. In refinement terms, the behaviour of the whole system is determined by the componentwise progress p < p' and q < q'. This captures the core intuition of concurrency: independent components may advance independently. For this reason, the exchange law is often explained as a kind of distributivity principle for concurrent progress.
Exchange expresses that ordering constraints propagate locally along each component. The right-hand side constrains only p < p' and q < q', while the left-hand side introduces additional cross-component constraints (p < q', q < p'). Thus exchange states that a system whose components evolve independently is less constrained (more abstract) than one that enforces global phase ordering.
This supports locality: improving one component can improve the whole without global restructuring.
The law tells us that concurrency may be realised by interleaving (when the platform supports it). Also, because exchange is a refinement law, it licenses correctness-preserving program transformations. For example, parallel blocks may be reordered or fused by replacing a more constrained execution structure with a less constrained one (when the dependency graph permits it). Many optimisation patterns for task graphs and pipelines are instances of exchange.
In summary, the exchange axiom captures the essential property of concurrency: systems composed of independent components evolve componentwise.
Relations to other formalisms
The formalism presented here is a streamlined retelling of Hoare’s work on Concurrent Kleene Algebra [Hoare09, Hoare11]. Hoare not only developed the algebraic laws used above, but also established their correspondence with partial-order models of concurrency, such as pomsets (partially ordered multisets of events). In that setting, a program denotes a set of event occurrences equipped with a strict partial order, and sequential and concurrent composition correspond to adding or omitting ordering constraints between them.
Hoare also connects this algebra to the compositional spirit of Hoare logic [Wikipedia-1]. In Hoare logic, reasoning scales by composing proofs along program structure; here, refinement and monotonicity ensure that improvements to subterms lift to improvements of whole terms. The exchange law plays the role that makes sequential and concurrent reasoning interact smoothly.
Hoare further relates this algebra to Milner’s Calculus of Communicating Systems (CCS) [Wikipedia-2]. Both frameworks admit operational (transition-based) and partial-order interpretations, and refinement in the algebra aligns with behavioural inclusion in process calculi.
Taken together, these connections show that the same structure reappears across algebraic, logical, and operational accounts of concurrency. That convergence strengthens the claim that strict partial ordering lies close to the essence of concurrency.
Implementation strategies
Now that we’ve looked at the theoretical foundations, let’s reinterpret a few common implementation strategies through the lens of algebra and strict partial ordering.
Fork-join model
Consider the following program [Wikipedia-3]:
a()
let f = spawn { b() }
c()
_ = f.await()
d()
This is a fork–join pattern: b() runs concurrently with the continuation that executes c(), and await() acts as the join.
In our algebra, this corresponds to: a ; (b || c) ; d, and it induces the ordering constraints:
- a < b and a < c
- b < d and c < d
- neither b < c nor c < b
(In this translation to algebra, we did not translate spawn and await, as we assume implementations can implement them at negligible costs; we only focused on the meaningful work. If, instead, we wanted to model a more accurate execution, the work of spawn would need to be captured in a, b and c, while the work for await needs to be captured in b, c and d.)
The particular surface syntax is not important. The same structure can be expressed as a library feature, with C++ senders/receivers, or with async/await [Wikipedia-4].
We assume the join does not require a blocking wait. A common implementation strategy is to ensure that whichever thread finishes last (executing either b() or c()) continues with d(). This may run d() on a different thread than a(), but the induced ordering constraints are unchanged.
This way of expressing concurrency closely resembles our operators for sequential and concurrent composition. This is the simplest structured-concurrency building block. Glossing over some details, from a principled point of view all concurrent programs can be built on top of this structure (we leave the proof as an exercise for the reader).
Thread spawning and joining
We can encode the same high-level pattern in C++ using std::thread:
a();
auto t = std::thread{[]{ b(); }};
c();
t.join();
d();
Relative to the fork–join model above, there are two important differences: 1) join() may block the calling thread; 2) d() is guaranteed to execute on the same thread as a().
The second difference is mostly an implementation detail; it doesn’t affect the ordering constraints we care about. The first one matters, because a blocking join introduces an additional work item: waiting.
Let # denote a blocking wait whose duration depends on the completion of some concurrent work. Then the main thread’s behaviour is: a ; c ; # ; d, where the execution of # depends on the execution of b. One convenient algebraic representation is: a ; (b || (c ; #)) ; d. The dependency ‘# waits for b’ is part of the intended interpretation of #, not something expressed directly by ; and ||.
Introducing the concept of time/waiting here is a slight departure from our formalism, in which time was not directly represented. We could have written the representation as a ; ((b ; i) || (c ; #)) ; d, where i=1, and added the constraint ¬(# < i). This says that # cannot finish before the other thread does, without using the notion of time. However, the reader will probably find it easier if we mention time explicitly.
This has the same ordering constraints as fork–join plus the fact that the main thread spends time in #. Two practical consequences follow:
- Performance overhead: the blocked thread is occupied with waiting instead of doing useful work.
- Weaker locality: reasoning about the main thread now depends on the termination of
b(because#is coupled tob).
If blocking is acceptable (or performance is irrelevant), the extra work item # can be ignored in high-level reasoning, and the pattern reduces to the non-blocking fork–join constraints.
Mutexes
A mutex imposes an exclusivity constraint: two work items associated with the same mutex cannot be executed concurrently.
Let M be a mutex and let x, y, z, … be work items protected by it. We write x/M, y/M, z/M, … to indicate that these work items are mutually exclusive with respect to M. In other words, x/M is an operator that expands into # ; x, where # depends on other work items that may be executing under M. This is an implementation-level elaboration, not part of the core algebra. This operator adds extra ordering constraints between work items protected by it.
If x/M and y/M appear in context as a ; x/M ; b and c ; y/M ; d, then we have the constraints:
- a < x and x < b
- c < y and y < d
- x < y or y < x (exclusivity under M)
With more than one mutex, inconsistent constraints can arise, and thus we have a soundness issue. Consider two mutexes M1 and M2 and the concurrent program: P = (x/M1)/M2 || (y/M2)/M1.
Intuitively, this is the classic ‘lock-order inversion’ pattern. One way to expose the conflict in our framework is to insert units to make the phase structure explicit: P1 = (i ; x/M1)/M2 || (j ; y/M2)/M1, where i=j=1 (operations that don’t do anything).
From M2, we must have (i ; x) < y or y < (i ; x). From M1, we must have x < (j ; y) or (j ; y) < x. Now apply the exchange axiom to the units and the protected actions: P2 = (i || j) ; (x || y) => (i ; x) || (j ; y), which corresponds to the intuition that both protected actions can be enabled ‘after’ both units.
The reader can think of P2 as a possible interleaving of P1, and thus a valid execution of the program. P2 is more constrained than P1, thus it follows all the constraints from P1, plus some more:
- i < x
- j < y
- (i ; x) < y or y < (i ; x) (from M2)
- x < (j ; y) or (j ; y) < x (from M1)
- i < y and j < x (from P2)
Combining the last three bullets, we can reach both x < y and y < x simultaneously, yielding an impossible strict partial order. The program is unsound. In operational terms, this manifests as deadlock: there is no schedule that can satisfy all required constraints.
Like join(), mutex acquisition typically introduces waiting. This is because in expressions a ; x/M ; b, a typical implementation wants to execute x inline, as early as possible, before executing b, on the same thread as a and b. If another mutex-protected work item is executing under M, we need to delay the execution of x; the most common strategy is to perform a blocking wait. This corresponds to an execution of a ; # ; x ; b, where the length of # depends on other operations executed under M. This is a second way mutexes can harm both performance and locality: the waiting time is externally determined by other threads.
Serialisers
Serialisers are an alternative to mutual exclusion that avoids blocking by turning exclusivity into queued execution. In industry, similar ideas appear under different names (e.g., Intel oneTBB local serializer, Boost.Asio strands, and GCD dispatch queues [Intel] [AsioStrands] [Apple]).
To model a serialiser in our framework, we separate two distinct actions:
- enqueue: a local action that requests work to be run under the serialiser;
- run: the (possibly later) execution of that work under the serialiser.
Let S be a serialiser. For each work item x that is submitted to S, write x/>S for the enqueue action, and keep x for the eventual execution of the work. The key properties of the serialiser are:
- enqueue happens before the work can run:
x/>S < x; - the serialiser imposes a total order on the executions associated with
S: for anyx,ysubmitted toS, eitherx < yory < x.
With this notation, we can represent ‘submit x to serialiser S’ as x/>S. That is, x/>S records the local enqueue; the actual execution x will occur later, ordered after the enqueue and ordered relative to other work items executed under S.
If x/>S and y/>S appear in context as a ; x/>S ; b and c ; y/>S ; d, then we have the constraints:
- a < b and a < x/>S
- c < d and c < y/>S
- x/>S < x and y/>S < y
- x < y or y < x (serialisation under S)
Unlike mutexes, we do not require x < b (or y < d). The work item under the serialiser may run later, outside the lexical scope where it was enqueued. This is precisely how serialisers can avoid blocking waits: the producer proceeds, while the serialiser schedules the work item when it becomes eligible.
This non-local execution also avoids the deadlock pattern above: the program (x/>S1)/>S2 || (y/>S2)/>S1 can enqueue both work items without requiring an immediate consistent ordering between their executions. The local work of this program consists of the enqueue actions x/>S1 and y/>S2 (which are not blocking waits), while the executions x and y occur later, outside the lexical scope.
Serialisers therefore depart from the purely local patterns above: they introduce non-local concurrency, because work items execute outside the scope in which they were created. Handling such dynamic scopes is outside the scope of this paper; readers can consult C++26’s async_scope-style facilities (e.g., [P3149R11]) for one approach to scoping non-sequential concurrency.
Non-local concurrency
Non-local concurrent scheduling is the repeated application of a partition function over the set of work items. Let me explain…
Up to the serialiser examples, we have implicitly assumed that the set of work items is fixed and that the concurrency structure is visible in the syntax of the program. In other words, the dependency graph (the strict partial order) can be read directly from the term structure using ; and ||. In practice, however, the shape of a concurrent computation is often data-dependent: conditionals choose which work items exist, loops create unboundedly many instances, and higher-level schedulers (executors, queues, work-stealing pools) decide where and when enqueued work runs. Once scheduling decisions are driven by run-time state, the concurrency structure can no longer be captured by a single static term; it must be described as a dynamic process.
To reason about these dynamic processes, we could extend our algebra with new operators, but that would take too much space here. Instead, we will use the strict partial ordering as our main tool for reasoning about non-local concurrency. At any point at which we examine the work items, we can assume that they have been created dynamically.
Let W be the set containing all the work items in a program, for its entire execution. We treat W as the set of all work items that will ever exist in a complete execution (a hindsight view); at runtime only a prefix of W is known, but our reasoning quantifies over W to describe the evolving computation.
Concurrent scheduling is the part of the program that decides which work items in W may start executing next. If the work items and their dependencies are directly expressed in the control flow of the program, then the compiler largely determines the schedule, and dependency discovery is implicit in control flow.
Let s_t be an operation that schedules work items for execution, at some time t (we treat scheduler invocations as work items too). We have more of these operations in the program, but let’s focus on one. The scheduler invocation decides to start a set of work items; let’s denote that set by WS_t. Let’s also denote by WR_t the set of work items p for which s_t < p – this is the set of work items that remain to be scheduled. Intuitively, s_t is the now boundary: s_t < p means p is not started before this scheduling step. By construction we have WS_t ⊆ WR_t.
In other words, operation s_t partitions the set WR_t in two parts: WS_t and WR_t - WS_t. Thus, concurrent scheduling is repeated partitioning of the set of remaining work items.
While this partition can be implemented in many ways, we need to discuss:
- minimal guarantees to be provided by the partition function;
- an efficient implementation of such a function;
- when it makes sense to apply this partition function.
As minimal guarantee, the selected set WS_t must only contain elements p for which, if q < p, then q < s_t. In plain English, we cannot schedule for execution a work item for which predecessors haven’t yet completed.
When considering efficiency, in practice we often use a greedy scheduler that does not depend on the kinds of work items it schedules: at a given time it tries to start as many work items as possible. To make this precise, we assume that at time t every work item in W is in exactly one of three states: remaining (WR_t), executing (WE_t), or completed (W - (WR_t ∪ WE_t)). Under this assumption, the minimal guarantee from above (‘all predecessors are completed’) can be stated as: start only those p ∈ WR_t whose predecessors are not in WR_t ∪ WE_t. The greedy choice is then to take the largest such set, i.e. the set of all currently-ready work items: WS_t = { p ∈ WR_t | ∄ q ∈ (WR_t ∪ WE_t), q < p }. In practice, schedulers may have other considerations that put additional bounds on the set of work items that can be started: the number of workers available, queue capacity, priorities, affinity, etc.
The only other important question is: how often should we run the scheduler? Practical wisdom tells us that we should do it whenever new work items are created, and whenever they are completed. Indeed, those are the moments when WR_t ∪ WE_t from our exposition above might change. Running the scheduling algorithm more often doesn’t usually make sense, as we cannot possibly find new work items to schedule (again, in practice, we might have additional concerns like: resources becoming available, priorities changing, etc.).
Modularity
The discussion above on non-local concurrency might lead to the false impression that there must be a single global scheduler for all work items in the program. While such a design is possible, it would concentrate scheduling decisions in one place and make reasoning about independent parts of the program harder. A more modular approach is to allow multiple schedulers, each responsible for a subset of work items.
Recall that the strict partial ordering < induces a directed acyclic graph (DAG) of work items. In many realistic programs this graph is sparse: it is composed of loosely connected subgraphs with relatively few ordering relations between them. For example, work items protected by one mutex are often largely independent of work items protected by another mutex. The cases in which all mutexes in a program protect the same region of code are rare.
This observation suggests that we can partition the global set of work items W into subsets W_0, W_1, …, W_n, such that the number of ordering relations p < q with p ∈ W_i and q ∈ W_j (for i ≠ j) is relatively small. Each subset can then be assigned its own scheduler, applying the same partition principle locally.
Consider a simple case in which there is a single cross-subset dependency p < q, where p∈W_0 and q ∈ W_1, and no dependency in the opposite direction. Under this assumption, the scheduler for W_0 can operate independently: it can schedule all work items in W_0 as soon as their local predecessors complete. The scheduler for W_1, however, must respect the fact that q cannot execute before p.
One practical strategy is to defer the creation (or, equivalently, the enqueueing) of q until p completes. In this way, the cross-subset dependency is handled at the boundary: when p finishes, it produces or enqueues q into the scheduler associated with W_1. As a consequence, all work items that (directly or indirectly) depend on q are created or enqueued only after p completes. With this discipline, the scheduler associated with W_1 does not need to inspect the internal state of W_0; it only reacts to the arrival of new work.
In practice, serialisers and similar abstractions often come with their own scheduler implementations. The user coordinates between multiple serialisers by expressing cross-dependencies through enqueueing at well-defined points (for example, when a predecessor completes). This allows each serialiser to maintain a simple and efficient local scheduling policy, while preserving the global strict partial ordering of work items.
Here is a simple example of communicating between two serialisers, each serialiser performing its own scheduling; the example shows three expressions in the program that make the connection between the serialisers:
p/>S0 // enqueue p onto S0
p_done ; (q/>S1) // when p completes, enqueue q
// onto S1
q // q eventually runs under S1
//(after q/>S1)
This realises the cross-dependency p < q via p < (q/>S1) < q, without requiring S1 to inspect S0.
Modularity, in this view, is achieved by aligning scheduling boundaries with structural boundaries in the dependency graph. As long as cross-subset ordering relations are handled explicitly at those boundaries, each scheduler can reason locally about its own subset of work items without requiring global knowledge of the entire program.
At a deeper level, this modular structure is justified by the exchange law. Exchange tells us that when two components evolve independently, global phase ordering refines independent local progress. This is exactly the situation in which encapsulation is sound: if cross-subset dependencies are handled explicitly at well-defined boundaries, then reasoning about each subset locally is sufficient to reason about the whole. The algebra guarantees that recombining the local progress preserves correctness.
Concurrency can be encapsulated.
Conclusions and practical recommendations
This article argued that concurrency has a simple underlying idea: a program induces a strict partial order between its work items. Hoare’s concurrency algebra offers an equivalent, calculational view of the same structure, and it connects naturally to other established formalisms such as Hoare logic and Milner’s CCS.
Once this viewpoint is adopted, many common concurrency constructs can be seen as elaborations that add or remove ordering constraints. The practical benefit is a shift in how we design and reason about concurrent software: instead of starting from mechanisms (threads, locks, callbacks), we start from the ordering constraints that must hold, and then choose an implementation strategy that enforces them.
Based on the core ideas in this article, here are a few practical recommendations:
- Start from relations, not mechanisms. Identify the work items in the problem and write down the ordering constraints that must hold between them.
- Make dependencies explicit. Prefer APIs and designs that surface the relevant a < b constraints instead of encoding them implicitly in ad hoc coordination.
- Prefer static structure when possible. When dependencies can be expressed directly in control flow (structured concurrency, task graphs), do so; it simplifies both reasoning and scheduling.
- Keep constraints local. Aim to express most constraints within a small scope (a module, a component, a serialiser/strand) rather than relying on global coordination.
- Handle cross-scope dependencies at boundaries. When non-local concurrency is required, make boundary handoffs explicit (e.g., enqueue-on-completion) so that each local scheduler can remain simple.
- Avoid manual synchronisation when you can. Synchronisation primitives (mutexes, semaphores, etc.) typically introduce extra work items (waiting), can harm both performance and locality, and increase the risk of bugs (race conditions, deadlocks, etc.).
Following these recommendations helps make concurrent programs as simple as possible (but no simpler).
Acknowledgements
This article is primarily shaped by the work of Tony Hoare and Leslie Lamport. Hoare’s laws of programming with concurrency and Lamport’s partial-order view of events provided much of the conceptual vocabulary I use here.
My own approach to concurrency has also been strongly influenced over the years by talks, writing, and conversations with Kevlin Henney, Sean Parent, Eric Niebler, Dave Abrahams, and Dimi Racordon. Even when their contributions are not cited directly, their way of framing problems towards composability, local reasoning, and practical correctness shows up throughout this article.
Tony Hoare
Tony Hoare died at the age of 92 on March 5 this year. He gave us quicksort, Hoare logic and much more besides. In an ACM blog, Bertrand Meyer said:
Style is the defining characteristic of Hoare’s work, not just writing style but scientific style, always elegant and focused on what truly matters.
The post (‘Tony Hoare and His Imprint on Computer Science’) is available at: https://cacm.acm.org/blogcacm/tony-hoare-and-his-imprint-on-computer-science/.
References
[Apple], Apple, DispatchQueue, https://developer.apple.com/documentation/dispatch/dispatchqueue, accessed Feb 2026.
[AsioStrands] Christopher M. Kohlhoff, Strands: Use Threads Without Explicit Locking, https://www.boost.org/doc/libs/latest/doc/html/boost_asio/overview/core/strands.html, accessed Feb 2026.
[Brooks95] Frederick P. Brooks Jr., The Mythical Man-Month (anniversary ed.), Addison-Wesley Longman Publishing, 1995.
[C++Ref] cppreference.com, std::memory_order, https://en.cppreference.com/w/cpp/atomic/memory_order.html, accessed Feb 2026.
[Hoare09] Tony Hoare, Bernhard Möller, Georg Struth, Ian Wehrman, ‘Concurrent Kleene algebra’, in: M. Bravetti, G. Zavattaro (Eds.), Concurrency Theory (CONCUR 2009), LNCS, vol. 5710, Springer, 2009.
[Hoare11] Tony Hoare, Bernhard Möller, Georg Struth, Ian Wehrman, ‘Concurrent Kleene algebra and its foundations’, The Journal of Logic and Algebraic Programming 80.6, 2011.
[Hoare13] Tony Hoare, ‘Verified Concurrent Programmes: Laws of Programming with Concurrency’, Microsoft Research Concurrency Workshop, 2013, https://www.youtube.com/watch?v=G83nBjXBQCo, accessed Feb 2026.
[Hoare15] Tony Hoare, ‘The Laws of Programming with Concurrency’, 2015, https://www.youtube.com/watch?v=9kKQ8uLK8mk, accessed Feb 2026.
[Intel] Intel, ‘Local Serializer’, https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021-12/local-serializer.html, accessed Feb 2026.
[Lamport78] Leslie Lamport, ‘Time, clocks, and the ordering of events in a distributed system’, Communications of ACM, July 1978, https://dl.acm.org/doi/pdf/10.1145/359545.359563.
[P3149R11] Ian Petersen, Jessica Wong, Dietmar Kühl, Ján Ondrušek, Kirk Shoop, Lee Howes, Lucian Radu Teodorescu, Ruslan Arutyunyan, ‘P3149R11: async_scope – Creating scopes for non-sequential concurrency’, https://wg21.link/P3149R11.
[Wikipedia-1] Wikipedia, Hoare logic, https://en.wikipedia.org/wiki/Hoare_logic, accessed Feb 2026.
[Wikipedia-2] Wikipedia, Calculus of communicating systems, https://en.wikipedia.org/wiki/Calculus_of_communicating_systems, accessed Feb 2026.
[Wikipedia-3] Wikipedia, Fork-join model, https://en.wikipedia.org/wiki/Fork-join_model, accessed Feb 2026.
[Wikipedia-4] Wikipedia, ‘Async/await’, https://en.wikipedia.org/wiki/Async/await, accessed Feb 2026.
Lucian Radu Teodorescu has a PhD in programming languages and is a Staff Engineer at Garmin. He likes challenges; and understanding the essence of things (if there is one) constitutes the biggest challenge of all. You can contact him at lucteo@lucteo.ro