In an Atomic World

By Lucian Radu Teodorescu

Overload, 32(182):6-12, August 2024


Atomics form a relatively low level, but fundamental part of sharing data across threads. Lucian Radu Teodorescu reminds us what atomics are and how and when to use them.

We often discuss mutexes as the basic building blocks of concurrency. However, there are more fundamental concepts upon which concurrent programs and synchronization primitives are constructed. The C++ language defines a memory model, which describes how programs behave when multiple threads are involved. Additionally, C++ introduces atomic operations that serve as foundation for working with data across threads, ensuring both safety and performance. The goal of C++ atomics is to closely align with the hardware and eliminate the need for lower-level operations that must work across threads.

The topic of atomics is often overlooked, and the prevailing advice is to avoid them. While this advice is generally sound, there are occasions when we need to use atomics to fully leverage the language’s capabilities. This article aims to give atomics the attention they deserve, as they have yet to be featured in an Overload article.

The subject of atomics is extensive. For a comprehensive exploration, readers are encouraged to consult the books by Anthony Williams [Williams19] and Mara Bos [Bos23]. While the Bos book primarily focuses on Rust, there is still much for C++ programmers to learn from it about atomics. The reader can also consult cppreference.com for a quick reference to the atomics library [cppreference-1]. In this article, we will examine the various memory ordering models and illustrate their usage through simplified practical examples.

Introduction to atomics

Many C++ applications share data between threads. If multiple threads are accessing the same data at the same time, and one of them is modifying the data, we have a race condition. In C++, a race condition leads to undefined behaviour.

A common solution for avoiding race conditions is the use of mutexes. However, mutexes are sometimes too heavyweight for the data being shared. For example, if we have a boolean flag that indicates when we need to stop a process, creating a mutex for it would be overkill. Instead, one can directly use std::atomic<bool> or std::atomic_flag for this case. In general, if the data being shared is just a primitive type (booleans, chars, integers, pointers), using atomics makes more sense.

An atomic operation is an indivisible operation. This means that no thread can observe partial results of applying that operation. If one writes an integer into a memory location, each thread that reads the value will either find the initial value or the new value.

An atomic type is a type that guarantees that all the operations that can be performed on the values of the type are atomic operations.

The C++ standard defines a series of atomic types. First, there is atomic_flag, which is essentially a lock-free wrapper for atomic operations on a boolean flag. Then, we have the templated atomic<T>, which has specialisations for integer types, bool, character types, and pointers. Since C++20, we also have dedicated specialisations for shared_ptr and weak_ptr. In its general form, atomic<T> can be instantiated for any T that is trivially copyable. For more information, please see [cppreference-2].

For most of the article, we will focus on types like atomic<int> and atomic<bool>.

Let’s look at an example. Listing 1 shows a program that has two threads that communicate through an atomic object. The first thread waits for the atomic object to have the value true, and then prints something to the console. The second thread, after sleeping, sets the atomic object to the value true. If notified had not been an atomic object, this would have caused a data race, and thus, undefined behaviour.

std::atomic<bool> notified{false};
std::thread t1{[&notified] {
  while (!notified.load())
    std::this_thread::yield();
  std::print(
    "received signal from other thread");
}};
std::thread t2{[&notified] {
  std::this_thread::sleep_for(100ms);
  notified = true;
}};
t1.join();
t2.join();
Listing 1

Table 1 lists the main members of the atomic<T> class, where T is an integer type. All the operations are atomic, including those that contain multiple actions in their description.

Member | Description
is_lock_free | check if the atomic object is lock-free
operator= | stores a value in the atomic object
store | stores a value in the atomic object
load | obtains the value of the atomic object
operator T | obtains the value of the atomic object
exchange | replaces the value of the atomic object and returns the previous value
compare_exchange_weak, compare_exchange_strong | compare the value of the atomic object with a given value; if equal, perform an exchange, otherwise a load
wait | block the thread until notified and the object value changes (C++20)
notify_one | notify at least one thread waiting on the atomic object (C++20)
notify_all | notify all threads waiting on the atomic object (C++20)
fetch_add, fetch_sub, fetch_and, fetch_or, fetch_xor | perform addition/subtraction/bit operation between the atomic object and a given value, and then return the previous value
operator++, operator-- (pre/postfix) | increments/decrements the atomic object
Table 1

Depending on the type T given to an atomic<T> instantiation, and depending on the platform, the atomic may or may not be lock-free. To determine whether an atomic object is lock-free, users can call the is_lock_free function on the object (there is also a static constexpr variant to indicate if an atomic type is always lock-free). The standard guarantees that atomic_flag is always lock-free; all other atomic types may not be lock-free. That is, the library may use mutexes to implement these atomic types.

Most common platforms have atomic types for integers and pointers that are lock-free. For this article, we will assume this is the case.
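To check this on a concrete platform, one can query the static is_always_lock_free constant or call is_lock_free on an object. The sketch below (with made-up Pair and Big types) simply prints what a particular target offers:

#include <atomic>
#include <cstdio>

struct Pair { int a; int b; };     // trivially copyable, fits in 8 bytes
struct Big { int values[8]; };     // trivially copyable, 32 bytes

int main() {
  std::printf("atomic<int>: %d\n",
    (int)std::atomic<int>::is_always_lock_free);
  std::printf("atomic<Pair>: %d\n",
    (int)std::atomic<Pair>::is_always_lock_free);
  std::atomic<Big> big{};
  // Larger types typically fall back to a lock-based implementation.
  std::printf("atomic<Big>: %d\n", (int)big.is_lock_free());
}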

Atomics can be used as primitives to build all the other synchronisation primitives. We will show some examples in this article.

Thread views

Let’s consider a function in a single-threaded application. In a naive world, the compiler would emit instructions that map exactly to the written program. But that would be slow. To prevent that, the standard allows the compiler to transform the code in order to ‘optimise’ it, as long as the transformed code has the same observable behaviour as the original code. This is called the as-if rule [cppreference-3].

For example, if a program writes multiple times to a variable without reading it, the compiler is allowed to generate code that writes only once to that variable, the final value. Similarly, if the program reads the same memory more often than necessary, the compiler is allowed to remove some of the unnecessary reads. The compiler is also allowed to reorder instructions so that memory accesses follow patterns that lead to faster operation, or to reorder instructions to minimise the latency of some of the memory operations.
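As a small illustration (a hypothetical snippet, not taken from the article), all but the last of the following stores are dead, and the compiler may remove them under the as-if rule:

void redundant_stores(int& x) {
  x = 1;   // dead store: may be eliminated
  x = 2;   // dead store: may be eliminated
  x = 3;   // only this store needs to be observable
}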

After the compiler generates the binary code, the program can be further transformed. The CPU can reorder instructions when executing them. Again, execution reordering also happens under the as-if rule. Another type of code transformation is the effect that processor caches have on the execution of the code: not all the memory operations will reach the main memory in the order they were written.

We discuss all this to point out that there is a big difference between how we write the code, what’s executed on a CPU core, and how the memory is actually accessed.

Things get a bit more complex when we look at multi-threaded applications, but the same as-if rule applies, with some caveats. One of these caveats is that there is no single ‘observable behaviour’ of the program. Each thread will have its own observable behaviour, its own view of the program execution. Then, we have a few rules that say how these views interact with each other. Besides the rules that apply when starting and ending a thread, C++ describes most of the interaction between views with the use of the synchronises-with relation.

We say that A synchronises-with B if operation A is a release store whose result is read by operation B, an acquire load. We will cover the meaning of release store and acquire load below. When the two operations happen on different threads, everything that happened before the store in the first thread is guaranteed to be visible to everything that happens after the load in the second thread. This synchronises-with relation is a fundamental relation that allows us to connect the views of different threads.

We will not describe the complex terminology that C++ standard has on this matter, but the goal is to be able to say that operation A happens before operation B (A happens-before B), when A and B are operations that may be executed on different threads. This allows us to have a partial ordering of the operations on different threads, even if we have multiple views.

Once again, the reader should bear in mind that one thread may see operation A happening before operation B, while another thread may see B happening before A.

Let’s look at this graphically. In Figure 1, we have two threads (represented by two grey boxes), each containing two operations: the first thread has two stores for variables x and y, while the second thread loads the values of these variables from memory. Let’s assume that both variables are initialised with 0 before the two threads run. The figure indicates with a thick cross-thread arrow a synchronises-with relation between the store of the y variable and the load from the second thread; that is, for this particular execution, y.load() will obtain the value that was written by y.store(2). Using a sequentially-consistent ordering (more on this later), we make sure that the store to x happens before the store to y, and that the load from y happens before the load from x. All these ordering relations are seen by both threads and are represented by thick full arrows. Thus, in all the views, the load from x will see the value 1 as written from the first thread (assuming there is no other intermediate store to this variable by a different thread).

Figure 1

Figure 2 shows the same type of instructions, but with relaxed memory ordering (to be covered shortly). In this case, the relation between the operations on the same thread are enforced only in the local view of that thread; we denote this by using thinner, dotted arrows. For the first thread, the store to x happens before the store to y. But the second thread might see those writes out of order; it might see the store to y before the store to x. Thus, the second thread might load value 2 for y, but a value of x == 0.

Figure 2

This illustrates the idea that threads might have different views on the order of operations. One of the most frequent causes for this behaviour is the use of processor caches.

The difference between thread views can be seen by running the code in Listing 2. From the perspective of the first thread, we have either x == y or x == y+1; thus x >= y. This inequality doesn’t translate well into the view of the second thread. Here, we load y before x. Thus, xx should always be greater than or equal to yy, either because that was the state of x and y when the second thread started reading, or because x got the chance to have a newer (greater) value between the two loads. However, when we look at the result, we see that occasionally we have xx < yy; thus we get the counter incremented, and the program prints a non-zero mismatch count. A non-zero value printed out proves that the two threads might have different views. On my machine (MacBook Pro, Apple M2 Pro processor), for a given run, I’ve obtained 258 mismatches. (Note, a non-zero result is not guaranteed, but on many machines, it is a probable result.)

std::atomic<int> x = 0;
std::atomic<int> y = 0;
std::thread t1{[&] {
  for (int i = 0; i < 1000000; ++i) {
    x.store(i, std::memory_order_relaxed);
    y.store(i, std::memory_order_relaxed);
  }
}};
std::thread t2{[&] {
  int count_mismatch = 0;
  for (int i = 0; i < 1000000; ++i) {
    auto yy = y.load(std::memory_order_relaxed);
    auto xx = x.load(std::memory_order_relaxed);
    if (xx < yy) count_mismatch++;
  }
  printf("Mismatch count: %d\n", count_mismatch);
}};
t1.join();
t2.join();
Listing 2

Memory ordering

The way that the thread views are aligned for atomic operations depends on the memory ordering applied to the atomic operations. The C++ standard defines the following memory ordering values: relaxed, consume, acquire, release, acquire-release, and sequentially-consistent. We won’t discuss the consume memory ordering here; it is not well supported by the compilers, and the rules surrounding it are hard to reason about.

Each atomic operation uses one of these memory ordering values. The most relaxed, as the name suggests, is the relaxed mode; this mode doesn’t add any ordering guarantees. At the other end, the sequentially-consistent model, which is the default, ensures that all the threads have the same view of the ordering.

To discuss memory ordering, we will refer to how different instructions are translated to machine code. For this, we will focus on just two architectures commonly found in practice: x86-64 and ARM64. The x86-64 platform is a good example of a strongly-ordered architecture, while the ARM64 platform is a good example of a weakly-ordered architecture [Preshing12]. As we shall see below, strongly-ordered architectures have stronger guarantees than weakly-ordered architectures for regular operations.

Relaxed memory ordering

Atomic operations that use relaxed memory ordering will typically generate code as if no atomics were used. Table 2 shows the generated code for x86-64 and ARM64 for storing 0 in an integer in two different ways: using a plain int type and using atomic<int> with a relaxed store. On neither platform is there a difference between the two operations.

C++ code:
  void s1(int* x) {
    *x = 0;
  }
  void s2(atomic<int>& x) {
    x.store(0, memory_order_relaxed);
  }

x86-64:
  s1(int*):
    mov DWORD PTR [rdi], 0
    ret
  s2(atomic<int>&):
    mov DWORD PTR [rdi], 0
    ret

ARM64:
  s1(int*):
    str wzr, [x0]
    ret
  s2(atomic<int>&):
    str wzr, [x0]
    ret
Table 2

A similar thing happens for loads. Table 3 shows that there is no difference between the generated code for loading the value from an int* object or from an atomic<int> using relaxed memory order, on either x86-64 or ARM64. From this, we can conclude that relaxed atomic load/store operations behave like regular loads/stores.

C++ code:
  int ld1(int* x) {
    return *x;
  }
  int ld2(atomic<int>& x) {
    return x.load(memory_order_relaxed);
  }

x86-64:
  ld1(int*):
    mov eax, DWORD PTR [rdi]
    ret
  ld2(atomic<int>&):
    mov eax, DWORD PTR [rdi]
    ret

ARM64:
  ld1(int*):
    ldr w0, [x0]
    ret
  ld2(atomic<int>&):
    ldr w0, [x0]
    ret
Table 3

If there is no difference in the generated code between a relaxed atomic and the regular code, the reader might rightfully ask why we need relaxed atomics. The answer is twofold. First, relaxed atomics allow us to use the same variable from multiple threads without causing a race condition (and thus undefined behaviour). Second, the usage of atomics prevents the compiler from performing certain transformations: it cannot omit the memory loads and stores, and it cannot invent additional ones. For example, for regular integers, an operation like x = 1 can be transformed by the compiler into x = 0; x++; – such a transformation is forbidden when using atomics.

In terms of performance, relaxed atomics don’t incur additional costs compared to regular instructions. They do prevent certain compiler transformations, so using relaxed atomics, depending on their use, might be slower than not using atomics at all. In general, we should limit the usage of atomics to just essential variables needed for synchronisation between threads.

As mentioned above, using relaxed atomics won’t make the operations viewed by a thread consistent with what other threads view. See Figure 2 and Listing 2.

A typical usage for relaxed operations is incrementing counters, since most of the time we only require atomicity; note, however, that decrementing counters typically requires acquire-release synchronisation (as we will see in the reference-counting example below).
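A minimal sketch of such a counter (the names are illustrative) might look like the following; joining the threads is what guarantees that the final value is visible to the main thread:

std::atomic<int> events{0};
auto worker = [&events] {
  for (int i = 0; i < 100000; ++i)
    // Only atomicity is needed; no ordering with other data is required.
    events.fetch_add(1, std::memory_order_relaxed);
};
std::thread t1{worker};
std::thread t2{worker};
t1.join();
t2.join();
// Joining synchronises with the worker threads, so the count is complete here.
assert(events.load(std::memory_order_relaxed) == 200000);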

Acquire and release memory ordering

Acquire and release typically work together as they create synchronises-with relations. Acquire memory ordering only has meaning for load operations, while release memory ordering only applies to stores. For operations that perform both stores and loads (like compare_exchange_strong), one can use the std::memory_order_acq_rel memory order.

When one performs a store that has a release ordering, it forces all the previous operations to make their effect visible before the store is actually visible. The C++ standard states that no read or write operations before a release operation can be reordered to happen after the release store.

When one performs a load that has an acquire ordering, it forces all the subsequent memory loads to use data obtained after the acquire operation. The C++ standard states that no read or write operations after the acquire operation can be reordered to happen before the acquire load.

A pair of a release store and an acquire load can form a synchronises-with relation. Figure 3 shows this relation; it is similar to the examples shown in Figures 1 and 2. If we make the store to y use release memory ordering, and the load from y use acquire memory ordering, then the two threads form a synchronises-with relation when the second thread reads the value published by the first thread. That means we gain consistency between the views of the two threads. Even though the store to x is relaxed in the first thread, because the store to y has release memory ordering, the second thread must see its effect when acquiring the value of y; that is, the second thread will always see x stored before y. Thus, the four operations in Figure 3 are perfectly ordered, and both threads see the same thing.

Figure 3

To prove this, we can slightly adapt the code in Listing 2. For the line containing y.store, we use std::memory_order_release instead of relaxed ordering; similarly, we use y.load(std::memory_order_acquire) instead of the relaxed load. Making these two changes guarantees that the printed number of mismatches is zero. The reasoning is exactly the same as the one we had for Figure 3.
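Concretely, the two changed lines of Listing 2 would read:

// First thread: publish all previous stores (including the one to x).
y.store(i, std::memory_order_release);
// Second thread: acquire y before reading x.
auto yy = y.load(std::memory_order_acquire);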

Taking a lock has acquire semantics: no operations placed after the point where the lock is taken can be reordered so that other threads using the lock observe them before the lock was taken. Similarly, releasing a lock has release semantics: no operations placed before the release point can be reordered so that other threads using the lock observe them after the release point. This is demonstrated by the code in Listing 3. This code ensures that the modifications to the two counter variables are done in protected regions, without any race conditions.

std::atomic<bool> locked_flag = false;
int counter1 = 0;
int counter2 = 0;
auto f = [&] {
  for (int i = 0; i < 1000000; ++i) {
    // Simulate acquiring the lock.
    while (locked_flag.exchange(true, std::memory_order_acquire))
      ;
    // Protected operations.
    counter1++;
    counter2 = counter1;
    // Simulate releasing the lock.
    locked_flag.store(false, 
      std::memory_order_release);
  }
};
std::thread t1{f};
std::thread t2{f};
std::thread t3{f};
t1.join();
t2.join();
t3.join();
assert(counter1 == 3000000);
assert(counter2 == 3000000);
Listing 3

Let’s now look at the translation of acquire loads and release stores on the two major platforms. Table 4 shows how the translations look. Comparing this with the translation from Table 3, we conclude that on x86-64 there is no difference between relaxed loads and acquire loads, and between relaxed stores and release stores; the same code is generated. However, on ARM64, the astute reader will spot the differences. A relaxed store is translated into str (store register), while a release store is translated into an stlr instruction (store-release register); similarly, a relaxed load is translated into ldr (load register), while an acquire load is translated into ldar (load acquire register).

C++ code:
  void store3(atomic<int>& x) {
    x.store(0, memory_order_release);
  }
  int ld3(atomic<int>& x) {
    return x.load(memory_order_acquire);
  }

x86-64:
  store3(atomic<int>&):
    mov DWORD PTR [rdi], 0
    ret
  ld3(atomic<int>&):
    mov eax, DWORD PTR [rdi]
    ret

ARM64:
  store3(atomic<int>&):
    stlr wzr, [x0]
    ret
  ld3(atomic<int>&):
    ldar w0, [x0]
    ret
Table 4

On weakly-ordered platforms like ARM64, release/acquire operations are more expensive than relaxed atomic operations. On strongly-ordered platforms like x86-64, there is no difference between relaxed atomics and release/acquire atomics. One can say that release/acquire operations are as cheap as relaxed memory operations, but also that relaxed memory operations are as expensive as release/acquire operations.

Sequentially consistent memory ordering

There is another ordering that provides more guarantees than acquire/release memory ordering, but can also be more expensive: sequentially-consistent memory ordering. In C++, this is represented by the std::memory_order_seq_cst enum value, and it is the default memory ordering for atomic operations.

Table 5 shows the generated instructions for the two selected platforms. Here, it’s interesting to notice that on ARM64, the same instructions are generated as for acquire/release operations. However, there are differences on x86-64 platforms; while the load operation generates the same instruction, the store translates into different code. For this, the xchg (exchange register/memory with register) instruction is used; this is typically used for swaps and performs a store and a load operation. This indicates that a sequentially-consistent store on x86-64 is more expensive.

C++ code:
  void store4(atomic<int>& x) {
    x.store(0, memory_order_seq_cst);
  }
  int ld4(atomic<int>& x) {
    return x.load(memory_order_seq_cst);
  }

x86-64:
  store4(std::atomic<int>&):
    xor eax, eax
    xchg eax, DWORD PTR [rdi]
    ret
  ld4(std::atomic<int>&):
    mov eax, DWORD PTR [rdi]
    ret

ARM64:
  store4(std::atomic<int>&):
    stlr wzr, [x0]
    ret
  ld4(std::atomic<int>&):
    ldar w0, [x0]
    ret
Table 5

Sequentially-consistent memory ordering provides more guarantees than acquire/release semantics. In particular, it guarantees a single total modification order of all the operations tagged with this ordering. By contrast, acquire/release semantics only add constraints between pairs of threads, and are not concerned with a global ordering.

To spot the difference, let’s look at the code in Listing 4 (assuming each function is called in parallel on a different thread). The read_x_then_y function waits until x becomes true, and then loads y; if y is true, it increments z. The read_y_then_x function does the same thing but swaps x and y. These functions only read the atomic values, so with acquire/release semantics we can’t make them agree on the order in which x and y become true. In the acquire/release world, there is no relation that synchronises their views of the variables. Thus, if this code were to use acquire/release semantics, after running all the functions in parallel, we might end up with a zero value of z. That is, read_x_then_y would see x==true and y==false, while read_y_then_x would see y==true and x==false.

atomic<bool> x = {false};
atomic<bool> y = {false};
atomic<int> z = {0};
void write_x() {
  x.store(true);
}
void write_y() {
  y.store(true);
}
void read_x_then_y() {
  while (!x.load());
  if (y.load()) ++z;
}
void read_y_then_x() {
  while (!y.load());
  if (x.load()) ++z;
}
Listing 4

This cannot happen with sequentially-consistent memory ordering (the memory ordering used in the example); both read functions would see the same order in which x and y become true. To repeat: in the acquire/release model, we only synchronise the view of the thread that does the release with the view of the thread that does the acquire. In the sequentially-consistent model, we synchronise all thread views.
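For completeness, a driver for Listing 4 might look like the sketch below (the main function is an assumption, not part of the article’s listing); with the default sequentially-consistent ordering, the assertion can never fire:

int main() {
  std::thread a{write_x};
  std::thread b{write_y};
  std::thread c{read_x_then_y};
  std::thread d{read_y_then_x};
  a.join(); b.join(); c.join(); d.join();
  // At least one of the readers must see both flags set.
  assert(z.load() != 0);
}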

Some techniques and real-life examples

Mutexes and busy loops

We’ve seen in Listing 3 a sketch of building a mutex on top of atomic operations. If there is no contention on the lock, this implementation might be fast and appropriate. But, as soon as one thread needs to wait, we have a performance problem; the thread will spin in a tight loop and consume all its CPU quota.

Possible alternatives here include introducing loop instructions that will briefly pause the thread. Processor yields, thread scheduler yields, or even sleeps are good strategies for pausing the threads (each having different pausing times and characteristics).
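For instance, the busy loop from Listing 3 could yield to the thread scheduler while waiting (a small variation, not shown in the article):

// Spin, but give the scheduler a chance to run other threads.
while (locked_flag.exchange(true, std::memory_order_acquire))
  std::this_thread::yield();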

Luckily for us, C++20 provides a portable way to do this waiting on spin-loops. Listing 5 shows how the code needs to be modified to incorporate this into our spin-mutex implementation. In this code, the wait call waits until locked_flag is not true anymore; this will use the best strategy known for the detected processor/OS type (often a combination of the above). The notify_one call will ensure that the wait call wakes up on platforms where wait sleeps in kernel space. See [Doumler20] for a discussion on building spin-mutexes and [Giroux] for a possible implementation of wait. Also see [Pikus22] for a quick discussion on building spin mutexes.

// Acquire the lock.
while (locked_flag.exchange(true,
    std::memory_order_acquire))
  locked_flag.wait(true,
    std::memory_order_relaxed); // NEW
...
// Release the lock.
locked_flag.store(false,
    std::memory_order_release);
locked_flag.notify_one(); // NEW
Listing 5
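Putting Listings 3 and 5 together, a complete spin-mutex might be sketched as below; the class name and the std::lock_guard usage are illustrative, not taken from the article:

class spin_mutex {
  std::atomic<bool> locked_{false};
public:
  void lock() {
    // Acquire: operations in the critical section cannot move before this.
    while (locked_.exchange(true, std::memory_order_acquire))
      locked_.wait(true, std::memory_order_relaxed);
  }
  void unlock() {
    // Release: operations in the critical section are published here.
    locked_.store(false, std::memory_order_release);
    locked_.notify_one();
  }
};

// Usage: the class works with the standard lock helpers.
spin_mutex m;
void increment(int& counter) {
  std::lock_guard<spin_mutex> lock{m};
  ++counter;
}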

Reference counting

Listing 6 shows how a reference counting mechanism can be implemented, similar to what we find in std::shared_ptr. As long as we have a valid reference to the object (at least one inc call without a matching dec), the object is kept alive; as soon as we drop all the references to this ref-counted object, the inner object is destructed too. Here, the assumption is that the first inc() call, the one that brings the count from zero to one, can only happen on a single thread; this typically happens in the constructor.

template <typename T> class ref_counted {
  atomic<int> count_;
  unique_ptr<T> data_;
public:
  void inc() { count_.fetch_add(1,
    memory_order_relaxed); }
  void dec() {
    if (count_.fetch_sub(1,
        memory_order_acq_rel) == 1)
      data_.reset();
  }
  // ...
};
Listing 6

When calling inc, we don’t care about synchronising thread views. There are no loads and stores that depend on the ordering of this (assuming that dec is properly ordered after inc, which is ensured by other mechanisms). This means that a relaxed memory order is enough for what we need.

When calling dec, things are slightly different because we are also touching data_. In the thread that calls dec, all the memory accesses to data_ made before the call to dec must not be moved after it; this implies release semantics. Also, the code that deletes data_ when the counter drops to zero (and thus accesses its contents) must not be reordered before the fetch_sub call on the atomic; this implies acquire semantics.

In this example, we’ve shown the use of memory_order_acq_rel, which is commonly used with atomic operations that perform both reads and writes. Prime examples of such operations are fetch_add and fetch_sub, also shown in this example.

CAS loops and a simple stack example

The C++ standard provides a few examples of atomic operations that do both reads and writes. However, it can’t provide all the operations that a user might need. There needs to be a technique for users to implement atomic operations out of simpler operations.

Such a technique is the compare-and-swap (CAS) loop [Wikipedia]. In C++ this can be achieved with the compare_exchange_weak and compare_exchange_strong methods of atomic<T>; these methods change the atomic object to a desired value if it currently holds an expected value. To make this work, these calls are typically put inside a loop.

The signature of these methods, in their most generic form, is bool compare_exchange_strong(T& expected, T desired, memory_order success, memory_order failure) noexcept; (the same for the _weak version). This compares the value in the atomic object to expected; if they are equal, it stores desired in the atomic object and returns true. If they differ, it updates expected to the current value of the atomic object and returns false. Updating expected on failure is very useful in CAS loops, as most of the time we need the current value of the atomic object before retrying. The success memory order is the one applied to the read-modify-write operation if the operation succeeds; the failure memory order is the one applied to the load operation in the case of a failure.

Unlike the _strong version, the _weak version is allowed to fail spuriously; that is, it can return false even if the atomic object has a value equal to expected. On some architectures, the only way to implement the _strong version is to have inner loops; but, as we often put these constructs inside other loops, we can directly utilise the _weak version. Thus, the rule of thumb is that, if the compare-and-swap is put in a loop, one should use the _weak version, and if not, one should use the _strong version.
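To illustrate building a new atomic operation out of compare_exchange_weak, here is a hypothetical atomic-maximum helper (not from the article); the chosen memory orderings are one reasonable option:

// Atomically update 'target' to the maximum of its current value and
// 'value'; returns the value observed before the update.
int fetch_max(std::atomic<int>& target, int value) {
  int current = target.load(std::memory_order_relaxed);
  // Retry until we either observe a larger value or manage to store ours.
  while (current < value &&
         !target.compare_exchange_weak(current, value,
            std::memory_order_release, std::memory_order_relaxed))
    ; // on failure, 'current' was refreshed with the latest value
  return current;
}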

Listing 7 presents an example of implementing the operation of pushing into a stack. After creating the new node, we need to chain it to the head of our singly-linked list. This requires two changes: the next_ pointer of the new node needs to point to the current head, and the head needs to point to the new node. These two changes need to happen atomically. To make this happen, we try to set the current list head as the element following the new node, and then atomically compare-and-swap the head node. If the compare_exchange_weak operation succeeds, we still have the same head_ value, and the connection is made correctly; if the operation fails, then we will load the new head, putting it directly into new_node->next_ (as a side effect of the compare_exchange_weak instruction) and try again. The chances of converging the loop in a small number of iterations are very high even in the case of contention.

template <typename T> struct node {
  T data_;
  node* next_;
  node(const T& data) : data_(data), next_(nullptr) {}
};
template <typename T> class stack {
  atomic<node<T>*> head_;
public:
  void push(const T& data) {
    node<T>* new_node = new node<T>(data);
    new_node->next_ = 
      head_.load(memory_order_relaxed);
    while (!head_.compare_exchange_weak(
      new_node->next_, new_node,
      memory_order_release, 
      memory_order_relaxed));
  }
};
Listing 7

The reader might remark that next_ is not an atomic type. It doesn’t need to be, as we always publish it through the head_ pointer. Because the store to head_ has release semantics, all the stores before it (including the one to new_node->next_) are made visible to any thread that synchronises with this thread through an acquire load of head_ (not shown in our example).

We don’t have any subsequent load operations that need to be synchronised with the exchange of the head pointer, so we don’t need acquire semantics here; we can safely use relaxed memory order when loading the current value of head_.

We should reiterate a key point of this algorithm: new_node->next_ is updated when the compare_exchange_weak operation fails. This prepares the new node to be added again at the head of the list, re-establishing the precondition needed for changing the head of the stack. Without it, the body of the while loop would need extra operations to reload the head.

Some discussions

Atomics are very powerful. They allow programmers to exploit low-level primitives for the platform in order to generate fast concurrent code. One can build all concurrency primitives on top of the atomics library provided by the C++ standard, and they would probably be very fast. Atomics lack deep integration with the OS thread scheduler, but other than that, they have the potential to express most concurrency primitives in the most efficient way.

However, with great power comes great responsibility and a higher chance of shooting yourself in the foot. If C++, in general, is considered a language in which one can easily shoot oneself in the foot, using atomics is like juggling hand grenades. In this section, we cover a few reasons why we should avoid using atomics in most of the programs we write.

Atomics break local reasoning

Atomic primitives are like goto instructions. They are fundamental at a lower level (e.g., generated machine code), but should be avoided in high-level programming. One of the reasons for this is that they both break local reasoning.

Local reasoning, a fundamental idea in structured programming, is the ability to look at a code fragment in isolation and determine its meaning, without knowing what the rest of the program does.

One key aspect of reasoning about atomics is understanding how different thread views are correlated. That is, to reason about a code fragment that runs on a thread, we need to understand how the other threads are viewing different operations. Local reasoning is possible only when thread views are either equivalent or they don’t intersect at all.

Using release/acquire semantics, one needs to reason about which stores are “published” on a release operation, and which loads are guaranteed to be ordered on an acquire operation. Oftentimes, this reasoning expands beyond the local code that uses atomics. A good example of this is the primitives used to build a spin mutex (see Listing 3). The acquire semantics that apply to the step of acquiring a lock need to apply to the protected region; similarly, the effects of the protected region need to be “published” by the release semantics of the unlocking part.

As it’s easy to misuse atomics, and as bugs are typically hard to catch (the code can run in production for years before the bugs manifest), reasoning about the code that uses atomics is of utmost importance. And, at this point, we don’t seem to have good tools to help us deal with atomics.

Release/acquire is better than sequentially consistent semantics

The C++ standard takes the position that the default memory ordering is sequentially consistent, even if it is not needed most of the time. It is the ordering that provides the most guarantees, thus it’s safer. While this is true, it may encourage less reasoning, which may lead to bugs.

As argued above, reasoning about atomics is crucial. The reasoning process needs to include the guarantees that an operation needs in order to ensure correctness. But, if we fully reason about the use of atomics, then it makes little sense to use a suboptimal setting when it’s clear that it’s suboptimal.

Using sequentially-consistent atomics is often a sign that the reasoning about the atomics was not carried through to the end. Thus, similar to the position of Mara Bos [Bos23], we recommend using the appropriate memory ordering, instead of sticking to the default of sequentially consistent. And, of course, documenting the rationale helps.

Document the choice of memory ordering

The reasoning performed when choosing the memory model should not be easily lost. Thus, we recommend adding comments to document such reasoning. For example, after an acquire load, document what other loads are dependent on this load. Also, for a release store, document which other stores need to be published. If using sequential consistency, document the cases in which release/acquire semantics are not enough.
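A hypothetical fragment of such documentation might look like this, keeping the reasoning next to the operations it justifies:

// Release: publishes the fully initialised 'config' object built above;
// readers that acquire-load 'config_ready' may safely dereference 'config'.
config_ready.store(true, std::memory_order_release);

// Acquire: everything stored before the matching release (the contents of
// 'config') is guaranteed to be visible after this load returns true.
if (config_ready.load(std::memory_order_acquire))
  use(*config);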

As using atomics often implies non-local reasoning, documenting expected behaviour, especially in relation to non-local items, is important for later readings of the code.

Prefer using higher-level concurrency abstractions

Using mutexes is generally easier than using atomics for most use cases. Using barriers and latches can be easier than using mutexes and condition variables. Although atomics can be more efficient, not all code needs to be optimised to the maximum. We should measure the performance of a concurrent system before deciding to fully optimise it. Thus, for most projects, using higher-level concurrency primitives is the right thing to do.

Continuing on this line of thought, we should use concurrency primitives that invite users to express high-level concurrency constraints, such as dependencies between two or more work chunks (see [Teodorescu24]). In C++ one can use the senders/receivers framework [P2300R10] (maybe with extra utilities on top of it)1.

Using fast low-level concurrency is typically a worse strategy than using appropriate high-level concurrency primitives for many applications. This is true both in terms of maintainability and performance.

Atomics are powerful primitives, but perhaps too powerful for most applications. Maybe raw power is not always the answer. Having good strategies often produces better results.

References

[Bos23] Mara Bos, Rust Atomics and Locks: Low-Level Concurrency in Practice, O’Reilly Media, 2023, https://www.oreilly.com/library/view/rust-atomics-and/9781098119430/.

[cppreference-1] cppreference, ‘Concurrency support library’, https://en.cppreference.com/w/cpp/thread, accessed June 2024.

[cppreference-2] cppreference, std::atomic, https://en.cppreference.com/w/cpp/atomic/atomic , accessed June 2024.

[cppreference-3] cppreference, ‘The as-if rule’, https://en.cppreference.com/w/cpp/language/as_if, accessed June 2024.

[Doumler20] Timur Doumler, ‘Using locks in real-time audio processing, safely’, posted on timur.audio on 14 April 2020, https://timur.audio/using-locks-in-real-time-audio-processing-safely.

[Giroux] Olivier Giroux, ‘Sample implementation of C++20 synchronization facilities’ on GitHub, https://github.com/ogiroux/atomic_wait/, accessed June 2024.

[P2300R10] Michał Dominiak, et al, P2300R10: std::execution, ISO/IEC JTC1/SC22/WG21, https://wg21.link/P2300R10.

[Pikus22] Fedor Pikus, A Spinlock Implementation (Lightning Talk), CppCon 2022, https://www.youtube.com/watch?v=rmGJc9PXpuE.

[Preshing12] Jeff Preshing, ‘Weak vs. Strong Memory Models’, posted on Preshing on Programming (blog) on 30 Sept 2012, https://preshing.com/20120930/weak-vs-strong-memory-models/.

[Teodorescu24] Lucian Radu Teodorescu, ‘Concurrency: From Theory to Practice’, Overload 181, June 2024, https://accu.org/journals/overload/32/181/teodorescu/.

[Wikipedia] Wikipedia, ‘Compare-and-swap’, https://en.wikipedia.org/wiki/Compare-and-swap.

[Williams19] Anthony Williams, C++ concurrency in action (2nd edition), Manning, 2019.

Footnote

  1. If the reader hasn’t heard yet, it is my great pleasure to announce that the P2300 paper was voted, in the June 2024 plenary in St. Louis, to be included in the upcoming C++26 standard.

Lucian Radu Teodorescu has a PhD in programming languages and is a Staff Engineer at Garmin. He likes challenges; and understanding the essence of things (if there is one) constitutes the biggest challenge of all.





