



### Disclaimer

- This is for (concurrency) experts only and some features are deliberately ugly to scare non-experts off.
- If you use anything presented here and cause damage, it's your own fault!
- This will present atomics, not lock-free algorithms.
- You will even see some assembler here.

# From the Standard (FCD)

- "This Clause describes components for fine-grained atomic access. This access is provided via operations on atomic objects.<sup>341</sup>"
- "<sup>341</sup>Atomic objects are neither active nor radioactive."

# From Hans' and Paul's C++0x FAQ

- When should I use low level atomic operations?
- You shouldn't unless both:
  - The alternatives lead to inadequate performance, and you've determined that high level atomics are the problem, and
  - you sufficiently understand the relevant memory ordering issues, something that is generally well beyond this FAQ.
- Generally low level atomic operations are intended for implementors of a few other performance critical libraries.
- We expect that in some cases coding standards will prohibit the use of low level atomic operations.

# What is "atomic"?

- From wiktionary:
- "Unable to be split or made any smaller."



# What is atomic? From wiktionary:

- "Said of an operation that is guaranteed to either complete fully, or not at all."
- So an atomic operation is a transaction.









## Analysis of Double-Checked Locking (DCL)

- Scott was essentially right.
- Third problem: Hardware Instruction Re-Ordering
- Atomics are here to solve them all!
- Really?

# **Compiler Optimization**

- From ISO/IEC JTC1 SC22 WG21 N2427, "C++ Atomic Types and Operations" by Hans Boehm and Lawrence Crowl:
- "We propose to add atomic types and operations purely as a library API. In practice, for C++, this API would have to be implemented largely with either compiler intrinsics or assembly code. (As such, this proposal should be implemented by compiler vendors, not library vendors, much as the exception facilities are implemented by compiler vendors.) For C, a compiler implementation is required for the type-generic macros."
- Atomics are here to inhibit unwanted compiler optimizations.









# **Memory Coherence**

- Data written to cache might not be seen by another core for some time.
- Writes from cache to memory are often in different order than writes from core to cache.
- Processor architectures provide specific mechanisms to control when data is seen by other devices (cores or DMA devices).

# Double-Checked Locking (DCL)

# What is "atomic"?

- The atomic operations in C++ are actually about control of ordering:
  - instruction reordering by compiler
  - instruction reordering by processor
  - memory ordering in common memory
- They are also atomic in the sense of "transactional".



# Atomic Operations

- Specification for fetch\_add:
- Effects:
- Atomically replaces the value pointed to by this with the result of addition applied to the value pointed to by this and the given operand.
- Memory is affected according to the value of order.
- These operations are atomic read-modify-write operations (1.10).



# memory\_order relaxed

- No memory ordering.
- "atomic" as "transactional"

### • Implementation on PowerPC:

| a.fetch_add(i, |          | <pre>memory_order_relaxed):</pre> |
|----------------|----------|-----------------------------------|
| 1:             | lwarx    | 80,0,82                           |
|                | add      | %0,%0,%3                          |
|                | stwcx    | 80,0,82                           |
|                | bne      | 1b                                |
|                |          |                                   |
| 80:            | tmp      |                                   |
| %2 <b>:</b>    | &a.value |                                   |
| 83:            | i        |                                   |
|                |          |                                   |

### lwarx and stwcx

- Load Word and Reserve Indexed
- Store Word Conditional Indexed
- From PowerPC manual:

"The concept behind the use of the **lwarx** and **stwcx** instructions is that a processor may **load a semaphore** from memory, compute a result based on the value of the semaphore, and conditionally store it back to the same location (only if that location has not been modified since it was first read), and determine if the store was successful.

# lwarx **and** stwcx

The conditional store is performed based upon the existence of a reservation established by the preceding **lwarx** instruction.

If the reservation exists when the store is executed, the store is performed and a bit is set in the CR. If the reservation does not exist when the store is executed, the target memory location is not modified and a bit is cleared in the CR.

27

### lwarx and stwcx

If the store was successful, the sequence of instructions from the read of the semaphore to the store that updated the semaphore appear to have been executed atomically (that is, no other processor or mechanism modified the semaphore location between the read and the update), thus providing the equivalent of a real atomic operation. However, in reality, other processors may have read from the location during this operation.

### lwarx and stwcx

In the MPC885, the reservations are made on behalf of aligned 16-byte sections of the memory address space. The **Iwarx** and **stwcx** instructions require the EA to be aligned. Exception handling software should not attempt to emulate a misaligned **Iwarx** or **stwcx** instruction, because there is no correct way to define the address associated with the reservation.

In general, the **lwarx** and **stwcx** instructions should be used only in system programs, which can be invoked by application programs as needed.

### lwarx and stwcx

At most, one reservation exists simultaneously on any processor. The address associated with the reservation can be changed by a subsequent **Iwarx** instruction. The conditional store is performed based upon the existence of a reservation established by the preceding **Iwarx**, regardless of whether the address generated by the **Iwarx** matches that generated by the **stwcx** instruction.

A reservation held by the processor is cleared by one of the following:

- Executing an **stwcx** instruction to any address
- Attempt by another device to modify a location in the reservation granularity (16 bytes)"

# memory\_order\_relaxed

```
// Initialization:
atomic<int> x(0), y(0);
int r1, r2;
// Thread 1:
r1 = y.load(memory_order_relaxed); (2)
x.store(r1, memory_order_relaxed); (3)
// Thread 2:
r2 = x.load(memory_order_relaxed); (4)
y.store(42, memory_order_relaxed); (1)
Result:r1 = r2 = 42.
```

### memory order consume

- "don't move any subsequent 'dependent' memory load before this operation"
- Some implementations implement this by just doing a memory\_order\_aquire.

## memory\_order acquire

- also RMB: read memory barrier
- "don't move any subsequent load before this barrier"
- or
- "invalidate any cache locations that are read after this"
- This should be done if you enter some synchronized section.

# memory\_order\_release also WMB: write memory barrier "don't move any previous store after this barrier" or "publish everything that's done until now" This should be done on leaving some synchronized section.

### memory order acq rel

- Full memory barrier: both, read and write memory barrier.
- "Don't move any memory access across this barrier."

### memory order seq cst

- Sequential consistency:
- "Don't move any instructions."
- This is about hardware instruction reordering. - and full memory ordering.
- This is pretty safe.
  - ... and it's the default.
- And as Bartosz Milewski says:
   "Any time you deviate from sequential consistency, you increase the complexity of the problem by orders of magnitude."

tp://bartoszmilewski.wordpress.com/2008/12/01/c-atomics-and-memory-ordering/









# **DCL** Performance

- From PowerPC manual:
- "The functions performed by the sync instruction normally take a significant amount of time to complete; as a result, frequent use of this instruction may adversely affect performance."

# Standard Locking Performance

- A good mutex (futex) implementation uses user-space atomic operations
- only for the non-contention case.
- Faster than safe version of DCL with atomics.
- Safer than fast version of DCL with atomics.

# When To Use atomics?

- When you're an expert.
- When you need lock-free behaviour.
- E.g. interrupt service routines.

# Atomic Operations

- struct atomic\_flag:
- bool test\_and\_set(memory\_order)
- sets the flag, returns old value
- void clear(memory\_order)
- clears the flag

### • no load()!









# atomic<T> Operations

- For any trivially copyable types.
- is\_lock\_free()
- atomic(T): non atomic
- store/load
- operator I(): atomic conversion (load)
- operator=(): just α seq\_cst store
- exchange, fcompare\_exchange

### Fences

- void atomic\_thread\_fence(memory\_order order);
- void atomic\_signal\_fence(memory\_order order);
- Effects:

equivalent to atomic\_thread\_fence (order), except that synchronizes with relationships are established only between a thread and a signal handler executed in the same thread.

# References

- Programming Languages C++, FCD
   http://www.open-std.org/itc1/sc22/wg21/docs/papers/2010/n3092.pdf
- ISO/IEC JTC1 SC22 WG21 N2427: "C++ Atomic Types and Operations" by Hans Boehm and Lawrence Crowl http://www.open-std.org/itc1/sc22/wg21/docs/papers/2007/n2427.html
- Paul E. McKenney: "Memory Ordering in Modern Microprocessors"
- http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
- Hans Boehm, Paul McKenney: "Programming with Threads: Questions Frequently Asked by C and C++ Programmers"

http://www.rdrop.com/users/paulmck/scalability/paper/c++0x\_user\_faq.html

51

### References

- Bartosz Milewski: "C++ atomics and memory ordering" http://bartoszmilewski.wordpress.com/2008/12/01/c-atomics-and-memory-ordering/
- "Double-Checked Locking is Broken Declaration" http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html
- Scott Meyers: "Double-Checked Locking, Threads, C++ Compiler Optimizations, and More"
- Presentation at ACCU Conference 2006 • Freescale: "MPC885 PowerQUICC™ Family Reference
- Preescale: "MPC885 PowerQUICC" Family Refere Manual