
5 Big Fat Reasons Why Mutexes Suck Big Time

Overload Journal #149 - February 2019 - Design of applications and programs. Author: Sergey Ignatchenko
Mutable shared state in multithreaded code is often protected by mutexes. Sergey Ignatchenko reminds us that Re-Actors can avoid many of the problems.

Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

Amicus Plato, sed magis amica veritas (Plato is my friend, but truth is a better friend)
~ Aristotle

Of course, I am perfectly aware that an all-powerful multithreaded inquisition will try to burn me at the stake for publishing this kind of heresy. Still, I do believe in the truth (at least as I understand it), so – trying to keep in line with Aristotle, mentioned in the epigraph – I have no choice other than to try to bring these things (that have been known for ages but conveniently forgotten way too often) to the forefront.

And yet ~~it moves~~ they suck!
~ ~~Galileo Galilei~~ ‘No Bugs’ Hare

On terminology

Just to be clear: in this article I am not attacking mutexes in a narrow sense, so don’t think that a critical section or an OpenMP lock is in any way better; rather, all techniques which provide mutual exclusion functionality are equally bad for the purposes of this article. These include, but are not limited to:

  • All kinds of mutexes (whether recursive or not, single-process or multi-process, etc. etc.)
  • Critical sections (including both Windows ones and OpenMP ones)
  • Semaphores used for mutual exclusion
  • Java synchronized objects.

A bit of history

…in the Beginning there was only The Darkest Darkness, and in the Darkness – Magnetics, which unwound the atoms; spinning, the atom struck the atom, and Pra-Current arose, and with it the First Light... and the stars lit up, the planets cooled, and mini-micro-Pramachines emerged, and from them Pramachines developed, and then, by the Providence of Holy Statistics – Protomachines. They still could not count, they only knew what is two plus two, and nothing more, but then, thanks to the Natural Evolution, somehow they got the hang of it, and the Multi-Stats and Omnistats originated from them, and then there was Pithecanthrobotus and from him our forefather, Automatus Sapiens…
~ Stanislaw Lem, The Cyberiad

In the very beginning, Stone Age programmers were programming directly on the hardware, without any intermediaries. However, by 1956, it was realized that app-level programs need some kind of a helper program between them and hardware, and the first operating system (intended for one single computer!) was written [Wikipedia]; it was the beginning of the Bronze Age of programming.

With the advent of a common OS for multiple computers (OS/360 in the mid-60s), computing progressed into the Iron Age. It was during the Iron Age that threads were first introduced (as ‘tasks’ in OS/360 in 1967 or so). Still, the dominant approach to programming at the time was via processes rather than via threads. From our current perspective, the main difference between processes and threads is that with processes we don’t normally have shared memory, and have to communicate using more structured IPC means.

By the mid-70s, computing got microprocessors (including my personal favorite, the i8080A) and reached Ancient Times. By the early 80s, microprocessors had become ubiquitous and had started to make inroads into our homes.

By the late 80s, POSIX threads had been developed [McCracken02], and by the early 90s, (originally thread-oriented) WinNT was released. However, threads were still seriously unpopular with developers; just as one example, Linux didn’t get even a half-decent implementation of POSIX threads until LinuxThreads in 1996 (and a really good implementation wasn’t there until 2003, after the CPUs hit the gigahertz wall – more on this below).

Hardware-wise, in the 90s, there was an explosive growth in the performance of microprocessors; in 1989, the i486 was running at 25MHz and by 2000, Willamette P4 was running at 1.5GHz – this is a whopping 60x increase in frequency in about ten years (and more like a 100x overall performance increase if we count in other optimizations). Some dinosaur hares like me can even remember an interpretation of Moore’s Law saying that not only the number of transistors but also CPU frequencies will double every two years; in 2000, Intel promised 10GHz CPUs by 2005 [Shimpi00].

However, after approximately 2002, all this explosive growth in CPU frequencies (and in per-core performance) abruptly stopped; frequencies even dropped (after the hugely inefficient NetBurst, 4GHz wasn’t reached again for ~10 years), and per-core performance improvements slowed down very significantly [Edwards12].

As a result of this sudden stop in GHz growth, Intel gave up on GHz-based marketing and switched to marketing based on the number of cores. But there was an obstacle: the lack of programs able to use multi-core processors. So Intel started to promote multi-core programs. There is nothing wrong with multi-coring per se, but apparently the most efficient tactic for going this way turned out to be to say: “Hey! You can simply take your existing program and make a few minor changes (often even called ‘annotations’): just insert this magical mutex (critical section, etc.) wherever you have a data race – and your program will be able to utilize all our cores!”

I remember attending an Intel workshop on OpenMP during one of the ACCU conferences in the early 2000s… Marketing-wise, it was probably a success, but OMG, technically it was a disaster. Just two observations from a workshop that was supposed to teach us proper multithreading. First, there was a multithreading bug even in the few hundred lines of code they gave us; and if the Intel folks teaching us multithreading cannot write 200 LoC without a bug, what can a Joe Average developer with expertise elsewhere hope for?! However, while the bug was pretty telling per se, IMNSHO a much more annoying occurrence was the following. When we ran their code, it was able to utilize all the cores (“hey, look at taskman – see, we ARE utilizing ALL four cores!”); however, while all four cores were working like crazy, the wall-clock time of the calculation actually increased after we introduced multithreading (!! – actually, this is quite common when MT is too fine-grained, more on this below). Of course, Intel’s job of selling us their CPUs is done as soon as 100% utilization is reached (!), but as a developer who cares about my customers and not about Intel’s profits, I have absolutely zero reason to use this kind of multithreading (and I contend that the whole premise of utilizing more cores is deadly wrong – the whole multi-coring thing is only about decreasing the response time to an incoming request, and there is absolutely no other reason to use more than one CPU core to perform a given task).

That’s where we stood during the 2000s – pretty much everybody was writing multithreaded mutex-synced programs, and “hey, our program is multithreaded” started to be used as a selling point(!); as a result, lots of devs with no idea about multithreading started to add threads just for the sake of it. These were the Dark Ages of the Multithreaded Inquisition in programming, which caused programs to crash often due to data races, and to perform poorly too. In my personal experience, I’d estimate that at least half of the crashes I personally ran into were due to some kind of multithreading race; this is not to mention that our esteemed translator has personally found MT bugs in such projects as OpenSSL, most C++ std:: implementations of the time [Ignatchenko98], and a WG21 working paper [NoBugs17]. Worse than that, one of those bugs was reported to manifest itself only once a month on a client site (!). As for performance, one of the faulty std:: implementations in [Ignatchenko98] exhibited up to a 100x performance degradation compared to a mutex-less fix – even in those use cases where it didn’t crash.

To be perfectly honest, even in the Dark Ages, there were people and whole technologies that avoided mutexes (such as Erlang), but sadly, such projects were very few and far between. Fortunately, starting around 2010, a more widespread understanding started to form that mutex-based programs suck, and more and more material started to emerge arguing that mutex-based sync must die – or at least that better alternatives are viable [NoBugs10] [TheGoBlog] [Henney16] [Fowler11b]. I would even say that by 2017, we had already passed the point of no return with regards to mutexes at the app level – and that mutexes will follow the lead of that once ubiquitous and now not really used operator, goto (sure, we can still use goto in quite a few languages – but we should NOT).

So, why do mutexes suck?

Now, after all the historical ranting, let’s start looking at those promised Big Five Reasons for mutexes being, ahem, sub-optimal for the vast majority of tasks out there.

Big fat reason #1: Error-prone beyond belief

The very first reason why mutex-based programs suck big time is that it is extremely easy to write a seemingly-working multithreaded program which actually contains a bad data race. Looking from 50,000 feet, there are two types of data-race-related bugs in mutex-based multithreaded programs: (a) a forgotten mutex lock, and (b) a deadlock. And while it is possible to write correct mutex-based programs (for (a) you just need to be ultra-hyper-careful, and for (b) meticulously following some kind of global order of acquiring locks will do the trick), in practice it happens to be an insurmountable task waaaay too often. If even such supposedly expert developers as those of Intel (teaching multithreaded programming, no less!), half a dozen different implementors of the std:: library, and several WG21 members cannot do it right even for a relatively small and static piece of code – what can the rest of us realistically hope for, especially when working on ever-changing app-level code?!

BTW, while it is theoretically possible to use tools to enforce safe mutex-locking programming practices, even the best thing I know in this regard (the Rust programming language) only solves the problem partially: even after jumping through all the Rust hoops (and there are lots of them ☹), and even if we manage to avoid any unsafe{} code, Rust can only guarantee safety against forgotten mutex locks; as for deadlocks, “Rust considers it ‘safe’ to… deadlock” [Rustonomicon], so deadlocks are still perfectly possible even within ‘safe’ Rust code ☹.

Big fat reason #2: Fundamentally untestable

The second Big Fat Problem™ with mutex-based programs is that these data races (both forgotten locks and deadlocks) are fundamentally untestable (well, at least on all existing hardware). Here is how it works in the real world:

  • You write your mutex-based program.
  • You test it, and it passes all of your unit tests.

    Some of the unit tests may occasionally fail – but when you re-run them, they’re fine, so you ignore such occasional failures (which is, BTW, a Mortgage-Crisis-Size Mistake™).

  • You deploy it to your servers.
  • You run it for months without a hitch.
  • You start saying, “Hey, see – it is easy to write perfectly correct mutex-based programs, I did it myself!”
  • Then your Big Day comes (Christmas sale, Tournament of the Year, Black Thursday, etc.) – and the bug which was sitting there all this time manifests itself, so your system crashes (and being on your Big Day, this crash comes at the worst possible moment too).
  • Bummer.

The root of this problem is that inter-thread interaction is inherently non-deterministic (at least with modern hardware). Each and every run of a multithreaded program is potentially substantially different from the previous one, with thread context switches – as well as inter-core interactions – happening at random times on each run. As a result, it is perfectly possible that the same program with the same input data which ran ok 99 times in a row will crash on the 100th run; even worse, crash probabilities are highly dependent on the environment (including, but not limited to, specific hardware, load at the time of the test, other programs running at the same time, and ~~Acrux rising~~ interrupts from your NIC coming at a specific moment).

This, in turn, means that any test we can possibly run is inherently meaningless ☹ ☹ (as Fowler wrote: “Non-deterministic tests have two problems, firstly they are useless…”) [Fowler11a]; in other words – regardless of the amount of testing, we cannot possibly say that we have tested all the possible data-race-related scenarios.

From this very observation, it follows that

We have to prove the correctness of our mutex-based programs.

However, as mentioned above, even after jumping through all the hoops of Rust, we cannot really prove the correctness of our mutex-based programs (as the potential for deadlocks is still there); in general, automated tools for such proofs are not really here (and given the rest of the problems with mutex-based stuff, chances are they will never be written).

In the real world, this problem happens to be soooo bad that once upon a time I was even involved in writing a tool which used fibers to simulate a preemptive scheduler to ensure that if a bug in a program under test is found during such testing, the bug is at least reproducible (also, it was possible to run an exhaustive test, though this was feasible only for very small programs).

Big fat reason #3: Mental overload → development slowdown

In addition to the first two Big Fat Problems discussed above, a third one becomes apparent – being ultra-careful and meticulously following prerequisites for writing correct code (exacerbated by the inherent inability to test our mutex-based programs) takes a toll on the mind of the app-level developer who tries to do it. The cognitive capabilities of our brains are actually very limited, so being occupied with mutexes and their ordering – things light years away from the app-level task we’re trying to solve – inevitably leaves far fewer brain resources available for the app-level task we’re really programming right now.

As discussed in [NoBugs15], thinking both about thread sync and about the business-level task at the same time quickly leads to an explosive growth in the number of entities the developer has to keep in mind while programming; this, in turn, leads to exceeding the cognitive capacity of our brains (violating the so-called 7 ± 2 rule, and leading to cognitive overload), which means that either development speed has to suffer a lot, or our program will have more bugs (including non-multithreaded ones!), or both. To make things even worse, maintenance of mutex-based code (and we do have to re-prove MT correctness after each and every potentially relevant change!) is not really feasible for anything more complicated than a ‘Hello, mutex!’ program.

Big fat reason #4: Poor performance

The fourth reason for mutexes sucking Really Badly™ follows from two observations: (a) whenever we’re trying to obtain a lock on an already-locked mutex, we’re about to cause an inter-thread context switch, and (b) the cost of an inter-thread context switch is huge.

Let’s discuss the cost of an inter-thread context switch. In [Li07], it was observed that as soon as we account for cache invalidation costs, the cost of an inter-thread context switch can easily reach a million CPU clock cycles; my own calculations show that on a modern Xeon-class CPU, cache invalidation can theoretically cost up to 30M CPU clock cycles. In practice, my personal observations are much milder, but even the 20–50K CPU clock cycles I observed are bad enough to want to avoid mutexes, at least in seriously contentious scenarios. BTW, IMNSHO my observations are consistent with the practice of spin-locks; just think of it: how bad must the cost of a thread context switch be if it is better to busy-wait for several thousand iterations (each taking at least 3 CPU cycles to read from L1) in the blind hope of avoiding that context switch? That is, if we’re lucky, with a spinlock we’ll incur only the penalty of the spinning and not of the context switch, but if we’re NOT lucky, we’ll incur both the cost of the spinning and the cost of the context switch; and even with all of this in mind, spinning still makes sense for quite a few use cases…

Yet another observation, which leads to about the same numbers, is that (as mentioned above) it is easy to utilize more cores while increasing the total time of solving a certain problem; this often manifests itself when our calculations are split into slices which are way too fine-grained (which causes context switches to kick in, eating up all the resources). As a Big Fat Rule of Thumb™, if a slice runs for ~100ms, we can realistically consider the cost of a thread context switch negligible (actually, 100–200ms is a typical time slice of a modern OS, and for this very reason too). On a modern CPU, 100ms corresponds to ~300M CPU cycles, so the decision of OS developers to use a time slice of 100–200ms to ensure that thread context switch costs are negligible looks rather consistent with our worst-case estimates of millions of CPU cycles per thread context switch.

There is one more way to think about it: if we take a look at a CPU, we’ll realize that it is inherently an event-driven system; all we have at the hardware level are cores and hardware interrupts routed to those cores, nothing more – and in this picture, each core is actually a (Re)Actor, with its state being the core registers plus associated memory, handling interrupts (including the timer interrupts which cause preemption) as its input events. From this point of view, the very concept of a thread is an artificial entity (~= ‘abstraction with a non-zero cost’) – and when we use threads, and especially inter-thread synchronization between those threads, the non-zero cost of this abstraction starts to hurt performance.

Big fat reason #5: Lack of scalability

By design, shared-memory approaches (which BTW include both mutexes and atomics) cannot be scaled beyond one single computer (well, strictly speaking, it is possible to simulate shared memory across boxes, but it is going to be excruciatingly slow [StackOverflow] – that’s like 1000x slower, which makes it impractical).

Ok, mutexes are bad – but is there anything better?

By now, I hope that I have made a case against mutexes (and, more generally, against shared-memory architectures). But this line of argument makes sense only if something better exists (otherwise we’re stuck with “Hey, it is the worst thing in the universe – except that it is the only one, so we have to use it anyway”).

Message-passing, including (Re)Actors a.k.a. Actors a.k.a. Reactors a.k.a. event-driven a.k.a. ad hoc FSMs

Fortunately, there is a well-known approach that solves all the issues raised above (though not without some cost): using message-passing shared-nothing architectures. They have been known for ages; in modern computing, the oldest living relative of the shared-nothing message-passing stuff is probably Erlang; however, recently many more technologies have emerged which operate more or less along the same lines (though they’re mostly event-driven, which is a subset of the more generic message-passing): Akka Actors, Python Twisted, Go (at least in its idiomatic form), and of course, Node.js; there is also an ongoing development in which I am involved too [Node.cpp].

Let’s take a look at the Five Big Deficiencies of mutexes we discussed above, and observe how shared-nothing message-passing and (Re)Actor-like programming models fare in this regard:

  1. MT-error prone. As message-passing and event-driven programs behave ‘as if’ they’re single-threaded, they’re naturally mostly free from shared-memory artifacts such as forgotten mutexes and deadlocks (well, in theory it is possible to have a deadlock between different message-passing programs, but during all my years with them I didn’t see one).

    More formally, while message-passing programs may still use some multithreading primitives deep inside (say, to implement inter-thread queues), these primitives are extremely small and never-changing, and therefore can be proven to be correct. And as soon as we’re inside a message-passing program which doesn’t use shared memory, all such programs can be proven just once to be MT-correct regardless of their specific logic.

  2. Untestability. Unlike mutex-based stuff, message-passing programs are testable; in other words, if we feed the same inputs to a message-passing program, it will produce the same results (=‘our tests become reproducible’).

    Even better, message-passing/(Re)Actor programs can be made deterministic in a practical sense, all the way up to recording-replay logic, which can be used in production (!!). While such things are known, I don’t know of a single generic framework providing this functionality (except for WIP [Node.cpp]).

  3. Complexity. With message-passing programs, there are no mutexes to worry about, which eliminates the related mental costs. In practice, the development of non-mutex programs is known to be significantly faster than that of mutex-based ones.

    As discussed in [DDMoGv2], while asynchronous processing has its own complexity costs, as soon as we’re speaking about the interaction between different requests, this complexity happens to be much lower than that of thread-synced approaches; this becomes even more true if we can use modern async-oriented constructs such as futures/promises and the await operator in C#, Node.js, and C++.

  4. Performance. No mutexes and blocking calls → no forced thread context switches → no millions of CPU cycles spent on those context switches. In practice, this means the best possible performance for our message-passing/(Re)Actor program (hey, there is a reason why nginx performs significantly better than Apache). One way to compare a mutex-based system with a message-passing one is an analogy with traffic: each mutex is a traffic light, slowing things down; message-passing has no mutexes/traffic lights, which lets traffic move as on a highway – much faster.
  5. Scalability. Message-passing programs can scale to multiple boxes easily; two common examples are Erlang and MPI, but I’ve seen many more real-world examples than that.

As we can see, none of the Five Big Fat Problems of mutex-based programs apply to message-passing and event-driven programs.

Conclusion

We discussed Five Big Fat Reasons why mutexes are bad… Moreover, we took a look at the alternative message-passing/event-driven programs and found that none of the Five Big Reasons apply.

Therefore:

If you care about any of the {correctness | MTBF | complexity | performance | scalability} of your programs, do you and your customers a favour and get rid of mutex-based shared-memory abominations.

Note that this does NOT exclude multi-coring as such – but moves thread sync out of sight of the app-level and very severely limits its nature, so it becomes manageable (and its correctness, provable).

Bibliography

[DDMoGv2] ‘No Bugs’ Hare, ‘Development and Deployment of Multiplayer Online Games’, vol. II

[Edwards12] Stephen A. Edwards (2012) ‘History of Processor Performance’ http://www.cs.columbia.edu/~sedwards/classes/2012/3827-spring/advanced-arch-2011.pdf

[Fowler11a] Martin Fowler (2011) ‘Eradicating Non-Determinism in Tests’, https://martinfowler.com/articles/nonDeterminism.html

[Fowler11b] Martin Fowler (2011) ‘The LMAX Architecture’, https://martinfowler.com/articles/lmax.html

[Henney16] Kevlin Henney, (2016) ‘Thinking Outside the Synchronisation Quadrant’, presented at code::dive 2016 and published 27 June 2017 at https://www.slideshare.net/Kevlin/thinking-outside-the-synchronisation-quadrant

[Ignatchenko98] Sergey Ignatchenko (1998) ‘STL Implementations and Thread Safety’, C++ Report, July/Aug 1998

[Li07] Chuanpeng Li, Chen Ding, Kai Shen, ‘Quantifying The Cost of Context Switch’, https://www.usenix.org/legacy/events/expcs07/papers/2-li.pdf

[Loganberry04] David ‘Loganberry’, Frithaes! - ‘An Introduction to Colloquial Lapine!’, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[McCracken02] Dave McCracken (2002) ‘POSIX Threads and the Linux Kernel’ in Proceedings of the Ottawa Linux Symposium, https://landley.net/kdocs/ols/2002/ols2002-pages-330-337.pdf

[NoBugs10] ‘No Bugs’ Hare, ‘Single-Threading: Back to the Future?’ in Overload 97, June 2010, https://accu.org/index.php/journals/1634

[NoBugs15] ‘No Bugs’ Hare, Multi-threading at Business-logic Level is Considered Harmful, Overload 128, Aug 2015, https://accu.org/index.php/journals/2134

[NoBugs17] ‘No Bugs’ Hare, ‘Eight Ways to Handle Non-Blocking Returns in Message-Passing Programs’, CPPCON17, http://ithare.com/eight-ways-to-handle-non-blocking-returns-in-message-passing-programs-with-script/

[Node.cpp] Node.cpp, https://github.com/node-dot-cpp

[Rustonomicon] The Rustonomicon, ‘What Unsafe Rust Can Do’, https://doc.rust-lang.org/beta/nomicon/what-unsafe-does.html

[Shimpi00] Anand Lal Shimpi (2000) ‘The future of Intel’s manufacturing processes’ on Anandtech, https://www.anandtech.com/show/680/6

[StackOverflow] Sharing memory across multiple computers?, Stack Overflow, https://stackoverflow.com/a/11903065/4947867

[TheGoBlog] The Go Blog ‘Share Memory By Communicating’, July 2010, https://blog.golang.org/share-memory-by-communicating

[Wikipedia] History of operating systems, Wikipedia, https://en.wikipedia.org/wiki/History_of_operating_systems

‘No Bugs’ Hare Translated from Lapine by Sergey Ignatchenko using the classic dictionary collated by Richard Adams.

Sergey Ignatchenko has 15+ years of industry experience, including being a co-architect of a stock exchange, and the sole architect of a game with 400K simultaneous players. He currently holds the position of Security Researcher.
