Multi-threaded code promises potential speed-up. Sergey Ignatchenko considers how it often slows things down instead.
Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [ Loganberry04 ]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.
Assumption is a mother of all screw-ups
~ Honorable Mr. Eugene Lewis Fordsworthe
For quite a long time (since I first needed to deal with non-trivial multi-threading 15 years ago) I have known that mixing multi-threading (especially thread synchronization) with business logic is a Really Bad Idea, and have argued for avoiding it whenever possible (see, for example, [ NoBugs10 ]). This notion became so deeply ingrained in my mind that I erroneously started to assume that everybody else shares this knowledge (or belief, depending on which side of the argument you are on ☺).
As usually happens with most assumptions, Mother Nature has once again proved that I was wrong. Recently I wrote an article on networking for games [ NoBugs15 ], where I took ‘mixing multi-threading with business logic is a Bad Idea’ for granted; I’ve had feedback that this is unclear and needs explaining. Ok, here goes the explanation (an outline for this explanation has already been provided in [ Ignatchenko15 ], but this is a much more elaborate version with a few additional twists).
There are four Big Reasons for avoiding handling both business logic and non-trivial multi-threading within the same pieces of code. However, before going into reasons, we need to provide some definitions.
In this field, a lot depends on how trivial your multi-threading is. For example, if you have multi-threading where all the synchronization is performed on one single mutex, we can call it ‘trivial’ (and, as shown below, you’re likely to be able to get away with it). However, the window for triviality is very narrow: for example, even going from one mutex to two interrelated mutexes can easily make multi-threading non-trivial (and, as discussed below, has the potential to make your life a nightmare).
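To make the ‘trivial’ case a bit more concrete, here is a minimal Python sketch in which all synchronization goes through one single mutex. The names `Account`, `deposit` and `run_deposits` are made up for this illustration; treat it as a sketch of the idea rather than as a recommended design.

```python
import threading


class Account:
    """Shared state guarded by exactly one mutex (the 'trivial' case)."""

    def __init__(self):
        self._lock = threading.Lock()  # the one and only mutex
        self._balance = 0

    def deposit(self, amount):
        # The whole read-modify-write sequence runs under the single lock,
        # so no other thread can see or modify a half-finished update.
        with self._lock:
            self._balance += amount

    def balance(self):
        with self._lock:
            return self._balance


def run_deposits(n_threads=4, deposits_per_thread=1000):
    """Hammer one Account from several threads and return the final balance."""
    account = Account()

    def worker():
        for _ in range(deposits_per_thread):
            account.deposit(1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance()
```

With these assumptions, `run_deposits(4, 1000)` returns 4000 on every run, because the only shared state is touched only under the only lock. As soon as a second, interrelated mutex enters the picture (say, a transfer between two accounts, each with its own lock), you have to start reasoning about lock ordering, and the code stops being trivial.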
Another example is trivialized multi-threading (a close cousin of the trivial one); one good example of trivialized multi-threading is when all the inter-thread interactions are made only via queues. It doesn’t mean that implementing queues is trivial, but that from the point of view of the developer-who-writes-business-logic, he doesn’t need to care about queue implementation details. In other words, the problem is not about having multi-threading within your program, it is about mixing multi-threading synchronization with business logic in the same piece of code.
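As a rough sketch of what ‘trivialized’ looks like from the business-logic side, consider the Python fragment below. It assumes the standard `queue.Queue` as the ready-made inter-thread queue; `apply_order`, `worker_loop`, `process_orders` and the toy inventory are hypothetical names invented for this example.

```python
import queue
import threading

STOP = object()  # sentinel message telling the worker to shut down


def apply_order(inventory, order):
    """Business logic: plain single-threaded code, no locks in sight."""
    item, qty = order
    inventory[item] = inventory.get(item, 0) - qty


def worker_loop(inbox, inventory):
    """Infrastructure: pull messages off the queue and hand them to the logic."""
    while True:
        msg = inbox.get()  # blocks until a message is available
        if msg is STOP:
            break
        apply_order(inventory, msg)


def process_orders(orders):
    """Run the business logic in its own thread, fed only through a queue."""
    inbox = queue.Queue()
    inventory = {"apple": 10, "pear": 5}  # touched only by the worker thread
    t = threading.Thread(target=worker_loop, args=(inbox, inventory))
    t.start()
    for order in orders:
        inbox.put(order)
    inbox.put(STOP)
    t.join()  # after join() the worker no longer touches the inventory
    return inventory
```

The point of the split is that `apply_order` can be written, read and tested without knowing that threads exist at all; everything thread-related lives in `worker_loop` and inside the queue implementation.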
Now, we’re all set to start discussing why you shouldn’t intermix non-trivial multi-threading synchronization with business logic.
Reason 1: Cognitive limits of the human brain
In psychology, there is a well-known ‘7 ± 2’ cognitive limit [ Wikipedia ]. This means that the number of objects an average human can hold in working memory is 7 ± 2. When you go above this limit, you get a kind of ‘swapping’ (you ‘swap out’ some entities to free space in your working memory, only to ‘swap them back in’ when they’re needed). And from our programming experience, we all know what swapping does to performance (‘slowing down to a crawl’ being a very mild description). A similar thing happens when a human being goes beyond his cognitive capacity: the process of solving the problem becomes so slow that often the problem cannot be solved at all (unless it can be split into smaller problems, with each of these problems fitting into cognitive limits).
BTW, don’t think that as you are not an average person, you will be able to process 70 objects or entities instead of the average 7 – you won’t; the best you can realistically hope for is 10–15, and this difference won’t change our analysis. And even if you have on your team one person with an exceptionally high cognitive limit, you can be sure that it is extremely uncommon, which means that relying on her abilities to maintain your program is a Really Bad Idea. The simple question, “What are we going to do when she leaves?” is enough to bury the idea of relying on One Single Developer (however much of a genius she is).
So, how does this 7 ± 2 limit apply to combining business logic with multi-threading? The answer is simple: for real-world programs, each of these things is already complicated enough and usually is already pushing this “7 ± 2” limit. Combining them together will very likely take you over the limit, which will likely lead to the problem of ‘making the program work’ becoming unsolvable. Exceeding the limit becomes even more obvious when we observe that when adding multi-threading to business logic, we’re loading our brain not only with analysis of readily visible entities such as threads and mutexes, but also with less obvious questions such as “How will the existing business objects interact with this mutex? With these TWO mutexes?” This brings the number of entities even higher, which in turn makes the cognitive overload even worse.
For trivial (and trivialized) multi-threading, this effect, while present, can be seen as adding (very roughly) only one additional entity; while even one additional entity can also bring you over the cognitive limit, it is still much better than having dozens of additional entities in scope. Also, cognitive limits are not exactly hard limits as in “9 and you’re fine, 10 and you’re mine”, and while one extra entity over the limit would clearly mean reduced overall performance of the developer, it isn’t likely to cause a 100% drop in performance (so it shouldn’t go into the ‘problem never solved’ area). Therefore, given the very small typical numbers for cognitive limits, while adding even one entity will be noticeable (so is not desirable), it is not very likely to be fatal.
Reason 2: Non-determinism is bad enough, but inherently untestable programs are even worse
We don’t know what we have until we lose it
Non-trivial multi-threaded code usually has one property – it is inherently non-deterministic.
By the very definition of pre-emptive multi-threading, context switches happen not when you expect them, but between any two assembler-level instructions (yes, we’re not considering disabling interrupts within business logic). On one run of the program, a context switch may happen between lines A and B, and on the next run of the very same program, it may happen between lines B and C (on some runs it may happen even in the middle of a line of code, if it is compiled to more than one assembly instruction). It means that the multi-threaded program MAY become non-deterministic, i.e. it MAY behave differently from one run to another even if all the program inputs are exactly the same.
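To illustrate how the position of a context switch changes the outcome, here is a small Python model (not real threads). It assumes that each ‘thread’ is a generator and that every `yield` marks a point where a switch may happen; `incrementer` and `run_schedule` are hypothetical names used only for this sketch.

```python
def incrementer(shared):
    """One simulated thread performing a non-atomic 'counter += 1'."""
    local = shared["counter"]      # step 1: load the shared value
    yield                          # a context switch may happen here
    shared["counter"] = local + 1  # step 2: store the incremented value
    yield                          # end of this thread's work


def run_schedule(schedule):
    """Run two simulated threads, switching exactly as `schedule` dictates.

    `schedule` is a string such as "AABB": each letter says which thread
    executes its next step.
    """
    shared = {"counter": 0}
    threads = {"A": incrementer(shared), "B": incrementer(shared)}
    for name in schedule:
        next(threads[name])
    return shared["counter"]
```

With the very same ‘program’ and the very same inputs, `run_schedule("AABB")` yields 2 (the switch happens after A has finished), while `run_schedule("ABAB")` yields 1 (the switch lands between A's load and A's store, and one update is lost). In a real pre-emptive environment you don't get to choose the schedule string; the OS scheduler does.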
One may ask, “What is so bad about that?” Unfortunately, this potential non-determinism has several extremely unpleasant implications.
As you have no way to control context switches, you cannot really test your program.
Your multi-threaded program can pass all of your tests for years, and then, after you’ve changed a line in one place, a bug in a completely unrelated place (which has existed for all these years, but was hidden) starts to manifest itself. Why? Just because context switching patterns have shifted a bit, and instead of a context switch between lines A and B, you’ve got a context switch between lines B and C.
In [ Ignatchenko98 ] a multi-threading bug is described which manifested itself in a 20-line program written specifically to demonstrate it, and it took anywhere between 20 ms and 20 s (on the very same computer, just depending on the run) for the bug to manifest itself (!). On a larger scale, it was a bug in no less than the Microsoft C++ STL implementation shipped with MSVC (carrying a copyright by no less than P.J. Plauger), and while the bug had been sitting there for years and had manifested itself in real-world environments, the manifestation was usually like “our program hangs about once a month on a client machine with no apparent reason”, which is virtually impossible to debug. Only careful analysis of the STL code found the bug (and the analysis wasn’t related to any specific problem with any specific program; it was done out of curiosity).
Another example of untestability is as follows. Your program passes all the tests in your test environment, but when you deploy it to the client’s computer, it starts to fail. I’ve observed this pattern quite a few times, and can tell that it is extremely unpleasant for the team involved. The reason for failure is the same: context switch patterns have shifted a bit due to different hardware or due to different load patterns on the client’s machine.
Bottom line: you cannot rely on testing for multi-threaded programs. Bummer.
Non-determinism implies that on every program run you get different patterns.
This means that if you have a bug in your multi-threading code, you won’t really be able to jump to a certain point in the debugger and see what’s going on (nor will you be able to print what happens there, unless you’re printing everything in sight over the whole program, which by itself will shift patterns and may mask the bug). Ok, technically you are able to jump to any point of your program, but the variables you see may (and if you have a multi-threaded bug – will) differ every time you jump there.
This makes debugging multi-threaded issues a nightmare. When the bug manifests itself about every 50th run of the program, it is already bad enough for debugging, but when the pattern you see is a bit different every time it happens, your task of debugging the program can easily become hopeless.
Many of you will say “Hey, I’ve debugged multi-threaded programs, it works perfectly”. Indeed, much debugging works in a multi-threaded environment, and you can debug a multi-threaded program, you just cannot debug subtle multi-threaded issues within your non-trivial multi-threaded program.
To allow for multi-threaded debugging in one of many complicated multi-threaded projects, we went as far as creating our own fiber-based framework which simulated threads, with our own simulated scheduler and switching at the relevant points. Our simulated scheduler was run using a pseudo-random generator, so when seeding it with the same original seed, we got determinism back, and were able to debug the program. For us, it was the only way to debug that program (!). There are similar tools out there (just Google for “deterministic framework to debug multi threaded program”), and they might help, but while helpful for debugging small primitives, such methods are inherently very time-consuming and most likely will be infeasible for ongoing debugging of your business logic.
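The following is not our framework, only a toy illustration of the idea behind it. It assumes Python generators as a stand-in for fibers and `random.Random` as the pseudo-random generator driving the simulated scheduler; `increment_task` and `run_seeded` are hypothetical names.

```python
import random


def increment_task(shared, name, trace):
    """One simulated thread: a non-atomic 'counter += 1' in two steps."""
    local = shared["counter"]          # step 1: load
    trace.append(name + ":load")
    yield                              # the simulated scheduler may switch here
    shared["counter"] = local + 1      # step 2: store
    trace.append(name + ":store")


def run_seeded(seed, n_threads=3):
    """Run the simulated threads under a scheduler driven by a seeded PRNG."""
    rng = random.Random(seed)          # every scheduling decision comes from here
    shared = {"counter": 0}
    trace = []
    runnable = {}
    for i in range(n_threads):
        name = "T%d" % i
        runnable[name] = increment_task(shared, name, trace)
    while runnable:
        name = rng.choice(sorted(runnable))  # pick who runs the next step
        try:
            next(runnable[name])
        except StopIteration:
            del runnable[name]               # this simulated thread is done
    return shared["counter"], trace
```

Running `run_seeded` twice with the same seed produces the same final counter and the same trace of steps, so a failing run can be replayed as many times as needed; a different seed gives a different (but again repeatable) interleaving.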
Need for proofs of work (or exhaustive deterministic testing)
So, we’ve found (both from theory and illustrated by experience) that no kind of testing can serve as a reasonable assurance that your multi-threaded program will work, and that debugging is likely to be a real problem. Sounds Really Bad, doesn’t it? More importantly, can we do something about it?
In practice, I tend to provide proofs of work for any non-trivial multi-threaded code. I’ve found from experience that it is the only way to ensure that a multi-threaded program will work 100% of the time (as opposed to working 99.99% of the time, which means failing here and there), and will work everywhere.
For small pieces of code (20–50 lines) it is perfectly feasible. The level of formality you need for your proofs is up to you, but it is important at least to convince yourself and somebody else, that with any pattern of switches the piece of code in question will work as expected. One good example of code where more or less formal proofs are feasible (and necessary) is an implementation of the queue for inter-thread communications.
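For a feeling of the scale involved, here is a hedged sketch of such a queue in Python, small enough that an informal proof fits into a few comments. `TinyQueue` and `demo` are names invented for this example, and the code assumes the standard `threading.Condition`; it is not the queue from any of the projects mentioned in this article.

```python
import threading
from collections import deque


class TinyQueue:
    """Unbounded FIFO queue for inter-thread messages.

    Invariant: self._items is read or modified only while the lock
    inside self._cond is held.
    """

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()  # one lock plus wait/notify

    def push(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one waiting consumer, if any

    def pop(self):
        with self._cond:
            # A loop, not an 'if': after waking up, another consumer may
            # already have taken the item, so the check must be repeated.
            while not self._items:
                self._cond.wait()  # releases the lock while waiting
            return self._items.popleft()


def demo(n=100):
    """Pass n integers from a producer thread to the calling thread."""
    q = TinyQueue()

    def producer():
        for i in range(n):
            q.push(i)

    t = threading.Thread(target=producer)
    t.start()
    received = [q.pop() for _ in range(n)]
    t.join()
    return received
```

The informal argument goes roughly as follows: every access to the deque happens under the same lock, so no update can be seen half-done; `pop` returns only after it has observed a non-empty deque under that lock; and because `push` appends and notifies while holding the lock, a consumer either sees the new item on its check or is already waiting when the notification arrives.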
Of course, for thousands-of-lines business logic, such proofs are not feasible (that is, unless you trivialize the interaction of business logic with multi-threading).
An alternative to proofs of work is to use one of those deterministic testing frameworks mentioned above, and to perform exhaustive testing, testing the program behavior for all the possible (or at least relevant, though the notion of ‘relevant’ requires very careful consideration) context switches. Our own framework (the one mentioned above) did allow such testing, but times for such exhaustive testing were growing at least exponentially as the size of the program grew (more precisely, as the number of points of interest where a context switch might be relevant grew), so once again such exhaustive testing wasn’t feasible for programs of more than 20–50 lines of code.
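A toy model gives a feeling for both the approach and the growth. The sketch below assumes two simulated threads (Python generators, not real threads) and enumerates every possible interleaving of their steps; `all_schedules`, `failing_schedules` and `interleavings` are hypothetical names, and the counting formula applies to this two-thread model only.

```python
from itertools import combinations
from math import comb


def incrementer(shared):
    """One simulated thread performing a non-atomic 'counter += 1'."""
    local = shared["counter"]      # step 1: load
    yield                          # possible context switch
    shared["counter"] = local + 1  # step 2: store


def run_schedule(schedule):
    """Execute the two simulated threads in the order given by `schedule`."""
    shared = {"counter": 0}
    threads = {"A": incrementer(shared), "B": incrementer(shared)}
    for name in schedule:
        try:
            next(threads[name])
        except StopIteration:
            pass  # that simulated thread has just finished its last step
    return shared["counter"]


def all_schedules(steps_per_thread):
    """Yield every interleaving of two threads with the given step count."""
    total = 2 * steps_per_thread
    for a_slots in combinations(range(total), steps_per_thread):
        yield "".join("A" if i in a_slots else "B" for i in range(total))


def failing_schedules():
    """Exhaustively test the two-increment program; 2 is the correct result."""
    return [s for s in all_schedules(2) if run_schedule(s) != 2]


def interleavings(steps_per_thread):
    """Number of schedules to test for two threads: C(2n, n)."""
    return comb(2 * steps_per_thread, steps_per_thread)
```

Even for this two-line ‘program’, four of the six possible schedules lose an update. And the number of schedules explodes quickly: in this model, `interleavings(2)` is 6, `interleavings(10)` is 184756, and `interleavings(20)` is 137846528820.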
Reason 3: Code fragility
A logical consequence of untestability and the need for proofs of work is code fragility. If, whenever you need to change the program, you need to re-prove that it still works, this cannot be safely entwined with business logic (which, by definition, changes 5 times a day). If, whenever you’re changing something, you’re afraid that it might break something somewhere 50000 lines of code away, it won’t work either.
More formally, a non-trivial mixture of business logic with thread synchronization is inherently fragile. Any change in business logic is likely to affect non-trivial thread synchronization, which in turn is likely to lead to impossible-to-test and next-to-impossible-to-debug bugs.
Reason 4: Context switching granularity
To be efficient, multi-threading programs SHOULD make sure that they don’t cause too much context switching (i.e. multi-threading SHOULD be coarse-grained rather than fine-grained). The thing is that context switches are damn expensive (taking into account the cost of recovery from thread caches being flushed out by another thread, think of the order of 10,000 CPU clock ticks on x86/x64).
For example, if you want to move integer addition to another thread, you’re likely to spend 20,000 CPU clock ticks for 2 context switches (to another thread and back, with roughly half of the work being in your original thread), and to save 0.75 CPU clocks on offloading the addition. Of course, this is an extreme example, but way too often multi-threading is used without understanding the implications of the cost of the context switches.
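The arithmetic behind this example can be spelled out in a few lines of Python. The 10,000-tick figure is the order-of-magnitude estimate from the text, not a measurement; `offload_overhead` is a hypothetical helper written only for this back-of-the-envelope calculation.

```python
SWITCH_COST_TICKS = 10_000  # rough order-of-magnitude estimate quoted in the text


def offload_overhead(work_ticks, switches=2, switch_cost=SWITCH_COST_TICKS):
    """Return (overhead in ticks, overhead divided by the work offloaded)."""
    overhead = switches * switch_cost
    return overhead, overhead / work_ticks
```

For the integer addition, `offload_overhead(0.75)` gives 20,000 ticks of overhead, i.e. roughly 26,667 times the work being offloaded. For comparison, under the same assumptions a task of 2,000,000 ticks carries only about 1% overhead, which is the kind of granularity you want to be in.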
In this regard, separating business logic from threading helps to establish a well-defined interface which encourages (though doesn’t guarantee) coarse-grained granularity. For example, when having queues for inter-thread communications, it is usually easier to write a coarse-grained program, which is (as a rule of thumb; as with anything else, there are exceptions) a Good Thing.
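One way to see how a queue-based interface nudges you towards coarse granularity is to count the hand-offs. The sketch below assumes the standard `queue.Queue`, and uses the number of messages put into the queue as a crude proxy for the number of potential context switches; `batches` and `sum_squares_offloaded` are hypothetical names.

```python
import queue
import threading

STOP = object()  # sentinel message telling the worker to shut down


def batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def sum_squares_offloaded(numbers, batch_size):
    """Offload work in batches; return (result, number of queue hand-offs)."""
    inbox, outbox = queue.Queue(), queue.Queue()

    def worker():
        while True:
            batch = inbox.get()
            if batch is STOP:
                break
            outbox.put(sum(x * x for x in batch))  # one reply per batch

    t = threading.Thread(target=worker)
    t.start()
    handoffs = 0
    for batch in batches(numbers, batch_size):
        inbox.put(batch)
        handoffs += 1
    inbox.put(STOP)
    t.join()  # all replies are in outbox once the worker has exited
    total = 0
    while not outbox.empty():
        total += outbox.get()
    return total, handoffs
```

For 1000 numbers, a batch size of 1 means 1000 hand-offs, while a batch size of 250 means 4 hand-offs for exactly the same result. With a queue in the middle, the granularity becomes one explicit parameter instead of being scattered across the business logic.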
On the opposite side, code which intermixes business logic and thread synchronization tends to overlook the need to keep granularity in check; while in theory it is possible to handle it properly, in practice adding it to the equation is not feasible, not least because of adding yet another layer of entities, overloading (already overloaded) cognitive limits even further.
Hey, there are working multi-threaded programs out there!
One may say: “Hey, you’re saying that writing multi-threaded programs is impossible, but everybody and his dog is writing multi-threaded programs these days!”. You do have a point. However:
- Quite a few multi-threaded programs are using trivial multi-threading (for example, with a single mutex). Have you ever seen a multi-threaded program which is able to utilize only 1.2 cores? They’re likely using a single mutex. And BTW, I cannot blame them as long as they provide adequate overall performance: if one marketing guy has said “we need to write ‘support for multiple cores’ on our website, because all the competition does it”, a single mutex is one way to do what marketing wants without jeopardizing the whole project.
- Quite a few programs (think of video codecs) do really need to utilize multiple cores, but don’t really have much business logic (depends on how you define ‘business logic’, but at least it doesn’t change too often for codecs). They may get away with more or less complicated thread sync, but even for video codecs having per-frame (or per-large-part-of-frame) processing granularity (with clearly defined inter-thread interfaces such as queues) tends to work better than alternatives.
- Quite a few multi-threaded programs out there do have those difficult-to-find-and-debug bugs. This is especially true for those programs which don’t have a multi-million install base [ Wikipedia-2 ], but having a large install base certainly doesn’t guarantee that the program is multi-threaded-bug-free. I would guesstimate that for those programs which are released (i.e. out of the development shop), at least 50% of crashes are related to multi-threading.
- And finally, there are programs out there which do follow the principles outlined in the next section, ‘Divide and conquer’.
Divide and conquer
The first step in solving a problem is to recognize that it does exist
~ Zig Ziglar
Despite the ‘Divide and conquer’ concept (originally Latin Divide et impera) coming from politics, it is still useful in many fields related to engineering, and is usually a Good Thing to use in the context of programming (not to be confused with programming team management!).
Jokes aside, if we can separate business logic from non-trivial multi-threading (trivializing multi-threading interaction from the point of view of business logic), we will be able to escape from (or at least heavily mitigate) all the problems described in this article. The number of entities to fit into cognitive limits will come back to reasonable numbers, business logic will become deterministic again (and while multi-threading synchronization will still require proofs of work, they are feasible for small and almost-never-changing pieces of code), code will be decoupled and will become much less fragile, and coarse-grained granularity will be encouraged.
The only teensy-weensy question remaining is “how to do it”. There are several approaches to start answering this question, and I hope to describe one of them sooner rather than later. For now, we need to recognize that we do have a problem; solving it is the next step.
[Ignatchenko15] Sergey Ignatchenko, ‘Three Reasons to Avoid Intermixing Business Logic and Thread Synchronization’, http://java.dzone.com/articles/three-reasons-avoid
[Ignatchenko98] Sergey Ignatchenko, ‘STL Implementations and Thread Safety’, C++ Report, July/Aug 1998
[Loganberry04] David ‘Loganberry’, Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html
[NoBugs10] ‘No Bugs’ Hare, ‘Single-Threading: Back to the Future?’, http://accu.org/index.php/journals/1634
[NoBugs15] ‘No Bugs’ Hare, ‘64 Network DO’s and DON’Ts for Game Engines. Part IIIa: Server-Side (Store-Process-and-Forward Architecture)’, http://ithare.com/64-network-dos-and-donts-for-game-engines-part-iiia-server-side-store-process-and-forward-architecture/
Cartoon by Sergey Gordeev from Gordeev Animation Graphics, Prague.