Multithreading 101

By Tim Penhey

Overload, 14(72), April 2006

Multithreading is a huge topic, and this article really just gives an introduction to many of its terms and concepts. It is aimed at the threading newcomer and intends to highlight some of the problems associated with writing multithreaded applications.

Firstly I’ll cover the basics on threads – what they are, and how they relate to processes. Next there are some guidelines on where threads can be useful, and when not to use them. Then I’ll explain the “primitives”, the underlying building blocks provided by the operating system kernel that allow multithreaded programs to be written.

Following that I cover some concepts that do not at first seem intuitive: memory barriers – what they are and why they are needed; race conditions – with specific examples of deadlock and livelock; and priority inversion – where a low priority task gets more processor time than a high priority task. Lastly, a look at where the world of threading is headed.

Tasks, Processes and Threads

In the beginning there was synchronous execution. A list of commands was executed in order, one at a time. As time progressed, more and more was expected from an operating system. Multitasking allowed multiple programs to appear to be running at the same time, while in actual fact the CPU was still only capable of running one thing at a time.

In order to switch computation from one task to another there would be a context switch. The execution details of one task were saved out of the registers in the CPU, and details of another task were moved in, and execution of the new task would resume. Multitasking appeared in two forms: cooperative and pre-emptive. With cooperative multitasking, it is up to the application to relinquish control, whereas with pre-emptive multitasking the OS kernel will interrupt the task “mid flow” and switch it out. The primary problem with cooperative multitasking is that it only takes one badly behaved process to disrupt the entire system.

Processes can be thought of as tasks that have certain isolation guarantees with respect to memory and resources. Threads are created by a process – from its “main thread”. Created threads share memory and resources with the thread that created them.

Shared access to memory and resources combined with pre-emptive multitasking is where the complexity of multithreaded programming comes in.

When to Use Threads

Unfortunately there is no black and white answer to this question. There are several important questions that you should ask yourself before diving into multiple threads:

  • Are shared resources really needed?
  • Is the extra complexity worth the effort?

There are many cases where it seems desirable to have separate threads – particularly when there may be clean separation in processing. An important question here is “should they be separate threads or separate processes?” Do both tasks really need access to shared resources or is there just information being passed from one to the other? Two single threaded programs are often better than one multithreaded program due to the following points:

  • Single threaded code is easier to write – no resource synchronization needed
  • Single threaded programs are easier to debug
  • Single threaded programs are easier to maintain – no threading knowledge needed
  • Separate programs are easier to test – each can be tested in isolation

Threading adds complexity to often already complex situations. A good rule of thumb is “Is the complexity that is added through the use of threads significantly reducing the complexity of the problem?” If the answer is “NO” then you may want to strongly reconsider the use of threads. However there may be other overriding factors that push strongly towards a threaded solution:

  • GUI responsiveness
  • Complete processor utilisation
  • Network connectivity

GUI Responsiveness

By their nature GUI programs are event driven. For much of the time, the program is waiting for the user to click or type something. GUI programs are normally written with a main event loop, where the application needs to respond to “desktop events” such as repaint or move as well as user events. When an application fails to respond to these events in a timely manner, the operating system often marks the process as non-responsive and the application can appear frozen.

There are often situations in writing an application where long running functions are needed (for some arbitrary definition of long) – for example: reading or writing files, running a spell check, or encoding some data. In order to have a responsive GUI for the application it is not desirable to have these functions running in the same thread as the main event processing loop.

Complete Processor Utilisation

Some classes of problems lend themselves to being easily broken into independent parts. Some classes of problems need to be solved as soon as possible. Some classes of problems are purely calculation intensive, where the work is not waiting on results or information from other sources. Problems in the intersection of these three sets are candidates for a solution where all the processors in a machine can be fully utilised. In situations where separate processes are not viable, having a number of threads do the processing will provide a time benefit. One of the tricks is not to have more processing threads than effective CPUs (effective, as a single hyperthreading CPU can appear as two). The reason for this is that if all the threads are CPU bound (meaning the only thing slowing the calculation of the results is the speed of the processor), then ideally there should be one thread assigned to each effective CPU. If there are more calculation threads than CPUs then more context switching occurs as the active threads compete with each other for the processors. Every context switch has an overhead, which slows down the generation of results even more.
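As a sketch of this one-thread-per-effective-CPU approach: modern C++ (C++11 onwards, which postdates this article) exposes the effective CPU count through std::thread::hardware_concurrency. Here a vector is summed with one worker per effective CPU; the function name and structure are illustrative only:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` using one worker thread per effective CPU.
std::int64_t parallel_sum(const std::vector<int>& data) {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;                        // the call may return 0 if unknown
    std::vector<std::int64_t> partial(n, 0);  // one result slot per worker
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + n - 1) / n;
    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([&, i] {
            std::size_t begin = i * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t j = begin; j < end; ++j)
                partial[i] += data[j];        // each thread writes only its own slot
        });
    }
    for (auto& t : workers) t.join();         // wait for all workers to finish
    return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```

Because each worker writes only to its own slot in `partial`, no mutual exclusion is needed while the calculation runs; the results are combined only after every thread has been joined.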

Threads used in this way are often referred to as “worker threads”. A point to consider when dealing with worker threads is when to create them. There are always overheads involved when creating and destroying threads. The amount of overhead varies from platform to platform. It is a common idiom to use a collection of threads that are started prior to any work being available for them. These threads then wait for work to do. The collection of threads is often referred to as a “thread pool” or “work crew”. Since the thread generating the work does not pass it on to a specific thread, there needs to be some abstraction between the threads generating the work and the threads processing the work. This is often implemented using a queue. The work generating threads put work on the queue, and the workers take work from the queue. In situations such as this there are normally two types of work: those that have feedback – results passed back to the work generators; and those that do not – for example the worker threads save results directly to a database. How the feedback from the worker threads is handled depends greatly on the problem – suffice to say there are many ways to do it.

Network Connectivity

Whether on the client or the server side, threads are often used when communicating over a network. On the client side threads are used when requesting a response that may take the server some time to respond to – especially if the client side application has more to do than just wait for the response such as responding to GUI events.

Threads are more common on the server side where the server may be listening on a socket. When a client connects a socket descriptor is generated that represents the connection between the client and the server. The socket descriptor can then be used independently of waiting for more connections. In order for the server application to remain responsive to client connections, the actual work with the socket descriptors is normally handled by a different thread than the thread where the program is accepting connections.

Thread pools are frequently used to handle server side execution where the threads wait for the primary thread to accept a client connection. The socket descriptor is then passed to an idle worker thread in the thread pool. This thread then handles the conversation with the client and performs the necessary actions to produce results for the client. Once the conversation with the client is finished, the worker thread releases the socket descriptor and goes back into a waiting state.


The Primitives

Providing an API to start, stop and wait for threads is really just the tip of a much bigger iceberg of necessary functions. The simplest of these functions are atomic operations. These are functions, provided by the OS kernel, that are guaranteed to complete before the task is switched out.

As well as atomic additions and subtractions, there are also functions that perform more than one “action” such as “decrement and test” and “compare and swap”.

  • Decrement and test subtracts one from the atomic variable and returns true if the atomic variable is now zero.

  • Compare and swap takes three parameters, the atomic variable, the expected current value, and the new value. If the atomic variable is the expected value then it is updated to the new value and returns true. If the atomic variable is not the expected value, then the expected current value is updated to what the atomic variable currently holds and the function returns false.

These atomic operations are then used to build more complex primitives.
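In modern C++ these two operations map directly onto std::atomic – fetch_sub for “decrement and test” and compare_exchange_strong for “compare and swap”. A sketch (the wrapper function names are my own):

```cpp
#include <atomic>

// "Decrement and test": subtract one and report whether the value hit zero.
bool decrement_and_test(std::atomic<int>& counter) {
    // fetch_sub returns the value held *before* the subtraction
    return counter.fetch_sub(1) == 1;
}

// "Compare and swap": if `value` holds `expected`, store `desired` and
// return true; otherwise copy the current value into `expected` and
// return false -- exactly the behaviour described above.
bool compare_and_swap(std::atomic<int>& value, int& expected, int desired) {
    return value.compare_exchange_strong(expected, desired);
}
```

The updating of `expected` on failure is what makes compare-and-swap loops possible: the caller learns what the variable currently holds and can retry with a recalculated new value.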

In order to provide some isolation guarantees for multiple threads within one process, the operating system provides a way to define mutual exclusion blocks. The primitive that provides the interface to these blocks is often referred to as a mutex (mut-ual ex-clusion). The key to how the mutex works is in the acquiring and releasing. When a mutex is acquired, it is effectively marked as being owned by the thread that acquired it. If another thread attempts to acquire the mutex while it is still owned by a different thread, the acquiring thread blocks waiting for the mutex to be released. When the mutex is released it is marked as “unowned”. In the case where there is another thread waiting, it is effectively woken up and given ownership of the mutex. Some care needs to be taken acquiring mutex ownership as some implementations will not check for self ownership before waiting, so it is possible for a single thread to wait for itself to relinquish ownership – not a situation you really want to be in.

Any non-trivial data structure uses multiple member variables to define the structure’s state. When the data structure is being modified it is possible that the thread that is updating the data structure could be switched out. Another thread that wants to access the shared data structure could be switched in. Without the use of mutual exclusion blocks it is possible that the data structure’s state is inconsistent. Once this happens the program is often in the world of undefined behaviour.

Consider the following example:

There is a container of Foo objects called foo. 
A mutex to protect foo access called m 

Thread 1: 
    acquire mutex m 
    bar = reference to a value in foo 
    release mutex m 
    use bar... 

Thread 2: 
    acquire mutex m 
    modify foo 
    release mutex m 

What is protecting bar ? It is possible that thread 2 may modify the value of bar while thread 1 is trying to use it. Protecting memory access between threads is almost always more complex than it first looks. One approach is to put mutexes around everything, however this does little more than serialise the application and prohibits the concurrency that is intended through the use of threads. Multiple threads with extraneous mutexes often perform worse than a single threaded approach due to the additional overhead of context switches.
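One common remedy is for thread 1 to take a copy of the value while it still holds the mutex, so that no reference into the shared container escapes the critical section. A sketch using std::mutex; the Foo type and function names are illustrative:

```cpp
#include <mutex>
#include <vector>

struct Foo { int value; };

std::vector<Foo> foo;   // the shared container
std::mutex m;           // protects all access to foo

// Thread 1: copy the element while holding the mutex, then work on the
// copy -- no reference into foo escapes the critical section.
Foo read_first() {
    std::lock_guard<std::mutex> lock(m);   // acquires m; released on return
    return foo.front();                    // the copy is made under the lock
}

// Thread 2: modify foo under the same mutex.
void update_first(int v) {
    std::lock_guard<std::mutex> lock(m);
    foo.front().value = v;
}
```

Copying costs something, of course; whether a copy, a reference-counted handle, or a longer-held lock is appropriate depends on the problem.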

The next two primitives are more for synchronization than mutual exclusion. They are events and semaphores. Events and semaphores can be used to initiate actions on threads in response to things happening on a different thread of execution.

Unfortunately there is no agreed standard on what an event is, and the two most common platforms (Win32 and POSIX) handle things quite differently. Without going into too much detail, you can have threads waiting on an event. These threads block until the event occurs. When the event is signalled (from a non-blocked thread, obviously) one or more threads (implementation dependent) will wake up and continue executing.

An example of using an event would be a client-server system where the client is able to cancel long running calls. One way of doing this is to have the thread that is communicating with the client not actually do the work, but instead get another thread to do the work and be told when the work is complete. The way in which the servicing thread is informed of complete work is through the use of an event.

Thread 1: 
	gets client request 
	passes on to worker thread 
	waits on event 

Worker thread: 
	gets work for client 
	processes for a long time 
	signals event when complete 

Thread 1: 
	wakes on event signal 
	returns result to client 

What advantage does this give over just one thread? Implementing it this way allows the possibility of having either time-outs for long-running work or client-driven cancellation. For client-driven cancellation it may work something like this:

Thread 1: 
    gets client request 
    passes on to worker thread 
    waits on event 

Worker thread: 
    gets work for client 
    processes for a long time 

Thread 2: 
    client cancels original request 
    sets cancelled flag 
    signals event 

Thread 1: 
    wakes on event signal 
    returns cancelled request 

What happens to the worker thread depends on the problem. The worker thread may be allowed to run to completion and discard its results, or it may be possible to interrupt the worker thread prior to completion.

For time-outs, instead of a client connection initiating the cancellation, a “watchdog” thread observes the running work and cancels any work that runs overtime.
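POSIX and Win32 events differ, but the cancellation pattern above can be sketched portably with a std::condition_variable (C++11) and a couple of flags. The names are illustrative; the predicate form of wait used here also guards against spurious wake-ups:

```cpp
#include <condition_variable>
#include <mutex>
#include <string>

std::mutex m;
std::condition_variable event;   // plays the role of the "event"
bool done = false;               // set by the worker thread
bool cancelled = false;          // set by the cancelling thread

// Worker thread: finishes the work, then signals the event.
void worker_finished() {
    { std::lock_guard<std::mutex> lock(m); done = true; }
    event.notify_all();
}

// Thread 2: the client cancels -- set the flag, then signal the event.
void cancel_request() {
    { std::lock_guard<std::mutex> lock(m); cancelled = true; }
    event.notify_all();
}

// Thread 1: wait on the event, then report what woke us.
std::string wait_for_result() {
    std::unique_lock<std::mutex> lock(m);
    event.wait(lock, [] { return done || cancelled; });
    return cancelled ? "cancelled" : "completed";
}
```

Note that the flags are written under the same mutex that the waiting thread holds while checking them; signalling alone is not enough, because a signal delivered while no thread is waiting would otherwise be lost.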

A semaphore can be thought of as a counter that can be waited on. A semaphore is initialised with a non-negative number. When a thread wants to acquire a semaphore, this value is checked. If the value is positive, then the value is decremented and execution continues. If the value is zero, then the thread blocks. The semaphore value is incremented by a non-blocked thread calling the “signal” or “release” semaphore function. If there is a waiting thread, it will awaken, decrement the value and continue executing.
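The counter behaviour just described can itself be sketched with a mutex and a condition variable. This is an illustration of the semantics, not how a kernel actually implements the primitive:

```cpp
#include <condition_variable>
#include <mutex>

// A semaphore following the description above: initialised with a
// non-negative count, acquire() blocks while the count is zero, and
// release() increments the count, waking one waiter if there is one.
class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) {}
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;                        // positive count: take one and continue
    }
    bool try_acquire() {                 // non-blocking variant
        std::lock_guard<std::mutex> lock(m_);
        if (count_ == 0) return false;
        --count_;
        return true;
    }
    void release() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();                // wake one waiting thread, if any
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};
```

C++20 later added std::counting_semaphore to the standard library, which provides exactly this interface.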

Semaphores are often used where there are multiple copies of a resource to be managed – for example, a database connection pool. Instead of initiating a connection every time the database needs to be accessed, a number of connections are created “ahead of time” and await use. One way of managing this pool of connections is to create a semaphore with an initial count of the number of available connections. When a thread requests a connection it does so by acquiring the semaphore. If the value of the semaphore is non-zero the semaphore is decremented, which means there is a database connection available. When the thread is finished with the connection, the connection is returned to the pool and the semaphore is incremented. If the semaphore is zero when a connection is requested, the thread blocks until the semaphore is incremented. When a connection is returned to the pool and the semaphore incremented, a thread waiting on the semaphore will awaken – decrementing the semaphore in the process – allowing it to get the connection.

Memory Barriers

Advances in processor efficiency have led to many tricks and optimisations that the processor can do in order to speed up memory access. Grouping memory reads and memory writes together, along with block reads and writes, are some of the tricks done to make access faster. The reordering of memory reads and writes is a special case of the more general optimisation referred to as “out-of-order execution”. The commands are grouped in such a way that any grouping is transparent to a single thread of execution. The problem occurs when more than one thread shares the memory that is being read or written in this optimised manner.

Suppose we have two threads as illustrated by this pseudo code:

x = 0, y = 0 

Thread 1: 
	while x == 0 loop 
	print y 

Thread 2: 
	y = 42 
	x = 1 

It might seem reasonable to expect the print statement from Thread 1 to show the value 42, however it is possible that the processor may store the value for x before y, and hence the printed value may show 0.

A memory barrier is a general term that is used to refer to processor instructions that inhibit the “optimal” reordering of memory access. A memory barrier is used to make sure that all memory loads and saves before the barrier happen before any loads and saves after the barrier. The actual behaviour of memory barriers varies from platform to platform.
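In post-C++11 code, memory barriers are usually expressed through atomic operations and memory orderings rather than raw fence instructions. The x/y example above can be made well-defined like this (a sketch):

```cpp
#include <atomic>

std::atomic<int> x{0};
int y = 0;   // plain data, published to other threads via x

// Thread 2: the release store on x guarantees that the write to y is
// visible to any thread that later sees x == 1 with an acquire load.
void publisher() {
    y = 42;
    x.store(1, std::memory_order_release);   // barrier: y = 42 happens-before this
}

// Thread 1: the acquire load pairs with the release store above.
int consumer() {
    while (x.load(std::memory_order_acquire) == 0) { /* spin */ }
    return y;                                // guaranteed to read 42
}
```

The release/acquire pair is exactly the "loads and saves before the barrier happen before loads and saves after it" rule, applied to the two sides of the handshake.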

Many synchronization primitives have implicit memory barriers associated with them. For example it is usual for a mutex to have a memory barrier defined at the point of acquiring and at the point of releasing.

As well as out-of-order execution by the processor, there are also issues with compiler optimisations. What a compiler may do depends on the memory model of the language being used. The more thread-aware languages provide keywords that stop the compiler making certain assumptions, for example the volatile keyword in Java.

Race Conditions

A race condition is a catch-all term for situations where the timing of the code execution can impact the result. The definition provided by the Free On-Line Dictionary Of Computing [1] is “Anomalous behavior due to unexpected critical dependence on the relative timing of events”.

Many of the well known problems associated with multithreaded code are types of race conditions. A large number of race conditions can be solved through the use of synchronization primitives. However overly liberal use of synchronization primitives will have a performance impact and can also bring a different type of problem into the picture.

Probably the best known result of a race condition is deadlock. A deadlock is where two or more threads are waiting for resources that the other waiting threads own – so each waiting thread waits forever.

There are four necessary conditions for deadlocks to occur. These were initially defined in a 1971 paper by Coffman, Elphick, and Shoshani [2].

  1. Tasks claim exclusive control of the resources they require (“mutual exclusion” condition).

  2. Tasks hold resources already allocated to them while waiting for additional resources (“wait for” condition).

  3. Resources cannot be forcibly removed from the tasks holding them until the resources are used to completion (“no pre-emption” condition).

  4. A circular chain of tasks exists, such that each task holds one or more resources that are being requested by the next task in the chain (“circular wait” condition).

Having these conditions in the code does not necessarily mean that deadlocks will occur, just that they could, and hence the relative timing of events in different threads can lead to this race condition happening.

There are several strategies for avoiding deadlocks, the simplest is to avoid the circular chain. This is normally achieved by specifying a particular order in which locks must be acquired if multiple locks are needed.
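Beyond a hand-enforced ordering, C++17’s std::scoped_lock (built on std::lock, which dates from C++11) acquires several mutexes using a deadlock-avoidance algorithm, so two threads taking the same locks in opposite orders cannot end up waiting on each other. A sketch; the account-transfer scenario is illustrative:

```cpp
#include <mutex>

std::mutex a;
std::mutex b;
int balance_a = 100;
int balance_b = 50;

// std::scoped_lock locks both mutexes atomically with respect to
// deadlock: even if another thread asks for (b, a), neither thread can
// end up holding one lock while waiting forever for the other.
void transfer(int amount) {
    std::scoped_lock lock(a, b);   // both locks, no fixed-order discipline needed
    balance_a -= amount;
    balance_b += amount;
}                                  // both mutexes released here
```

This breaks the circular-wait condition from the list above, which is sufficient to make deadlock impossible.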

A less well known problem is that of livelock. A livelock is where two or more threads continuously execute but make no progress. For example, two threads are both doing calculations while some value is true, but they are doing it in such a way that each invalidates the other’s test, causing both threads to continue forever.

A real-world example of livelock is two people walking towards each other in a corridor. Each person, trying to be polite, steps to the side – but they end up mirroring the other person’s movement, moving from side to side. Luckily we don’t normally end up stuck forever in this predicament.

Priority Inversion

Priority inversion is more of a problem for real-time systems, but it can occur on any system where threads or processes can have priorities assigned to them. This ends up being a scheduling problem and occurs when a low priority task has ownership of a resource that a high priority task needs.

In these situations the low priority task ends up taking precedence over the high priority task. This can be complicated even more by there being another task running with a priority between the high and low task. The medium priority task takes precedence over the low priority task which delays even longer the releasing of the resource that the high priority task needs.

In many cases, priority inversion passes without much notice, but it can cause serious problems, as in the case of the “Mars Pathfinder” [3], where the craft’s computer was resetting frequently.

Where To From Here?

We are now in the situation where multi-core processors are becoming commonplace. In order to write software that will make full use of new hardware, multithreading is going to be used more and more.

Languages that have defined memory models are considering them with respect to changes in hardware technology. Languages that don’t, like C++, are looking to define them. A particular interest is the possibility of adding language constructs that will allow the automatic use of threads to speed up loops.

Another area of research that is likely to see more interest is lock-free data structures and wait-free algorithms. Lock-free data structures can be written using the atomic operations supplied by the operating system. However, due to the inherent complexity of non-trivial data structures, they are hard to get right.

Wait-free algorithms are of particular interest to real time systems due to the fact that if a high priority task is never blocked, then priority inversion does not happen.


If you are looking at using multiple threads first stop and think. Are they really necessary? Is the added complexity from threading going to reduce the complexity of the problem? Would separate processes be better? If after careful consideration you decide that they are in fact going to help, then go forward with care. Protect your shared data structures with mutual exclusion blocks or cunningly-devised lock-free algorithms. Avoid circular dependencies on mutexes, and use the correct synchronisation methods for the problem.

Tim Penhey


