Thread Pooling: An Investigation

By Antony Pinchbeck

Overload, 9(41), February 2001


When designing fast server applications, the choice between a single thread and multiple threads of execution must be weighed against the relative complexity of each implementation.

The single-threaded approach should not be dismissed without a thorough investigation. For a low-usage application that processes requests quickly it can provide an adequate solution, and by using asynchronous I/O, reasonable rates of data throughput can be achieved. The big advantage of this approach is that there are no shared-resource contention problems: no thread synchronisation, and no possible deadlocks to track down.
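
As an illustration (this sketch is mine, not the article's), a single-threaded server can multiplex every client on its one thread with Winsock's select(); error handling and the request processing itself are elided:

#include <winsock2.h>
#include <vector>

void RunSingleThreaded(SOCKET listener) {
  std::vector<SOCKET> clients;
  for(;;) {
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(listener, &readable);
    for(size_t i = 0; i < clients.size(); ++i)
      FD_SET(clients[i], &readable);
    // one blocking call multiplexes the listener
    // and every live client
    if(SOCKET_ERROR == ::select(0, &readable,
                                NULL, NULL, NULL))
      break;
    if(FD_ISSET(listener, &readable))
      clients.push_back(::accept(listener, NULL, NULL));
    for(size_t i = 0; i < clients.size(); ++i)
      if(FD_ISSET(clients[i], &readable)) {
        // read the request and reply; no other
        // thread can contend for shared state
      }
  }
}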

The problem here is scalability. In an ideal world, running the application on a dual-processor machine would let it handle twice as many requests. The single-threaded solution will not scale this way; its single thread will plod on regardless of the number of processors. After persuading Big Company Limited to buy the great new Mega Tsunami™ software, the last thing they will want is the quad-CPU server sitting at 10% utilisation.

In order to take advantage of the four CPUs at the application's disposal, multithreading is one solution. Another chance for the sales team as Mega Tsunami MT™ is released. There are three approaches that can be taken:

  • A thread per request.

  • A pool of worker threads.

  • An adaptive thread pool.

The Mega Tsunami MT™ application in this article is designed to listen for requests on a 'well known' port and process the request, replying with data if necessary. One thread will be created to listen on the port.

A Thread Per Request

This approach will scale well. As each client connects, the listening thread creates a new thread to process the request. By being careful with the design, synchronisation issues can be kept to a minimum, as each request is cocooned within its own thread. This is important because it would be unfortunate to have Mega Tsunami MT™ utilising only 20% of the quad-CPU server because most of the threads are waiting to acquire a shared resource.

There are drawbacks to using this approach. Although it is more efficient to create a thread than a process, this still takes time. On the plus side, that is all the listening thread has to do before listening again. If the application is to process thousands of requests a second, an underlying limit of the operating system will stop the application in its tracks: there is an upper limit on the number of threads an operating system can create and schedule. If this limit is reached, what action should be taken, given that the connection cannot be accepted for processing? Long before this limit is reached, the application is likely to become bound by context switching.
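
A minimal sketch of the listening thread for this model, assuming Winsock and the C runtime's _beginthreadex (the sketch is illustrative, not the article's code):

#include <winsock2.h>
#include <process.h>

unsigned __stdcall HandleRequest(void * arg) {
  SOCKET client = (SOCKET)arg;
  // ... read the request and send the reply ...
  ::closesocket(client);
  return 0;
}

void ListenThread(SOCKET listener) {
  for(;;) {
    SOCKET client = ::accept(listener, NULL, NULL);
    if(INVALID_SOCKET == client)
      break;
    // one thread per request; _beginthreadex can fail
    // once the operating system's thread limit is hit
    unsigned id;
    HANDLE h = (HANDLE)::_beginthreadex(NULL, 0,
                HandleRequest, (void *)client, 0, &id);
    if(h)
      ::CloseHandle(h);  // let it run to completion
    else
      ::closesocket(client); // cannot service it
  }
}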

Context Switching

With multiple processes and threads running on the server, the operating system must maintain the current CPU registers, its context, for each one. When a thread is scheduled to run, the system must initialise the CPU with its context. The thread's execution will then continue from where it left off. This context switch is not free; some of the processor time must be spent performing the switch. If there are 200 live requests currently being serviced on the quad server, only four can run concurrently. The other 196 are waiting to be scheduled. If a thread is blocking on an I/O request, it will not be scheduled until the I/O operation completes. This is a partial saviour for the thread per request model. Assuming that the server does not just compute a result, but performs some I/O as well, some of the threads will be sleeping. The 200 threads will be in one of the following states: sleeping, executing or waiting.

A Pool of Worker Threads

If only four threads can run concurrently, why create more? The application creates four worker threads and adds them to a pool of available threads. When a request is received, a thread from the pool is allocated to action it. When the processing is finished, the thread is returned to the free pool. In itself this is not a good solution: if the server now gets 200 requests, only four will be serviced; the rest will either time out or be serviced by what appears, from the clients' point of view, to be a slow server.

If a form of asynchronous I/O is applied to the worker threads, it becomes possible for the four threads to serve more than four connections. After the thread processes the request it will return data. With asynchronous I/O the thread will not block while this data is returned; execution continues as though the send operation completed in zero time. The thread can then wait for more connections and I/O completion notifications.

The underlying problem of four threads is still there, but they are now handling many more client connections. There will come a point, if the server receives a large number of requests, at which connections will time out.
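
On Win32, one common way to marry a small fixed pool to asynchronous I/O is an I/O completion port. The framework developed below is more general, but as an illustrative sketch of the idea, each worker blocks on the port and wakes only when a completion packet arrives:

#include <windows.h>

unsigned __stdcall PoolWorker(void * arg) {
  HANDLE port = (HANDLE)arg;
  DWORD bytes;
  ULONG_PTR key;
  OVERLAPPED * ov;
  // the port wakes at most its configured concurrency
  // of workers at once; the rest stay parked
  while(::GetQueuedCompletionStatus(port, &bytes,
                             &key, &ov, INFINITE)) {
    if(NULL == ov)
      break;  // a NULL packet is our shutdown signal
    // ... process the completed I/O for the
    // connection identified by 'key' ...
  }
  return 0;
}

// the port would be created with, say, a concurrency
// of one thread per CPU:
//   HANDLE port = ::CreateIoCompletionPort(
//            INVALID_HANDLE_VALUE, NULL, 0, 4);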

Adaptive Thread Pool

So what is the solution? The application cannot just create threads on a whim, and using one thread per processor will have problems with a large number of requests.

An adaptive thread pool would start with one thread per processor and, under a large number of requests, create more threads up to an application-defined upper limit. When the load on the server is reduced, the number of threads in the pool would fall back to one per processor.

Adaptive Thread Pool Example

This example is going to be simple. The requirements are:

  1. The worker thread's function will be application defined.

  2. The creation of the threads will be application defined.

  3. The shutdown mechanism for the threads will be application defined.

  4. The algorithm used to control the number of threads will be application defined.

There are two classes that need to be defined immediately: Thread and ThreadPool. Thread will be a base class for the user-defined thread, and ThreadPool will contain a collection of threads. The Thread class will provide an interface to create, start, stop and pause the thread. The class will also define two pure virtual functions: Stop and the thread entry function. This interface will allow any class derived from Thread to be run as part of the pool or as a stand-alone thread. The class definition is:

class Thread {
public:
  Thread();
  // virtual: derived threads are destroyed
  // through Thread pointers
  virtual ~Thread();
  bool Start();
  bool Pause();

  bool Create(unsigned int createType =
                    CREATE_SUSPENDED);
  // application defined:
  // user defined callback to stop the
  // thread
  virtual bool Stop() = 0;
  // user defined callback for the thread
  // procedure
  virtual int ThreadProc() = 0;
private:
  HANDLE mThread;
};

Making the Thread class abstract answers requirements 1 and 3. As the thread pool contains threads, it makes sense to provide the same control over the pool as over an individual thread. In order to satisfy requirement 2 and allow the application to create the threads, the thread pool must be passed a means of creating them. In the example I chose to use a function: why define a class when a function will do? The thread pool cannot operate without the function, so it is passed as a parameter of the constructor.

// create a function for specific
// object creation
//
// Thread * CALLBACK CreateWorkerThread();
typedef Thread * (CALLBACK *CreateWorker)();
class ThreadPool
{
public:
  ThreadPool(CreateWorker cw);
  ~ThreadPool();
  bool Start();
  bool Stop();
  bool Pause();
  
  // create the pool
  bool Create();
private:
  typedef std::list<Thread *> Threads;
  Threads mThreads;
  // user callback to create the threads
  CreateWorker mCw;
};
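
As a sketch of how an application might supply the factory (PoolWorkerThread is the worker class that appears later in the article):

// the factory the pool calls whenever it
// decides to add a thread
Thread * CALLBACK CreateWorkerThread() {
  return new PoolWorkerThread;
}

ThreadPool pool(CreateWorkerThread);
pool.Create();  // builds the initial threads
pool.Start();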

Solving requirement 4 requires a bit of thinking. The algorithm controlling the thread pool is application defined. How does it know the current number of threads, how many are free, which are processing, and which it should signal to die? By defining an interface class (ThreadPoolAlgorithm) that the application implements, and having the ThreadPool class inherit from the concrete implementation, information from the pool can be passed to the algorithm. The easiest way to do this, when the base class is not known in advance, is with templates.

template<typename base>
class ThreadPool : public base

Now an instance of the pool can be created thus:

ThreadPool<MyAlgorithm> myPool;

The ThreadPoolAlgorithm interface enables the framework to know what functions it can call. If the concrete implementation passed to the ThreadPool does not provide this interface, a compile-time error is generated. What should this interface contain?

class ThreadPoolAlgorithm {
public:
  // returns true if a thread is to be 
  // added
  virtual bool CanAddThread() = 0;
  // called when a thread is added to the
  //  pool
  virtual void ThreadAdded() = 0;
  // called when a thread is removed from
  // the pool
  virtual void ThreadRemoved() = 0;
  // returns true when thread should exit
  virtual bool CanThreadExit() = 0;
  // called when a thread starts
  // processing
  virtual void ThreadBusy() = 0;
  // called when a thread stops
  // processing
  virtual void ThreadFree() = 0;
  // shutdown thread pool
  virtual void Shutdown() = 0;
};

Why use an abstract base class (ABC)? There is no reason that one must be used, as long as a class supplies the implementations for the required functions. The ABC is there to remind the programmer what functions to supply.
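
As an illustration, here is a minimal concrete algorithm; the name MyAlgorithm and its policy are assumptions of mine, not the article's. It grows the pool while every thread is busy, up to a fixed ceiling, and lets surplus threads die once the load drops. A real implementation would protect the counters with InterlockedIncrement or the pool's critical section:

class MyAlgorithm : public ThreadPoolAlgorithm {
public:
  MyAlgorithm()
    : mThreads(0), mBusy(0), mShutdown(false) {}
  virtual bool CanAddThread() {
    // grow only while every thread is occupied
    return !mShutdown && mThreads < kMax
                      && mBusy == mThreads;
  }
  virtual void ThreadAdded()   { ++mThreads; }
  virtual void ThreadRemoved() { --mThreads; }
  virtual bool CanThreadExit() {
    // shrink once less than half the pool is busy
    return mShutdown || (mThreads > kMin
                      && mBusy < mThreads / 2);
  }
  virtual void ThreadBusy() { ++mBusy; }
  virtual void ThreadFree() { --mBusy; }
  virtual void Shutdown()   { mShutdown = true; }
private:
  enum { kMin = 4, kMax = 32 }; // one per CPU minimum
  int mThreads, mBusy;
  bool mShutdown;
};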

The next question is: how does the pool know the state of a thread? The thread should have some way to call back into the ThreadPool. Because there is no requirement to expose the inner workings of the pool, another interface class (ThreadPoolControl) needs to be defined.

class ThreadPoolControl {
public:
  virtual bool CanThreadExit() = 0;
  virtual void RemoveThread(Thread * t) = 0;
  virtual void ThreadBusy() = 0;
  virtual void ThreadFree() = 0;
};

This interface exposes part of the thread pool and thread pool algorithm implementations to the Thread class. By changing the inheritance of ThreadPool to:

template<typename base>
class ThreadPool : public ThreadPoolControl, public base

and changing the Thread constructor to accept a pointer to a ThreadPoolControl object, the thread can fulfil its requirement to let the pool know its state. It can also query whether it should remain in the pool; if it should not, it can let the algorithm know it is leaving. The Thread base class has the following functions added to support this:

class Thread {
public:
  // ...
  void ThreadBusy();
  bool CanThreadExit();
  void ThreadFree();
private:
  ThreadPoolControl * mController;
};
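
The forwarding implementations are trivial. A sketch, assuming the constructor stored the controller pointer (NULL for a free-standing thread):

void Thread::ThreadBusy() {
  if(mController) mController->ThreadBusy();
}

void Thread::ThreadFree() {
  if(mController) mController->ThreadFree();
}

bool Thread::CanThreadExit() {
  // a free-standing thread is never asked to leave
  return mController ?
         mController->CanThreadExit() : false;
}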

An example of an application-defined ThreadProc is:

int PoolWorkerThread::ThreadProc() {
  bool finished = false;
  while(!finished)
  {
    DWORD ret = ::WaitForSingleObject(mEvent, 1000);
    if(WAIT_TIMEOUT == ret)
    {
      ThreadBusy();
      ::Sleep(500);
      finished = CanThreadExit();
      ThreadFree();
    }
    else
    {
      // event signalled ...
      finished = true;
    }
  }
  
  RemoveThread(this);
  return 0;    
}

In this example the thread waits for mEvent to be signalled. If the function times out, it will perform some simulated processing.

Shutting Down the Pool

Shutting down a thread pool is very much application dependent. The pool has no idea of the mechanism used by the thread to wait for action notifications: it could be waiting on events, using I/O completion ports, or using an application-defined mechanism. The Stop method of the Thread class is pure virtual to force application-defined threads to implement this stopping mechanism. If the ThreadProc function is waiting on multiple events, one of which is 'application stop', the Stop function can signal this event and return. The thread will receive it when it is next in the wait-for-event state. In order to ensure that the thread pool algorithm does not try to create more threads, it is also informed of the shutdown.
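
For the PoolWorkerThread::ThreadProc shown earlier, which waits on mEvent, the matching Stop could be as small as this sketch; signalling the event wakes the thread out of WaitForSingleObject:

bool PoolWorkerThread::Stop() {
  // wake the thread; ThreadProc treats the
  // signalled event as its cue to finish
  return ::SetEvent(mEvent) != FALSE;
}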

One possible implementation of the thread pool stop is:

template<typename base>
bool ThreadPool<base>::Stop() {
  HANDLE * h = NULL;
  int n = 0;
  Shutdown();  // tell the algorithm first
  {
    // stop the threads
    Guard<CriticalSection>
                 guard(&mCritSec);
    Threads::iterator i;
    // signal stop and collect the handles
    // to wait on
    n = mThreads.size();
    h = new HANDLE[n];
    n = 0;
    for(i = mThreads.begin(); i !=
             mThreads.end(); ++i, ++n)
    {
      Thread * t = (*i);
      t->Stop();
      h[n] = t->operator HANDLE();
    }
  }
  // wait outside the lock so exiting threads
  // can call back into the pool
  ::WaitForMultipleObjects(n, h,
                  TRUE, INFINITE);
  delete [] h;
  return true;
}

In the world of Win32 this code will inform the algorithm that the pool is shutting down. Then it will signal each thread to stop and call WaitForMultipleObjects to wait for all of the threads to exit.
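
Tying the pieces together, a hypothetical end-to-end usage, reusing the CreateWorkerThread factory and MyAlgorithm sketches from earlier:

int main() {
  ThreadPool<MyAlgorithm> pool(CreateWorkerThread);
  pool.Create();  // initial one-thread-per-CPU pool
  pool.Start();
  // ... service requests until asked to shut down ...
  pool.Stop();    // signal every thread, wait for exit
  return 0;
}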

Conclusion

The approach described provides a reusable framework for both free-standing threads and those that are part of a thread pool. The mechanism to control the number of threads in the pool is defined by the application and easily used by the pool. An application could create a number of pools, each using a different strategy to control the number of threads. The framework requires no prior knowledge of the application's thread functions and can therefore be packaged into a library for use in many projects.

Finally, thanks to Mark Radford and Martin Lucus for helping to turn a set of disjointed notes into ...
