ACCU Home page ACCU Conference Page
Search Contact us ACCU at Flickr ACCU at GitHib ACCU at Google+ ACCU at Facebook ACCU at Linked-in ACCU at Twitter Skip Navigation

pinValgrind Part 6 – Helgrind and DRD

Overload Journal #114 - April 2013 + Programming Topics   Author: Paul Floyd
Debugging multi-threaded code is hard. Paul Floyd uses Helgrind and DRD to find deadlocks and race conditions.

The marketing department has promised a multithreaded version of your application, which was designed ten years before threads were available in commercial OSes [No Bugs]. So you use Callgrind [callgrind] to identify your program’s hotspots and set about retrofitting threads to the code. You know a bit about threads, and manage to get a decent speed up. Unfortunately, the results are intermittently incorrect and the program occasionally locks up. Because the problems seem non-deterministic, they are hard to debug. What to do? Valgrind to the rescue, again. Valgrind includes not one but two tools for diagnosing thread hazards, Helgrind and DRD. Why two? They perform substantially the same work, but Helgrind tends to produce output that is easier to interpret whilst DRD tends to have better performance. DRD means Data Race Detector and I’m not sure where the name Helgrind comes from, maybe it just fits in with the other -grind suffixes.

Both tools work with pthreads and Qt 4 threads (not yet Qt 5, but that should be in Valgrind 3.9 and I believe it is in the current development version of Valgrind under Subversion). In order to detect hazards they record memory access ordering and also thread locking. If you use some other threading library or write your own threading functions then you will need to write your own wrappers in order to be able to use these tools.

Since I’ve mentioned Qt, that leads me to a slight digression on C++ and dynamic libraries. In general, Valgrind tools intercept various C library calls like malloc. In the case of C++, that interception has to be done with the mangled symbols. Valgrind also needs to know the soname of the containing dynamic library (the soname is the library name registered within the library file). With Qt 5, libQt4Core.so became, unsurprisingly, libQt5Core.so.

I will assume that you have a working knowledge of programming with threads. See [threads references] if you need a refresher. In this article, I’ll use simple examples of API misuse, race conditions and deadlock.

The basic mechanisms used for the tools are to record reads and writes at the machine code emulation level plus to intercept calls to the thread synchronization functions. The relative order of reads and writes on different threads can be analysed and if they are not within appropriate synchronization barriers, an error is detected. Locking is not inferred from the machine code, which is why interception is required.

I’ll start with a noddy example of pthreads abuse (see Listing 1).

// noddy.cpp
#include <pthread.h>
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

int main()
{
  pthread_mutex_unlock(&mutex);
}
			
Listing 1

You can’t get much easier than that, example wise. The command line is equally easy:

  valgrind --tool=helgrind -v --log-file=hg.log
  ./noddy

The resulting log file starts with about 70 lines of information about Valgrind, shared libraries and symbols. I won’t include it here. The important part is shown in Figure 1.

==2573== ---Thread-Announcement------------------------------------------
==2573==
==2573== Thread #1 is the program's root thread
==2573==
==2573== ----------------------------------------------------------------
==2573==
==2573== Thread #1 unlocked an invalid lock at 0x601040
==2573==    at 0x4C2EDBE: pthread_mutex_unlock (hg_intercepts.c:609)
==2573==    by 0x400659: main (noddy.cpp:6)
			
Figure 1

So we know which thread we’re on, and we’re informed of the misuse of pthread_mutex_unlock.

Let’s run the same example with DRD (see Figure B):

  valgrind --tool=drd -v --log-file=drd.log ./noddy

The output is a little bit terser (see Figure 2), but the same information is there.

==2570== The object at address 0x601040 is not a mutex.
==2570==    at 0x4C3180E: pthread_mutex_unlock (drd_pthread_intercepts.c:703)
==2570==    by 0x400659: main (noddy.cpp:6)
			
Figure 2

Next up an example with a race condition. Basically this uses a usleep to pretend to do some work [with the nice property that Valgrind sleeps at the same speed as the application running at full speed]. However, there’s a static variable in the third file that I accidentally ‘forgot’. When I run the application, I usually get the correct result, but about a quarter of the time the result is 1 or 2 too low (see main.cpp in Listing 2, worker.cpp in Listing 3 and other.cpp in Listing 4).

// main.cpp
#include <pthread.h>
#include <iostream>

void* worker_thread(void* in);
int backdoor();

int main()
{
  pthread_t t1, t2;
  pthread_create(&t1, 0, worker_thread,
     reinterpret_cast<void*>(7));
  pthread_create(&t2, 0, worker_thread,
    reinterpret_cast<void*>(13));

  pthread_join(t1, 0);
  pthread_join(t2, 0);

  std::cout << "Forgotten static is "
            << backdoor() << " ought to be "
            << 1000000/7 + 1000000/13 + 2 << "\n";
}
			
Listing 2
// worker.cpp
void other_work(int n, int i);

void* worker_thread(void* in)
{
  int number = reinterpret_cast<long>(in);
  for (int i = 0; i < 1000000; ++i)
  {
    other_work(number, i);
  }
  return 0;
}
			
Listing 3
// other.cpp
#include <unistd.h>

static int forget_me_not;

void other_work(int n, int i)
{
  if (i % n == 0)
  {
    ++forget_me_not;
  }
  else
  {
    ::usleep(1);
  }
}

int backdoor()
{
  return forget_me_not;
}
			
Listing 4

This is a simple example of omission to protect some shared resource from threaded write access. In real world problems, the problem could also be inappropriate protection, such as a write within a section that is only protected by a read lock.

Now let’s run Helgrind.

 valgrind --tool=helgrind -v    --log-file=helgrind.log ./test

Then it tells us about the number of threads created. Two, as expected (see Figure 3).

==2678== ---Thread-Announcement------------------------------------------
==2678==
==2678== Thread #3 was created
==2678==    at 0x594928E: clone (in /lib64/libc-2.15.so)
==2678==    by 0x4E3BFF4: do_clone.constprop.4 (in /lib64/libpthread-2.15.so)
==2678==    by 0x4E3D468: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.15.so)
==2678==    by 0x4C2E27F: pthread_create_WRK (hg_intercepts.c:255)
==2678==    by 0x4C2E423: pthread_create@* (hg_intercepts.c:286)
==2678==    by 0x400A6A: main (main.cpp:11)
==2678==
==2678== ---Thread-Announcement------------------------------------------
==2678==
==2678== Thread #2 was created
==2678==    at 0x594928E: clone (in /lib64/libc-2.15.so)
==2678==    by 0x4E3BFF4: do_clone.constprop.4 (in /lib64/libpthread-2.15.so)
==2678==    by 0x4E3D468: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.15.so)
==2678==    by 0x4C2E27F: pthread_create_WRK (hg_intercepts.c:255)
==2678==    by 0x4C2E423: pthread_create@* (hg_intercepts.c:286)
==2678==    by 0x400A4F: main (main.cpp:10)
==2678==
			
Figure 3

So far so good. Then it detects a possible data race (see Figure 4).

==2678== ----------------------------------------------------------------
==2678==
==2678== Possible data race during read of size 4 at 0x602198 by thread #3
==2678== Locks held: none
==2678==    at 0x400B5B: other_work(int, int) (other.cpp:9)
==2678==    by 0x400BAE: worker_thread(void*) (worker.cpp:8)
==2678==    by 0x4C2E40C: mythread_wrapper (hg_intercepts.c:219)
==2678==    by 0x4E3CE0D: start_thread (in /lib64/libpthread-2.15.so)
==2678==    by 0x59492CC: clone (in /lib64/libc-2.15.so)
==2678==
==2678== This conflicts with a previous write of size 4 by thread #2
==2678== Locks held: none
==2678==    at 0x400B64: other_work(int, int) (other.cpp:9)
			
Figure 4

Now, when it says ‘possible’, don’t fool yourself into thinking ‘oh, that will never happen’. If the threads run for long enough simultaneously, the chances are that it will. However, there is some chance that you might be lucky. Since Valgrind uses an abstract model for the emulation, it isn’t always identical to the underlying CPU. In particular, Intel/AMD perform less reordering than platforms like POWER which can perform stores out of order. So whilst there may be no error on Intel, the same code could could fail on POWER.

In this case, it is a read/write hazard. There is also a write/write hazard (I’ll just show the top of the stack for brevity in Figure 5).

==2678== Possible data race during write of size 4 at 0x602198 by thread #3
==2678== Locks held: none
			
Figure 5

The log ends with a summary of the test.

Now the same test, but using DRD.

valgrind --tool=drd -v --log-file=drd.log ./test

This generates similar header information and again detects the two read/write and write/write race conditions (only the read/write one shown in Figure 6).

==2669== Thread 3:
==2669== Conflicting load by thread 3 at 0x00602198 size 4
==2669==    at 0x400B5B: other_work(int, int) (other.cpp:9)
==2669==    by 0x400BAE: worker_thread(void*) (worker.cpp:8)
==2669==    by 0x4C2E2A4: vgDrd_thread_wrapper (drd_pthread_intercepts.c:355)
==2669==    by 0x4E47E0D: start_thread (in /lib64/libpthread-2.15.so)
==2669==    by 0x59542CC: clone (in /lib64/libc-2.15.so)
==2669== Allocation context: BSS section of /home/paulf/vg_thread/test
==2669== Other segment start (thread 2)
==2669==    at 0x4C3180E: pthread_mutex_unlock (drd_pthread_intercepts.c:703)
==2669==    by 0x4C2E29E: vgDrd_thread_wrapper (drd_pthread_intercepts.c:236)
==2669==    by 0x4E47E0D: start_thread (in /lib64/libpthread-2.15.so)
==2669==    by 0x59542CC: clone (in /lib64/libc-2.15.so)
==2669== Other segment end (thread 2)
==2669==    at 0x5925CAD: ??? (in /lib64/libc-2.15.so)
==2669==    by 0x594E6B3: usleep (in /lib64/libc-2.15.so)
==2669==    by 0x400B75: other_work(int, int) (other.cpp:13)
==2669==    by 0x400BAE: worker_thread(void*) (worker.cpp:8)
==2669==    by 0x4C2E2A4: vgDrd_thread_wrapper (drd_pthread_intercepts.c:355)
==2669==    by 0x4E47E0D: start_thread (in /lib64/libpthread-2.15.so)
==2669==    by 0x59542CC: clone (in /lib64/libc-2.15.so)
			
Figure 6

There are two differences with Helgrind. Firstly, it tells us where the conflicting memory was allocated (in this case, BSS, a file static variable). The first location is given precisely, but the second one only gives a starting and ending range. In this case, between the start of the thread and line 13 of other.cpp.

For my final example which illustrates deadlock, I’ll use Qt 4 this time. Deadlock occurs when you have cycles in the graph that represents your lock/unlock states, or conversely, if your lock/unlock states form a tree, you won’t have any deadlocks. This is described in [correct parallel algorithms]. If you have two or more locks, then always lock/unlock them in the same order. You won’t have any deadlocks, though you may get thread starvation, a lot of contention and poor performance.

There are several ways of creating threads with Qt and I’ll use inheritance for simplicity. I’m using explicit locks and unlocks in order to make the deadlocking more obvious. In practice, I’d recommend using QmutexLocker as it will make your code shorter and safer (see Listing 5).

// main.cpp
#include <QThread>
#include <QMutex>
#include <iostream>

QMutex m1;
static int resource1;

QMutex m2;
static int resource2;

class MyThread1 : public QThread
{
public:
  void run();
};

static int LOOPS = 10000;

void MyThread1::run()
{
  for (int i = 1; i < LOOPS; ++i)
  {
    m1.lock();
    usleep(10);
    m2.lock();
    if (resource1 % 3)
      ++resource2;
    usleep(10);
    m2.unlock();
    m1.unlock();
  }
}
class MyThread2 : public QThread
{
public:
  void run();
};

void MyThread2::run()
{
  for (int i = 1; i < LOOPS; ++i)
  {
    m2.lock();
    usleep(10);
    m1.lock();
    if (resource2 % 2 == 0)
      ++resource1;
    usleep(10);
    m1.unlock();
    m2.unlock();
  }
}

int main()
{
  MyThread1 t1;
  MyThread2 t2;

  t1.run();
  t2.run();

  t1.wait();
  t2.wait();

}
			
Listing 5

Let’s run that with Helgrind. I built it using Qt Creator, which made a funky directory for me

  valgrind  --tool=helgrind -v --log-file=qthg.log
 ./QtExample-build-desktop-Qt_4_8_1_in_PATH__
System__Debug/QtExample

Figure 7 shows the output from Helgrind.

==13045== Thread #1: lock order "0x602160 before 0x602168" violated
==13045==
==13045== Observed (incorrect) order is: acquisition of lock at 0x602168
==13045==    at 0x4C2CA24: QMutex_lock_WRK (hg_intercepts.c:2058)
==13045==    by 0x4C30715: QMutex::lock() (hg_intercepts.c:2067)
==13045==    by 0x401228: MyThread2::run() (main.cpp:44)
==13045==    by 0x4012C5: main (main.cpp:61)
==13045==
==13045==  followed by a later acquisition of lock at 0x602160
==13045==    at 0x4C2CA24: QMutex_lock_WRK (hg_intercepts.c:2058)
==13045==    by 0x4C30715: QMutex::lock() (hg_intercepts.c:2067)
==13045==    by 0x40123C: MyThread2::run() (main.cpp:46)
==13045==    by 0x4012C5: main (main.cpp:61)
==13045==
==13045== Required order was established by acquisition of lock at 0x602160
==13045==    at 0x4C2CA24: QMutex_lock_WRK (hg_intercepts.c:2058)
==13045==    by 0x4C30715: QMutex::lock() (hg_intercepts.c:2067)
==13045==    by 0x40118A: MyThread1::run() (main.cpp:23)
==13045==    by 0x4012B9: main (main.cpp:60)
==13045==
==13045==  followed by a later acquisition of lock at 0x602168
==13045==    at 0x4C2CA24: QMutex_lock_WRK (hg_intercepts.c:2058)
==13045==    by 0x4C30715: QMutex::lock() (hg_intercepts.c:2067)
==13045==    by 0x40119E: MyThread1::run() (main.cpp:25)
==13045==    by 0x4012B9: main (main.cpp:60)
			
Figure 7

There is a first error that is a false positive. See https://bugs.kde.org/show_bug.cgi?id=307082 which says that pthread_cont_init is not handled, while the second error is the deadlock.

DRD does not detect any error with this example. Unfortunately I’ve run out of time to get an explanation for this article.

So, in summary, multithreaded programming is difficult to debug, but Valgrind can make it easier.

In this series I’ve covered the main tools that make up Valgrind. However, there’s still more. Valgrind contains a few experimental tools like checking stack and global variables and a data access profiler. Even after that there is still more. Since Valgrind is open source, there are several third party extensions available, most notably including support for Wine and another variation on thread checking, ThreadSanitizer [thread sanitizer]. I have no experience with them – yet.

References

[callgrind] Overload 111, ‘Valgrind Part 4: Cachegrind and Callgrind’, Paul Floyd

[correct parallel algorithms] Overload 111, ‘A DSEL for Addressing the Problems Posed by Parallel Architectures’, Jason McGuiness and Colin Egan

[No Bugs] They didn’t listen to ‘No Bugs’ bunny. Overload 97, ‘Single-Threading: Back to the Future?’ and Overload 98, ‘Single-Threading: Back to the Future? Part 2’, Sergey Ignatchenko

[thread sanitizer] http://code.google.com/p/data-race-test/wiki/ThreadSanitizer

[threads references]

1. Multithreaded Programming With Pthreads, (ISBN 10: 0136807291 / ISBN 13: 9780136807292), Lewis, Bil, Berg, Daniel J., Sun Microsystems Press

2. Programming with POSIX Threads, (ISBN-10: 0201633922 / ISBN-13: 978-0201633924), David R. Butenhof , Addison-Wesley Professional

Overload Journal #114 - April 2013 + Programming Topics