Valgrind’s Dynamic Heap Analysis Tool: DHAT

By Paul Floyd

Overload, 33(185):15-19, February 2025


Valgrind’s experimental tool DHAT is now official. Paul Floyd explains what this heap analysis tool is and how to use it.

Background

Is it really over 10 years since I last wrote an article on Valgrind? It is indeed [Floyd13]. Back then I wrote about the tools that make up the standard Valgrind toolkit. Since then, one of the experimental tools, exp-sgcheck (the ‘experimental statics and globals check’), has been removed, mainly because of excessive false positives. Another of the tools, exp-dhat, has been promoted from the experimental category to being a first-class component. DHAT is the subject of this article. One other thing that’s happened in that period is that I’ve joined the rather informal team of Valgrind developers [Valgrind]. This means that I’ve progressed from believing that I know roughly how Valgrind works to being able to work on some bits and knowing that I don’t understand most of it.

About DHAT

DHAT is a tool that can give you insights into heap memory use that will allow you to make changes that will make your memory use more efficient.

Since DHAT is part of Valgrind, it will only work on Linux, FreeBSD, Solaris (probably) and macOS (old versions only). I don’t know of any equivalent tool for Windows.

DHAT underwent a major reworking in Valgrind 3.15 (April 2019). In this change

  • The ‘experimental’ status was removed, and the tool name changed from exp-dhat to just dhat.
  • The command line options were simplified.
  • The tool output changed from the console to a file.
  • A web interface was added to view the results file and to allow sorting on different criteria.

If you are using Valgrind 3.14 or earlier, you should still be able to follow this article, but you should expect your output to be different. You will probably want to set the --show-top-n option to a value higher than the default (for instance, --show-top-n=500).

What is DHAT, exactly? It is a data profiler (the acronym stands for Dynamic Heap Analysis Tool) [DHAT]. I expect most readers are familiar with code profiling tools [Wikipedia] (like Callgrind, VTune, Quantify, Linux perf and others). As the Heap Analysis part of the name implies, DHAT performs profiling of memory accesses to blocks of heap memory.

DHAT doesn’t profile the amount of heap allocation (unlike Massif [Massif, Floyd12], another Valgrind tool, Flame Graphs [Gregg] generated with bcc, or heaptrack [Github1]). For every heap-allocated block, DHAT will count every read and write within that block. For larger blocks of over 1024 bytes, it will just aggregate the accesses to the block. For smaller blocks of 1024 bytes and less, it will also generate a map of access counts within the block. I don’t know of any tool that produces a whole-memory heat map, probably because that would have an excessive memory and run time overhead.

DHAT is somewhat difficult to use and works best for structures that get allocated individually on the heap. Having said that, I find it very useful, and I’m not aware of any other tools that perform the same task. There is one non-tool alternative: manual code instrumentation. The problems with manual instrumentation are:

  1. You don’t necessarily know in advance which structures to instrument.
  2. If you want to instrument every member of your structures, that will entail a lot of code.

Using DHAT

DHAT is quite simple to use.

  1. Build your executable, preferably with debug information (adding -g to the build when using GCC or LLVM toolchains).

  2. Run your executable with DHAT:

    valgrind --tool=dhat {your exe name}

    At the end of the run, DHAT will print a summary of the run and instructions on how to view the results. It will also have generated a results file dhat.out.PID, where PID is the process ID of the DHAT run. The results file isn’t meant to be human readable.

  3. Load the results following the instructions from step 2.

Be aware that DHAT, like all of the Valgrind tools, is very slow. I recommend that you only use it with scenarios that run in no more than a few minutes outside of Valgrind.

Example

Let’s look at a small example, starting with a data structure (Listing 1). I’m assuming 64-bit desktop-style applications throughout the examples. The source code and an example of the results, along with the DHAT viewer files, can be found on GitHub [Floyd]. You can view the results on any platform with a web browser.

#include <string>
#include <list>
#include <iostream>
class TestClass
{
  int f1;
  double f2;
  std::string f3;
public:
  TestClass() : f1{}, f3{"small string"} {}
  int getF1() const { return f1; }
};
Listing 1

I have deliberately not initialized f2 in the constructor. I have also deliberately initialized f3 with a short string that will fit in the libc++ ‘short string optimization’ (SSO) buffer. This means that allocating an instance of TestClass only needs one call to operator new. Normally when using DHAT you work backwards from the results to the source code and data structures. For explanatory reasons, I’ll do that the other way round, working forwards from the code to the results. What is the size of TestClass? That depends a bit. The structure has 8-byte alignment, so the total size is:

  sizeof(int) + 4 (hole) + sizeof(double) +
  sizeof(std::string)

The size of std::string depends on the platform. With clang++/libc++ it is 24. With g++/libstdc++ it is 32. Since I’m using FreeBSD amd64 and aarch64, the size that I see is 24, and the size of TestClass is 40. You can check your data structure layouts using a tool called pahole (part of the dwarves package [Github2]). To use pahole you need a binary with debug information. The tool reads the DWARF debug info from the binary and prints a summary of the layouts of all data structures that it finds, including a summary of any wasted space and which blocks of members fit in a cacheline. Figure 1 is the output for TestClass.

class TestClass {
    int           f1;           /*     0     4 */
    /* XXX 4 bytes hole, try to pack */
    double        f2;           /*     8     8 */
    string        f3;           /*    16    24 */
public:
    void TestClass(class TestClass *);
    int getF1(const class TestClass  *);
    void ~TestClass(class TestClass *);
    /* size: 40, cachelines: 1, members: 3 */
    /* sum members: 36, holes: 1, sum holes: 4 */
    /* last cacheline: 40 bytes */
};
Figure 1

The comments at the end of the lines with data members have two numbers. The first is the offset of the member within the class and the second is the size of the member on that line. pahole is a great tool and I strongly recommend its use in conjunction with DHAT.
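If you want to confirm the layout in code rather than with pahole, a few static_asserts will do it. The sketch below assumes the 64-bit libc++ layout described above (the numbers will differ with libstdc++, and offsetof on a class containing a std::string is only conditionally supported), so treat it as a sanity check rather than portable code.

  #include <cstddef>
  #include <string>

  struct LayoutCheck   // mirrors TestClass's data members
  {
    int f1;
    double f2;
    std::string f3;
  };

  static_assert(alignof(LayoutCheck) == 8, "8-byte alignment from double and std::string");
  static_assert(offsetof(LayoutCheck, f2) == 8, "4-byte hole after f1");
  static_assert(offsetof(LayoutCheck, f3) == 16, "f3 follows f2 directly");
  static_assert(sizeof(LayoutCheck) == 40, "matches pahole's total of 40 bytes");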

The second part of the example code is in Listing 2.

int main()
{
  std::list<TestClass> tc;
  std::cout << "Size of TestClass " 
            << sizeof(TestClass) << '\n';
  std::cout << "Size of std::string " 
            << sizeof(std::string) << '\n';
  for (int i = 0; i < 1000; ++i)
  {
    tc.emplace_back();
  }
  int s{};
  for (auto const& elem : tc)
  {
    s += elem.getF1();
  }
  std::cout << "s " << s << '\n';
}
Listing 2

This doesn’t do much. It prints out a couple of sizes to confirm what we saw with pahole. It adds 1000 default-constructed instances of TestClass to a std::list. It then iterates over the list, reading and summing the f1 member. Finally, it outputs the sum, which will be 0 since f1 gets value-initialized to zero.

Running the example

The output that I get is in Figure 2.

$ valgrind --tool=dhat ./main
==1148== DHAT, a dynamic heap analysis tool
==1148== Copyright (C) 2010-2024, and GNU GPL'd, by Mozilla Foundation et al.
==1148== Using Valgrind-3.25.0.GIT and LibVEX; rerun with -h for copyright info
==1148== Command: ./main
==1148==
Size of TestClass 40
Size of std::string 24
s 0
==1148==
==1148== Total:     60,096 bytes in 1,001 blocks
==1148== At t-gmax: 60,096 bytes in 1,001 blocks
==1148== At t-end:  4,096 bytes in 1 blocks
==1148== Reads:     29,080 bytes
==1148== Writes:    58,040 bytes
==1148==
==1148== To view the resulting profile, open
==1148==   file:///home/paulf/tools/valgrind/libexec/valgrind/dh_view.html
==1148== in a web browser, click on "Load...", and then select the file
==1148==   /home/paulf/scratch/accu/accu_dhat/dhat.out.1148
==1148== The text at the bottom explains the abbreviations used in the output.
Figure 2

Lines that start with ==1148== are the console output from DHAT. The other lines are from the ‘main’ test executable. We can see most of what is happening from the summary. The Total is the total amount of memory allocated and the number of allocated blocks. I’ll skip a line down to t-end. DHAT uses its own terminology that can take some getting used to. t-end is at program end, and at that point there is one block of 4,096 bytes. That block is allocated by fwrite in libc during the std::cout output, and FreeBSD libc does not free it.

Getting back to the Total, if fwrite uses 4,096 bytes in 1 block, that leaves 56,000 bytes in 1,000 blocks for main(). That is exactly what I was expecting: 1,000 elements get added to the list, so each element is 56 bytes. We’ve already seen that TestClass is 40 bytes. The other 16 bytes are used by the next and previous pointers of the std::list nodes. t-gmax is the value at the global maximum, and it happens to be the same as the Total. Finally, there are the totals of the numbers of bytes read and written. The number of bytes written is roughly the same as the number of bytes allocated, which makes sense. I’m not sure where all the bytes are being read. I expect that the list traversal to calculate the sum s reads the list next pointer (8 bytes) and f1 (4 bytes), and the list destructor does another traversal (another 8 bytes). That’s 20 bytes. I guess that there is a 1-byte read per element to work out whether the f3 string needs to be deleted or not. There must be one more 8-byte read per element somewhere, giving a total of 29 bytes read per TestClass instance.

Viewing the results

I followed the instructions and opened the link in Firefox.

Note the Legend. I’ll cover the Sort metric drop-down later.

Clicking Load… and opening a results file gives a complex screen even for this small example, so I’ll break it up into small pieces.

Off to an easy start. That’s just a summary of the executable and the PID that ran.

This is still quite simple. Times are really instruction counts, and this tells us when the peak memory occurs, and the total number of instructions executed.

Now for the hard bit. Before I treat you to some pretty colours1, I need to make a stab at explaining what DHAT is doing. Basically, it is just doing two things.

  1. Recording heap allocations (address, length, callstack). I’ll call these allocation contexts.
  2. Counting accesses to the heap allocations.

DHAT calls these allocation contexts ‘Program Points’ (PPs). The PPs get organized as a tree. The root of the tree represents the entire execution of the executable. Each PP is colour coded with darker colours meaning more blocks or memory. There is a threshold of 1% below which PPs do not get displayed.

There are three kinds of PP nodes:

  1. The root node, coloured like the interior nodes.
  2. Interior nodes. These are for allocation contexts that also contain other allocation contexts. They are coloured yellow if their child nodes are collapsed and blue if their child nodes are expanded.
  3. Leaf nodes, for functions that allocate but do not call any other allocating functions. They are colour coded green.

In the example that I’m using there is only a root node and a leaf node.

The following few pictures are of the root node. Before taking the pictures, I collapsed the children, making this yellow. Unfortunately, the viewer does not allow line wrapping.

This looks quite like what we saw in the summary on the console, with some extra information.

Section   Data                             Meaning
Total     Bytes 10,899.38/Minstr           How many bytes get allocated per million instructions. Lower is better.
Total     Blocks 181.55/Minstr             The number of memory blocks allocated per million instructions. Lower is better.
Total     avg size 60.04 bytes             The average size per allocation.
Total     avg lifetime 621,256.21 instrs   The average number of instructions per block between allocation and deallocation, also shown as a %. Lower is better.
Reads     5,274.13/Minstr                  The average number of reads per million instructions. Higher is better; very low or zero indicates a problem.
Reads     0.48/byte                        The average number of reads per byte allocated.
Writes    10,526.49/Minstr                 The average number of writes per million instructions. Higher is better; very low or zero indicates a problem.
Writes    0.97/byte                        The average number of writes per byte allocated. A value of about one may mean objects are getting constructed and initialized and having no subsequent writes.
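As a cross-check against the console summary, the average size is simply the total bytes divided by the total blocks: 60,096 / 1,001 ≈ 60.04 bytes per allocation.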

Not so bad? On with the leaf node.

This is quite similar to the root PP node for most of the information. In order for the text to fit, the Total line has been truncated, as have the standard library function names in the Allocated at section. The Total line is similar to the previous PP. There are a few extras.

The Max line, showing the maximum memory for that leaf PP.

A summary of Accesses. This is the sum of all accesses for all allocations done at that callstack. This does not distinguish between reads and writes. This displays 32 bytes on a line with the access count for each byte. Ditto marks mean that the count is the same as the previous byte. A dash means a count of zero. The first 8 bytes have a count of 3002, probably the list previous pointer. The next 8 bytes were accessed 5001 times, probably the list next pointer. Then there are 4 bytes with an access count of 2000 – that’s the f1 member, each is zero initialized and read once in the sum loop. After that there are 12 bytes without any accesses. 4 of those bytes are the hole in the structure and 8 are for double f2 that I deliberately did not initialize. The second line is the std::string f3. I guess that the first byte is being used as a tag to indicate SSO use with an access count of 2000. Then there are 13 bytes with an access count of 1000 corresponding to "small string\0". Lastly there are 10 bytes with an access count of 0, the unused bytes in the SSO std::string. There isn’t much that can be done in that case. Note that the histogram or access map is only produced for allocations of 1024 bytes and less. This means that you won’t see these maps for any large array-type allocations (like std::vector).

The third thing is that there is the callstack that tells you where the allocations of this kind were done.
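Incidentally, the access map already points at the kind of fix described later under ‘Using the results’: f2 is never accessed, so it is dead data. A hypothetical slimmed-down version of the class (assuming f2 really can be dropped) is sketched below; each element shrinks from 40 to 32 bytes, although the 4-byte hole simply becomes padding next to f1 unless another 4-byte member can fill it.

  #include <string>

  class TestClassSlim     // hypothetical: the never-accessed f2 removed
  {
    int f1;               // 4 bytes, followed by 4 bytes of padding
    std::string f3;       // 24 bytes with libc++
  public:
    TestClassSlim() : f1{}, f3{"small string"} {}
    int getF1() const { return f1; }
  };
  // sizeof(TestClassSlim) == 32 on the platforms discussed above, down from
  // 40, so each std::list node drops from 56 to 48 bytes.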

Sorting

Now I’ll get back to the Sort metric dropdown list. This allows you to change how the display is ordered and filtered. Using this you can concentrate on specific things like peak memory, small allocations, high and low access rates.

Larger ‘access’ maps

If you see a block of memory that is too big for the 1024-byte access map limit, but you would still like to look ‘inside’ it to see how it is being used, there is a way. You will need to instrument the code to enable this.

The first thing that you need to do is to include valgrind/dhat.h.

Secondly you need to use the DHAT_HISTOGRAM_MEMORY Valgrind client request macro, for instance:

  std::vector<uint8_t> vec(2000, 0);
  DHAT_HISTOGRAM_MEMORY(vec.data());

The macro just takes the address of the allocated block. In the example above, the limit has been raised to 2000 bytes. There is still a hard-coded limit of 25,600 bytes on these user-specified access maps (25× the normal default).

This is still fairly limited for general use with C++ containers. For instance, if you have a std::vector that is not allocated up-front like the one in the example above, then it’s tricky to know when the allocated memory grows and the new allocation needs to be flagged for profiling. You could track the vector’s capacity(). Or you could write a custom allocator – please contact me if you do!
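To make the capacity-tracking idea concrete, here is a minimal sketch (my own illustration, not a DHAT facility; push_back_tracked is an invented helper name): re-issue the client request whenever a push_back causes a reallocation.

  #include <cstddef>
  #include <vector>
  #include <valgrind/dhat.h>

  // Push a value and, if the vector had to reallocate, ask DHAT for a
  // byte-level access map of the new block (which must still be within
  // the hard limit mentioned above).
  template <typename T>
  void push_back_tracked(std::vector<T>& vec, const T& value)
  {
    const std::size_t old_capacity = vec.capacity();
    vec.push_back(value);
    if (vec.capacity() != old_capacity)
    {
      DHAT_HISTOGRAM_MEMORY(vec.data());
    }
  }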

Using the results

To round off this article, here are some ways that you can use DHAT.

  1. Identify small, short-lived allocations and convert them to using the stack.
  2. Identify ‘dead data’ (like dead code). These are data fields and entire structures that are never used. You may need to run several tests to get better ‘data coverage’ (like code coverage).
  3. Improve cache hit rate. Look for high access counts with similar values in the access map that are more than 2 text lines apart in the report (corresponding to 64 bytes, or 1 cacheline). Use pahole as a check, and tools like Linux perf stat and perf record to verify any performance changes.
  4. Reduce the peak memory. Look for large allocations that have a long lifetime and see if that memory can be freed earlier. The kind of change that you will be looking to make is to patterns that look like

    alloc A; use A; alloc B; use B; free A; free B;

    where A is no longer needed after ‘use A’. This can be transformed into

    alloc A; use A; free A; alloc B; use B; free B;

Don’t forget that the ‘free’ might be due to the implicit destructor of a standard library container stored in an automatic variable. In that case, ‘free A;’ may require explicit actions like A.clear(); A.shrink_to_fit();, while ‘free B;’ may just be the end of the scope.
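As a concrete illustration of that pattern with standard containers (the function and variable names here are invented for the example):

  #include <vector>

  void lower_the_peak()
  {
    std::vector<int> a(1'000'000);
    // ... use a ...
    a.clear();
    a.shrink_to_fit();   // the explicit 'free A' for an automatic variable

    std::vector<int> b(1'000'000);
    // ... use b ...
  }                      // 'free B' happens implicitly at the end of the scope

  // Before the transformation, a would only have been destroyed at the end
  // of the scope, after b had been allocated, roughly doubling the peak.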

Conclusion

In my opinion DHAT is a little-known hidden gem amongst the Valgrind tools. It is very slow, and the results can be difficult to read. There are no alternatives that I am aware of (other than instrumenting your own code to do the same sort of things).

References

[DHAT] DHAT: https://valgrind.org/docs/manual/dh-manual.html

[Floyd] Paul Floyd, DHAT viewer files (repo: paulfloyd/accu_dhat) https://github.com/paulfloyd/accu_dhat

[Floyd12] Paul Floyd, ‘Valgrind Part 5 – Massif’ in Overload 112, December 2012, available at https://accu.org/journals/overload/20/112/floyd_1884/

[Floyd13] Paul Floyd, ‘Valgrind Part 6 – Helgrind and DRD’ in Overload 114, April 2013, available at https://accu.org/journals/overload/21/114/floyd_1867/

[Github1] Heaptrack: https://github.com/KDE/heaptrack

[Github2] Dwarves: https://github.com/acmel/dwarves

[Gregg] Brendan Gregg, ‘Memory Leak (and Growth) Flame Graphs’, available at https://brendangregg.com/FlameGraphs/memoryflamegraphs.html#Linux

[Massif] Valgrind user manual: https://valgrind.org/docs/manual/ms-manual.html

[Valgrind] Valgrind developers: https://valgrind.org/info/developers.html

[Wikipedia] List of performance analysis tools: https://en.wikipedia.org/wiki/List_of_performance_analysis_tools#C_and_C++

Footnote

  1. If you access the online version of this article, all the screenshots are in colour.

Paul Floyd has been writing software, mostly in C++ and C, for about 30 years. He lives near Grenoble, on the edge of the French Alps and works for Siemens EDA developing tools for analogue electronic circuit simulation. In his spare time, he maintains Valgrind.





