Using ‘new’ without care can be slow. Paul Floyd uses Godbolt’s compiler explorer to see what happens when you do.
There are two influences that have inspired me to write this article. The first is that I’ve been playing a lot with Compiler Explorer ( https://godbolt.org ). Secondly, a while back I read Optimized C++ by Kurt Guntheroth. It contains a chapter on using dynamic memory (Chapter 6: Optimize Dynamically Allocated Variables).
I agree with a lot of what is said. In short, there is a description of the types of memory available in C++ (automatic, dynamic and static); a description of how this memory relates to variables in code; APIs to deal with dynamic memory; smart pointers; and many tips on optimizations related to dynamic memory. In this article, I’m going to explore why you should be trying to optimize use of new by digging down to the machine code.
When I’m looking at production C++ code I do see a lot of gratuitous uses of new, for instance:

list<Handle>* handleList = new list<Handle>;
...
processList(handleList);
I suspect that there are a few possible reasons for writing such code:
- the influence of Java
- the assumption that where an API takes a pointer, you have to pass it a pointer allocated on the heap
- not realizing that, at least for standard library containers, the object itself is quite small and that the bulk of the memory (what is contained) is dynamically allocated. A std::list is only 24 bytes, for instance (on 64-bit Linux with GCC).
Obviously, in the second case the memory doesn’t need to be allocated dynamically. The top-level object can perfectly well be on the stack. For instance, the above example could have been written:

list<Handle> handleList;
...
processList(&handleList);
In addition, if I could change the interface to the processList API, I would almost certainly change it to take a reference rather than a pointer. Furthermore, std::list is rarely a good choice when it comes to performance, so I’d probably also change that if I could.
Both versions of the code do the same thing. So, what is wrong with the first version?
Memory refresher
For those of you who are a bit rusty on the difference between dynamic and automatic memory (I’ll skip static), here is a quick refresher (Wikipedia has a longer description with diagrams [Wikipedia]). Firstly, they are commonly known by other names, referring to how they are often implemented: dynamic allocation is also known as heap allocation, and automatic allocation as stack allocation. The stack is a large block of memory. It is referred to by CPU registers such as the Stack Pointer. Memory can be ‘allocated’ on the stack very quickly, simply by manipulating the stack registers. The main drawbacks of stack memory are that it does not persist beyond the current scope and it can be quite limited in size. Heap memory is a separate block of memory, but this time it is controlled via functions like malloc [C++Ref-a] and operator new [C++Ref-b]. It’s not so limited in size and persists until explicitly deallocated.
Problems with new
Performance
There are a couple of reasons why heap allocation has worse performance than stack allocation.
- More memory is required – the allocator generally needs a small amount of memory for housekeeping in addition to the memory requested by the caller. In addition, the amount of memory might get rounded up to a larger size, such as the nearest power of 2, whilst stack allocation probably only rounds up to the nearest machine word size. On 64-bit Linux, allocations have a 16-byte minimum size and there is an 8-byte housekeeping overhead.
- The biggest difference is that functions like new and delete are relatively slow. Much effort has been made by library and OS writers to make them as fast as possible (for instance, see the history of jemalloc [github-a]).
There is a third consideration, concerning the size of the code that gets generated. This complicates the picture because it isn’t always a like-for-like comparison. When you use stack allocation, it generally means that you are using RAII and the compiler will generate the code necessary for clean-up at the end of the scope. When you use heap allocation with raw pointers as above, it’s up to you to ensure that resources get freed.
About the example code
In the examples that follow, I will continue to use handleList. In my testing I defined Handle to be:

class Handle {
public:
    int data;
};
It doesn’t matter what Handle is. The only thing of importance is that handleList itself is something that needs some memory. I’m going to stick with the std::list in the examples for two reasons. Firstly, I want there to be a code smell. Secondly, and more seriously, though the examples presented here are trivial, I don’t want them to be so small that the compiler optimizes them to almost nothing. Also for that reason, I’ve added calls to an externally defined populateList function.
The assembly was generated on Compiler Explorer using 64-bit GCC 7.3.
Comparison of allocation methods
Digging deeper into the different allocation methods, if we have a stack allocation function like this:
void processStack()
{
    list<Handle> handleList;
    populateList(&handleList);
}
the optimized machine code that gets generated is shown in Table 1.

[Table 1: generated assembly for processStack]
Note that for non-exceptional flow, only items in blocks A to E in column 3 are executed. Block F only gets called when exceptions are thrown via the stack unwinding mechanism.
On the other hand, for heap allocation that does not ensure proper clean-up, like this:
void processHeap()
{
    list<Handle>* handleList = new list<Handle>;
    populateList(handleList);
    delete handleList;
}
the machine code that gets generated is in Table 2.
[Table 2: generated assembly for processHeap]
Thus, it has lost exception safety but made the generated code slightly shorter.
To regain exception safety with heap allocation, still using raw pointers, we would need to write something like Listing 1.
void processHeapNoLeak()
{
    list<Handle>* handleList = nullptr;
    try {
        handleList = new list<Handle>;
        populateList(handleList);
        delete handleList;
    }
    catch (...) {
        delete handleList;
        throw;
    }
}

Listing 1
That doesn’t look too pretty. Please don’t do this at home. The generated machine code for this is in Table 3.
[Table 3: generated assembly for processHeapNoLeak]
Moving on quickly, let’s consider a fourth alternative, using a smart pointer.
void processHeapSmartPtr()
{
    auto handleList = make_unique<list<Handle>>();
    populateList(handleList.get());
}
OK, that’s code that I could live with.
Note that if you want to use only the smart pointer and not the underlying raw pointer, you would have to rewrite or overload populateList. When I tried this, I noticed that passing a reference to the unique_ptr prevented the compiler from inlining and optimizing the pointer use, resulting in more register use and larger code. Furthermore, the CppCoreGuidelines discourage passing references to smart pointers in cases like this [github-b]. When get() is used, there is no interface issue and the code size is barely any larger than the stack version.
The code flow in this case is quite similar to processStack.
The machine code for this function is in Table 4.
[Table 4: generated assembly for processHeapSmartPtr]
You may have noticed that the non-exception paths for the three versions using the heap are the same (lines 1 to 25 and blocks A to D in the tables of assembler). The only thing that is different is how they handle exceptions.
Here is a summary of the size of the code generated. The byte sizes were obtained using nm.
Performance
I did some measurements of these 4 functions. I just wrote a main() with a loop running a million times, calling the 4 functions with empty stub populateList functions.
With Valgrind callgrind, I got the following numbers of op-codes executed per call.
Function | Op-codes executed
---|---
processStack | 23
processHeap | 27
processHeapNoLeak | 27
processHeapSmartPtr | 29
These are the exclusive counts, i.e., only for the functions themselves. Whilst callgrind counts every instruction, by default it doesn’t record functions that take less than 1% of the total, so adding a loop is an easy way to ensure that they are included in the output. I got a slightly higher count for processHeapSmartPtr because I was doing these tests with GCC trunk; I expect that if I’d used GCC 7.3 the count would have been the same as for the other two heap functions.
This is pretty much what I was expecting:
- processStack is the fastest but not the smallest, due to the exception handling
- processHeap is the smallest because it does no exception handling
- all of the functions using the heap execute similar numbers of machine instructions.
The picture is very different for the inclusive counts, that is the functions plus any callees.
Function | Op-codes executed
---|---
processStack | 24
processHeap | 360
processHeapNoLeak | 360
processHeapSmartPtr | 362
There are two things that stand out:
- The number of op-codes executed hasn’t changed for the stack version, other than the call to the empty stub.
- The 3 heap versions have essentially the same count.
As usual, you may get different results on different platforms/compilers – I tried a couple of others and the results were broadly similar. Furthermore, this test case just uses stub functions. With real functions that actually do something the cost of the heap allocation would become relatively smaller. That said, the stack allocation here is around 15 times faster.
Conclusions
I think that the case is more or less settled. Use stack allocation where you can. It’s safer, faster and requires writing the least code. Obviously, there are times when you need heap allocation:
- when you have huge data
- when you need extended object lifetime
- when you have self-referential data structures like graphs
- when you want to decouple interfaces like with the pimpl idiom
- when you have deeply recursive functions.
When you have to use heap allocation, use smart pointers. There is a small code size overhead, but if you use make_unique (or make_shared) then the difference in time performance compared to raw pointers is negligible, with the benefit of not having to worry about resource leaks.
Acknowledgements
Thanks to the reviewers for pointing out my omissions and inconsistencies.
Thanks also to Matt Godbolt for providing Compiler Explorer.
References
[C++Ref-a] malloc: http://en.cppreference.com/w/cpp/memory/c
[C++Ref-b] operator new: http://en.cppreference.com/w/cpp/memory/new
[github-a] jemalloc: https://github.com/jemalloc/jemalloc/wiki/Background
[github-b] CppCoreGuidelines: http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#r30-take-smart-pointers-as-parameters-only-to-explicitly-express-lifetime-semantics
[Wikipedia] Data segment: https://en.wikipedia.org/wiki/Data_segment