Making programs run as efficiently as possible is a popular topic in C++; it being a fairly low-level language, it is indeed particularly well suited to mapping algorithms to the architecture that evaluates them with minimal overhead. Making that mapping as optimized as possible is a vast domain, which this talk merely introduces by presenting an overview compiling various techniques, their reasoning, when they can be applied, and the challenges associated with generalizing them throughout your source code.
The first part of this talk will focus on understanding the architecture, from which we will deduce what properties code needs to satisfy in order to map efficiently to it, and will cover aspects such as NUMA, multi-core, superscalar execution, instruction pipelining, specialized processing units, caching behavior and branch prediction.
The second part of the talk will present actual programming techniques that can be used to make use of the previously introduced properties, among others: asynchronous programming, strength reduction, tiling, loop unrolling and pipelining, branch elimination, vectorization, mixed precision and specialized algorithms. For each of those we will discuss how C++ templates can help in generalizing and combining those techniques.
Finally we will take a look at some benchmarks to assess how useful those techniques ended up being on particular use cases.