Investigating the limits of extreme micro-optimisation in C++.
After years indulging in the temptations of micro-optimisations, I shall present a definitive, micro-optimiser’s guide to implementing template meta-state machines! This was motivated by the cut-throat arena of High-Frequency Trading (HFT) and my own experiments. Extensive effort has been made to see just how far one can get using extensive template meta-programming to optimise the performance.
Following on from my previous talks regarding optimisations in C++, I undertook a multi-year investigation into extreme micro-optimisation of template meta-state machines. This Herculean effort eventually hit a "hard limit": due to Shannon’s Information Theory… The meta-state machine described was a cut-down variant of the renown Boost.MetaStateMachine (mine lacks guards and other such flexilty). Each transition has been instantiated in a custom-container I termed "unordered_tuple", it provides extremely fast run-time access to each element. The input state has been encoded, using template meta-programming, in the enum tag-value to try to ensure that fastest possible instruction-encoding. The hashing of the state to identify the correct transition in the table uses a custom-designed, data-parallel algorithm that attempts to generate via brute-force an extremely fast, perfect (possibly minimal) hash: this might fail due to inadequate entropy. All this to generate a … computed goto! (Forgive me Dijkstra…) So having erased all type information via the generated computed goto, type-safety was restored via some futher template meta-programming: the "constrained_override_type" that generates a specified override-set of functions and suitable abstract base-class. An analysis of the generated assembler that should demonstrate the efficacy of the techniques shall be given, followed by a performance comparison versus the much more simple "if-else" chain one may have used to implement the lookup of the transition in the meta state-machine.