Marking Benches

Marking Benches

By Russel Winder

Overload, 25(141):5-7, October 2017

Too slow! Russel Winder reminds us that benchmarking code performance is a complicated issue.

In the article ‘Mean Properties’ [Winder17] I, hopefully humorously, but definitely contentiously, stated in a footnote:

Of course we have no real data on this hypothesis without undertaking some sensible benchmarking activity. Which we will not be doing in this article since it is far too much effort.

It may have been too much effort in the context of that article for that article, but clearly measuring performance is extremely important when performance of code execution becomes an issue.

A short rant

Far too regularly I, and I am sure all of you reading this, hear people saying things like “but this code is much faster than that code” and leave it at that as though saying something is the case makes it true 1 . If you are involved in a conversation with someone making these sort of sweeping judgements, you might want to ask them to rephrase. If the person rewords to something along the lines of “I believe that, on appropriately measuring, this code will execute faster than that code in the measurement context” then you know they initially just mis-phrased things in the heat of the moment, and actually have a reasonable view of software performance. If the person persists with “but this code is much faster than that code, just look at the code” then you may just want to shun the person publicly till they re-educate themselves.

Repairing a previous mis-phrasing

In [Winder17] I stated “…this code is likely to be much slower than using NumPy.” Not an obviously outrageous phrase since it is not claiming a performance fact, but it is raising a performance question, a question that really should be answered.

Questions such as this, as with any hypotheses in scientific method, lead to setting up an experiment to obtain data, experiments that are reproducible and with good statistical backing. In essence this means running the codes a goodly number of times in consistent and stable settings. Doing this sort of experimentation manually is, at least for all programmers I know, seriously tedious, leading either to short-cuts or no experimentation. There have to be frameworks to make this sort of experimentation easy.

The Python approach

In [Winder17] I used [pytest] and [Hypothesis] as frameworks for writing tests for correctness of the implementation code. The goal was to emphasise property-based over example-based testing for correctness. pytest also has support for benchmarking, i.e. checking the performance of code. It is not built in to pytest as distributed, but is a trivially installable plugin [pytest-benchmark] .

pytest-benchmark provides a fixture to be used with test functions to create benchmark code. Whilst benchmarking can (and sometimes is) intermingled with correctness testing code, it is far more normal and idiomatic to separate correctness testing and benchmarking.

As an example let us address the code from the previous article. Here we will just use the Numpy version:

  import numpy
    mean = numpy.mean

and the corrected pure Python version:
  def mean(data):
    if len(data) == 0:
      raise ValueError \
        ('Cannot take mean of no data.')
	return sum(data) / len(data)

omitting the original pure Python version. The hypothesis we are testing is that: the Numpy version is faster than the pure Python version.

So how to collect data? We write what appear to be pytest tests but use the benchmark fixture provided by pytest-benchmarking to create benchmark executions of the functions instead of the correctness testing functions we wrote for correctness testing.

In Listing 1, pytest is imported on the assumption that the pytest-benchmark plugin is available. the module random is imported as a dataset has to be generated: data is a random collection of numbers in the range (0.0, 1.0), here 100,000 items are in the dataset. Since both mean function implementation are called mean, the alias feature of the import statement is used to distinguish the Numpy implementation and the pure Python implementation. Then there are the two test functions, which use the benchmark fixture to run the benchmarks of the functions using the dataset provided. (The output has been tabulated – see Table 1 – which omits some information but makes it much easier to read.)

import pytest

from random import random

from python_numpy import mean as mean_numpy
from python_pure_corrected import mean as \

data = [random() for _ in range(0, 100000)]

def test_mean_numpy(benchmark):
  benchmark(mean_numpy, data)

def test_mean_pure(benchmark):
  benchmark(mean_pure, data)
Listing 1
Name test_mean_pure time in µs test_mean_numpy time in µs
Min 856.063 1 4461.969 5.21
Max 1079.325 1 4731.414 4.38
Mean 872.459 1 4476.2517 5.13
StdDev 8.2584 1 22.6862 2.75
Median 871.892 1 4472.543 5.13
IQR 1.4317 1 9.9422 6.94
Outliers* 45;342 7;10
Rounds 1055 165
Iterations 1 1

* Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.

Table 1

Well, that was unexpected 2 . It appears that we can conclude that the Numpy version is 5-ish times slower given the mean and median values of the time taken to execute the function.

Oh come on, this cannot be…

So we have actual data that pure Python is much faster at calculating means on datasets than Numpy mean. It’s reproducible, and thus incontrovertible. The original hypothesis is disproved 3 .

Yet our expectation remains that pure Python is interpreted Python bytecodes, and thus slow, whereas Numpy is implemented in C and thus fast. It is unbelievable, yet provably true, that pure Python is faster than Numpy.

Ah, but… the input data was a Python list, Numpy works on Numpy arrays. Mayhap we can unpack this idea with a new benchmark. Listing 2 adds constructing a Numpy array form of the data and then running the Numpy mean function on the array. Note we keep the two original tests (see Table 2).

import pytest

from numpy import array
from random import random

from python_numpy import mean as mean_numpy
from python_pure_corrected import mean as \

data = [random() for _ in range(0, 100000)]
data_as_array = array(data)

def test_mean_numpy_list(benchmark):
  benchmark(mean_numpy, data)
def test_mean_numpy_array(benchmark):
  benchmark(mean_numpy, data_as_array)
def test_mean_pure(benchmark):
  benchmark(mean_pure, data)
Listing 2
Name test_mean_numpy_array time in µs test_mean_pure time in µs test_mean_numpy_list time in µs
Min 100.484 1 841.832 8.38 4572.266 45.5
Max 153.946 1 945.04 6.14 5019.503 32.61
Mean 101.8516 1 848.8424 8.33 4626.392 45.42
StdDev 3.5793 1 9.227 2.58 49.3561 13.79
Median 101.004 1 846.704 8.38 4623.92 45.78
IRQ 0.406 1 0.9293 2.29 44.3635 109.28
Outliers* 324;417 62;308 26;5
Rounds 3314 1011 175
Iterations 1 1 1

* Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.

Table 2

OK, now this looks more like it: pure Python on a Python list is 8-ish times slower than Numpy on a Numpy array. Numpy mean on Python list remains as slow as previously.

We have a new Quasi-Hypothesis

Passing Python data structures to Numpy functions is a really bad idea, thus the mean implementation of the Numpy version is a very poor implementation.

By making a Numpy function appear to be a pure Python function, the conversion of the Python list to a Numpy array is hidden. This conversion clearly has performance implications and so must be made explicit in performance sensitive code. Essentially, when using Numpy, always use Numpy data structures: do not make it convert Python ones.

Proper benchmarking brought this to light. Now to change all the code to avoid the problem. And then benchmark again to make sure the suppositions and hypotheses are vindicated.


I have glossed over many important points of data collection, samples, statistics, and normality, in this article – the goal being to enthuse people into using benchmarking frameworks for gathering real, reproducible data to back up claims of performance. Using a framework such as pytestbenchmark has some assumptions built in that arguably may not be formally valid, and so the results presented may not be ‘correct’ in a formal statistical sense. However, the framework gives us data on our code’s performance that is valid enough for us to make deductions, inferences and corrections. Thus it is an incredibly useful tool for ‘rough and ready’ performance checking.

Oh that all programming languages had such useful benchmarking frameworks. Some have, e.g. Python and Java. This is an interesting article on benchmarking frameworks for C++ [Filipek16] .


[Filipek16] Micro benchmarking libraries for C++.




[Winder17] ‘Mean Properties’, Overload 137, February 2017.

  • OK, as from the second half of 2016 in the new ‘post-truth’ world, the norm in society in general is for things stated to be true to be true, but we must not let software development and programming be polluted by this unscientific posturing.
  • No-one expects the Spanish Inquisition. It’s a Python thing, a Monty Python thing. The_Spanish_Inquisition_(Monty_Python)
  • Technically we need to do some analysis of variance and significance testing to make this statement. However, for the purposes of this article, I shall gloss over this point, even though it pains me to do so: ANOVA, F-tests and t-tests are so much fun.

Your Privacy

By clicking "Accept Non-Essential Cookies" you agree ACCU can store non-essential cookies on your device and disclose information in accordance with our Privacy Policy and Cookie Policy.

Current Setting: Non-Essential Cookies REJECTED

By clicking "Include Third Party Content" you agree ACCU can forward your IP address to third-party sites (such as YouTube) to enhance the information presented on this site, and that third-party sites may store cookies on your device.

Current Setting: Third Party Content EXCLUDED

Settings can be changed at any time from the Cookie Policy page.