Marking Benches

Too slow! Russel Winder reminds us that benchmarking code performance is a complicated issue.

In the article ‘Mean Properties’ [Winder17] I, hopefully humorously, but definitely contentiously, stated in a footnote:

Of course we have no real data on this hypothesis without undertaking some sensible benchmarking activity. Which we will not be doing in this article since it is far too much effort.

It may have been too much effort in the context of that article for that article, but clearly measuring performance is extremely important when performance of code execution becomes an issue.

A short rant

Far too regularly I, and I am sure all of you reading this, hear people saying things like “but this code is much faster than that code” and leave it at that as though saying something is the case makes it true ¹ . If you are involved in a conversation with someone making these sort of sweeping judgements, you might want to ask them to rephrase. If the person rewords to something along the lines of “I believe that, on appropriately measuring, this code will execute faster than that code in the measurement context” then you know they initially just mis-phrased things in the heat of the moment, and actually have a reasonable view of software performance. If the person persists with “but this code is much faster than that code, just look at the code” then you may just want to shun the person publicly till they re-educate themselves.

Repairing a previous mis-phrasing

In [Winder17] I stated “…this code is likely to be much slower than using NumPy.” Not an obviously outrageous phrase since it is not claiming a performance fact, but it is raising a performance question, a question that really should be answered.

Questions such as this, as with any hypotheses in scientific method, lead to setting up an experiment to obtain data, experiments that are reproducible and with good statistical backing. In essence this means running the codes a goodly number of times in consistent and stable settings. Doing this sort of experimentation manually is, at least for all programmers I know, seriously tedious, leading either to short-cuts or no experimentation. There have to be frameworks to make this sort of experimentation easy.

The Python approach

In [Winder17] I used [pytest] and [Hypothesis] as frameworks for writing tests for correctness of the implementation code. The goal was to emphasise property-based over example-based testing for correctness. pytest also has support for benchmarking, i.e. checking the performance of code. It is not built in to pytest as distributed, but is a trivially installable plugin [pytest-benchmark] .

pytest-benchmark provides a fixture to be used with test functions to create benchmark code. Whilst benchmarking can (and sometimes is) intermingled with correctness testing code, it is far more normal and idiomatic to separate correctness testing and benchmarking.

As an example let us address the code from the previous article. Here we will just use the Numpy version:

  # python_numpy.py
  import numpy
    mean = numpy.mean

and the corrected pure Python version:

  #python_pure_corrected.py
  def mean(data):
    if len(data) == 0:
      raise ValueError \
        ('Cannot take mean of no data.')
	return sum(data) / len(data)

omitting the original pure Python version. The hypothesis we are testing is that: the Numpy version is faster than the pure Python version.

So how to collect data? We write what appear to be pytest tests but use the benchmark fixture provided by pytest-benchmarking to create benchmark executions of the functions instead of the correctness testing functions we wrote for correctness testing.

In Listing 1, pytest is imported on the assumption that the pytest-benchmark plugin is available. the module random is imported as a dataset has to be generated: data is a random collection of numbers in the range (0.0, 1.0), here 100,000 items are in the dataset. Since both mean function implementation are called mean, the alias feature of the import statement is used to distinguish the Numpy implementation and the pure Python implementation. Then there are the two test functions, which use the benchmark fixture to run the benchmarks of the functions using the dataset provided. (The output has been tabulated – see Table 1 – which omits some information but makes it much easier to read.)

import pytest

from random import random

from python_numpy import mean as mean_numpy
from python_pure_corrected import mean as \
  mean_pure

data = [random() for _ in range(0, 100000)]

def test_mean_numpy(benchmark):
  benchmark(mean_numpy, data)

def test_mean_pure(benchmark):
  benchmark(mean_pure, data)

Listing 1

Name	test_mean_pure	time in µs	test_mean_numpy	time in µs
Min	856.063	1	4461.969	5.21
Max	1079.325	1	4731.414	4.38
Mean	872.459	1	4476.2517	5.13
StdDev	8.2584	1	22.6862	2.75
Median	871.892	1	4472.543	5.13
IQR	1.4317	1	9.9422	6.94
Outliers*	45;342		7;10
Rounds	1055		165
Iterations	1		1

* Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.

Table 1

Well, that was unexpected ² . It appears that we can conclude that the Numpy version is 5-ish times slower given the mean and median values of the time taken to execute the function.

Oh come on, this cannot be…

So we have actual data that pure Python is much faster at calculating means on datasets than Numpy mean. It’s reproducible, and thus incontrovertible. The original hypothesis is disproved ³ .

Yet our expectation remains that pure Python is interpreted Python bytecodes, and thus slow, whereas Numpy is implemented in C and thus fast. It is unbelievable, yet provably true, that pure Python is faster than Numpy.

Ah, but… the input data was a Python list, Numpy works on Numpy arrays. Mayhap we can unpack this idea with a new benchmark. Listing 2 adds constructing a Numpy array form of the data and then running the Numpy mean function on the array. Note we keep the two original tests (see Table 2).

import pytest

from numpy import array
from random import random

from python_numpy import mean as mean_numpy
from python_pure_corrected import mean as \
  mean_pure

data = [random() for _ in range(0, 100000)]
data_as_array = array(data)

def test_mean_numpy_list(benchmark):
  benchmark(mean_numpy, data)
def test_mean_numpy_array(benchmark):
  benchmark(mean_numpy, data_as_array)
def test_mean_pure(benchmark):
  benchmark(mean_pure, data)

Listing 2

Name	test_mean_numpy_array	time in µs	test_mean_pure	time in µs	test_mean_numpy_list	time in µs
Min	100.484	1	841.832	8.38	4572.266	45.5
Max	153.946	1	945.04	6.14	5019.503	32.61
Mean	101.8516	1	848.8424	8.33	4626.392	45.42
StdDev	3.5793	1	9.227	2.58	49.3561	13.79
Median	101.004	1	846.704	8.38	4623.92	45.78
IRQ	0.406	1	0.9293	2.29	44.3635	109.28
Outliers*	324;417		62;308		26;5
Rounds	3314		1011		175
Iterations	1		1		1

* Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.

Table 2

OK, now this looks more like it: pure Python on a Python list is 8-ish times slower than Numpy on a Numpy array. Numpy mean on Python list remains as slow as previously.

We have a new Quasi-Hypothesis

Passing Python data structures to Numpy functions is a really bad idea, thus the mean implementation of the Numpy version is a very poor implementation.

By making a Numpy function appear to be a pure Python function, the conversion of the Python list to a Numpy array is hidden. This conversion clearly has performance implications and so must be made explicit in performance sensitive code. Essentially, when using Numpy, always use Numpy data structures: do not make it convert Python ones.

Proper benchmarking brought this to light. Now to change all the code to avoid the problem. And then benchmark again to make sure the suppositions and hypotheses are vindicated.

Endnote

I have glossed over many important points of data collection, samples, statistics, and normality, in this article – the goal being to enthuse people into using benchmarking frameworks for gathering real, reproducible data to back up claims of performance. Using a framework such as pytestbenchmark has some assumptions built in that arguably may not be formally valid, and so the results presented may not be ‘correct’ in a formal statistical sense. However, the framework gives us data on our code’s performance that is valid enough for us to make deductions, inferences and corrections. Thus it is an incredibly useful tool for ‘rough and ready’ performance checking.

Oh that all programming languages had such useful benchmarking frameworks. Some have, e.g. Python and Java. This is an interesting article on benchmarking frameworks for C++ [Filipek16] .

References

[Filipek16] Micro benchmarking libraries for C++. http://www.bfilipek.com/2016/01/microbenchmarking-libraries-for-c.html

[Hypothesis] http://hypothesis.works/

[pytest] http://pytest.org/

[pytest-benchmark] http://pytest-benchmark.readthedocs.io/en/stable/ https://github.com/ionelmc/pytest-benchmark

[Winder17] ‘Mean Properties’, Overload 137, February 2017. https://accu.org/var/uploads/journals/Overload137.pdf https://accu.org/index.php/journals/2340

OK, as from the second half of 2016 in the new ‘post-truth’ world, the norm in society in general is for things stated to be true to be true, but we must not let software development and programming be polluted by this unscientific posturing.
No-one expects the Spanish Inquisition. It’s a Python thing, a Monty Python thing. https://en.wikipedia.org/wiki/ The_Spanish_Inquisition_(Monty_Python) https://www.youtube.com/watch?v=7WJXHY2OXGE
Technically we need to do some analysis of variance and significance testing to make this statement. However, for the purposes of this article, I shall gloss over this point, even though it pains me to do so: ANOVA, F-tests and t-tests are so much fun.