Testing Propositions

Is testing propositions more important than having examples as exemplars? Russel Winder considers this hypothesis.

With the rise of test-driven development (TDD) in the 1990s as part of the eXtreme Programming (XP) movement, the role of example-based testing became fixed into the culture of software development¹. The original idea was to drive development of software products based on examples of usage of the product by end users. To support this Kent Beck and others at the centre of the XP community created test frameworks. They called them unit test frameworks presumably because they were being used to test the units of code that were being constructed. This all seemed to work very well for the people who had been in on the start of this new way of developing software. But then XP became well-known and fashionable: programmers other than the original cabal began to claim they were doing XP and its tool TDD. Some of them even bought the books written by Kent Beck and others at the centre of the XP community. Some of them even read said books.

Labels are very important. The test frameworks were labelled unit test frameworks. As all programmers know, units are functions, procedures, subroutines, classes, modules: the units of compilation. (Interpreted languages have much the same structure despite not being compiled per se.) Unit tests are thus about testing the units, and the tools for this are unit test frameworks. Somewhere along the line, connection between these tests and the end user scenarios got lost. Testing became an introvert thing. The whole notion of functional testing and ‘end to end’ testing seemed to get lost because the label for the frameworks were ‘unit test’.

After a period of frustration with the lack of connection between end user scenarios and tests, some people developed the idea of acceptance testing so as to create frameworks and workflows. (Acceptance testing has been an integral part of most engineering disciplines for centuries; it took software development a while to regenerate the ideas.) FitNesse [FitNesse] and Robot [Robot] are examples of the sort of framework that came out of this period.

However the distance between acceptance testing and unit testing was still a yawning chasm². Then we get a new entrant into the game, behaviour-driven development (BDD). This was an attempt by Dan North and others to recreate the way of using tests during development. The TDD of XP had lost its meaning to far too many programmers, so the testing frameworks for BDD were called JBehave, Cucumber, etc. and had no concept of unit even remotely associated with them.

Now whilst BDD reasserted the need for programmers and software developers to be aware of end user scenarios and at least pretend to care about user experience whilst implementing systems, we ended up with even more layers of tests and test frameworks.

And then came QuickCheck [QuickCheck], and the world of test was really shaken up: the term ‘property-based testing’ became a thing.

QuickCheck [Hackage] first appeared in work by John Hughes and others in the early 2000s. It started life in the Haskell [Haskell] community but has during the 2010s spread rapidly into the milieus of any programming language that even remotely cares about having good tests.

Example required

Waffling on textually is all very well, but what we really need is code; examples are what exemplify the points, exemplars are what we need. At this point it seems entirely appropriate to make some reuse, which, as is sadly traditional in software development, is achieved by cut and paste. So I have cut and paste³ the following from a previous article for Overload [Winder16]:

For this we need some code that needs testing: code that is small enough to fit on the pages of this august journal, but which highlights some critical features of the test frameworks.

We need an example that requires testing, but that gets out of the way of the testing code because it is so trivial.

We need factorial.

Factorial is a classic example usually of the imperative vs. functional way of programming, and so is beloved of teachers of first year undergraduate programming courses. I like this example though because it allows investigating techniques of testing, and allows comparison of test frameworks.

Factorial is usually presented via the recurrence relation:

f₀ = 1

f_n = nf_n-1

This is a great example, not so much for showing software development or algorithms, but for showing testing⁴, and the frameworks provided by each programming language.

Given the Haskell heritage of property-based testing, it seems only right, and proper, to use Haskell for the first example. (It is assumed that GHC 7.10 or later (or equivalent) is being used.)

Haskell implementation…

There are many algorithms for realizing the Factorial function: iterative, naïve recursive, and tail recursive are the most obvious. So as we see in Listing 1 we have three realizations of the Factorial function. Each of the functions starts with a type signature followed by the implementation. The type signature is arguably redundant since the compiler deduces all types. However, it seems to be idiomatic to have the type signature, not only as documentation, but also as a check that the function implementation is consistent with the stated signature. Note that in Haskell there are no function call parentheses – parentheses are used to ensure correct evaluation of expressions as positional arguments to function calls. It is also important to note that in Haskell functions are always curried: a function of two parameters is actually a function of one parameter that returns a function of one parameter. Why do this? It makes it really easy to partially evaluate functions to create other functions. The code of Listing 1 doesn’t make use of this, but we will be using this feature shortly.

module Factorial(iterative, naïveRecursive,
  tailRecursive) where
exceptionErrorMessage = "Factorial not defined for negative integers."

iterative :: Integer -> Integer
iterative n
    | n < 0 = error exceptionErrorMessage
    | otherwise = product [1..n]

naïveRecursive :: Integer -> Integer
naïveRecursive n
    | n < 0 = error exceptionErrorMessage
    | n == 0 = 1
    | otherwise = n * naïveRecursive (n - 1)

tailRecursive :: Integer -> Integer
tailRecursive n
    | n < 0 = error exceptionErrorMessage
    | otherwise = iteration n 1
    where
      iteration 0 result = result
      iteration i result = iteration (i - 1)
        (result * i)

Listing 1

The iterative and naïveRecursive implementations are just matches with an expression: each match starts with a | and is an expression of Boolean value then a = followed by the result expression to evaluate for that match expression. Matches are tried in order and otherwise is the ‘catch all’ “Boolean” that always succeeds; it should, of course, be the last in the sequence. The error function raises an exception to be handled elsewhere. The tailRecursive function has a match and also a ‘where clause’ which defines the function iteration by pattern matching on the parameters. The ‘where clause’ definitions are scoped to the function of definition⁵,⁶.

…and example-based test

Kent Beck style TDD started in Smalltalk with sUnit⁷ and then transferred to Java with JUnit⁸. A (thankfully fading) tradition seems to have grown that the first test framework in any language is constructed in the JUnit3 architecture – even if this architecture is entirely unsuitable, and indeed not idiomatic, for the programming language. Haskell seem to have neatly side-stepped the problem from the outset since although the name is HUnit [HUnit] as required by the tradition, the architecture is nothing at all like JUnit3. Trying to create the JUnit3 architecture in Haskell would have been hard and definitely not idiomatic, HUnit is definitely idiomatic Haskell.

Listing 2 shows the beginnings of a test using a table driven (aka data driven) approach. It seems silly to have to write a new function for each test case, hence the use of a table (positiveData) to hold the inputs and outputs and create all the tests with a generator (testPositive, a function of two parameters, the function to test and a string unique to the function so as to identify it). The function test takes a list argument with all the tests, here the list is being constructed with a list comprehension: the bit before the | is the value to calculate in each case (a fairly arcane expression, but lets not get too het up about it) and the expression after is the ‘loop’ that drives the creation of the different values, in this case create a list entry for each pair in the table. Then we have a sequence (thanks to the do expression⁹) of three calls to runTestTT (a function of one parameter) which actually runs all the tests.

module Main where

import Test.HUnit

import Factorial

positiveData = [
 (0, 1),
 (1, 1),
 (2, 2),
 (3, 6),
 (4, 24),
 (5, 120),
 (6, 720),
 (7, 5040),
 (8, 40320),
 (9, 362880),
 (10, 3628800),
 (11, 39916800),
 (12, 479001600),
 (13, 6227020800),
 (14, 87178291200),
 (20, 2432902008176640000),
 (30, 265252859812191058636308480000000),
 (40, 815915283247897734345611269596115894272000000000)
]

testPositive function comment =
    test [comment ++ " " ++ show i ~: "" ~:
    expected ~=? function i |
    (i, expected) <- positiveData]

main = do
 runTestTT (testPositive Factorial.iterative
   "Iterative")
 runTestTT (testPositive Factorial.naïveRecursive
   "Naïve Recursive")
 runTestTT (testPositive Factorial.tailRecursive
   "Tail Recursive")

Listing 2

Of course, anyone saying to themselves “but he hasn’t tested negative values for the arguments of the Factorial functions”, you are not being silly; you are being far from silly, very sensible in fact. I am avoiding this aspect of the testing here simply to avoid some Haskell code complexity¹⁰ that adds nothing to the flow in this article. If I had used Python or Java (or, indeed, almost any language other than Haskell) we would not have this issue. For those wishing to see the detail of a full test please see my Factorial repository on GitHub [Winder].

And the proposition is…

The code of Listing 2 nicely shows that what we are doing is selecting values from the domain of the function and ensuring the result of executing the function is the correct value from the image of the function¹¹. This is really rather an important thing to do but are we doing it effectively?

Clearly to prove the implementation is correct we have to execute the code under test with every possible value of the domain. Given there are roughly 2⁶⁴ (about 18,446,744,073,709,551,616) possible values to test on a 64-bit machine, we will almost certainly decide to give up immediately, or at least within just a few femtoseconds. The test code as shown in Listing 2 is sampling the domain in an attempt to give us confidence that our implementation is not wrong. Have we done that here? Are we satisfied? Possibly yes, but could we do more quickly and easily?

The proposition of proposition-based testing is to make propositions about the code and then use random selection of values from the domain to check the propositions are not invalid. In this case of testing the Factorial function, what are the propositions? Factorial is defined by a recurrence relation comprising two rules. These rules describe the property of the function that is Factorial with respect to the domain, the non-negative integers. If we encode the recurrence relation as a predicate (a Boolean valued function) we have a representation of the property that can be tested by random selection of non-negative integers.

Listing 3 shows a QuickCheck test of Factorial. The function f_p is the predicate representing the property being tested. It is a function of two parameters, a function to test and a value to test, with the result being whether the recurrence relation that defines Factorial is true for that value and that function: the predicate is an assertion of the property that any function claiming to implement the Factorial function must satisfy. Why is this not being used directly, but instead factorial_property is the predicate being tested by the calls to quickCheck? It is all about types and the fact that values are automatically generated for us based on the domain of the property being tested. f_p is a predicate dealing with Integer, the domain of the functions being tested, values of which can be negative. Factorial is undefined for negative values¹². So the predicate called by quickCheck, factorial_property, is defined with Natural as the domain, i.e. for non-negative integers¹³. So when we execute quickCheck on the function under test, it is non-negative integer values that are generated: The predicate never needs to deal with negative values, it tests just the Factorial proposition and worries not about handling the exceptions that the implementations raise on being given a negative argument. Should we test for negative arguments and that an exception is generated? Probably. Did I mention ignoring this for now?

module Main where

import Numeric.Natural
import Test.QuickCheck

import Factorial

f_p :: (Integer -> Integer) -> Integer -> Bool
f_p f n
    | n == 0 = f n == 1
    | otherwise = f n == n * f (n - 1)

factorial_property :: (Integer -> Integer) ->
  Natural -> Bool
factorial_property f n = f_p f (fromIntegral n)

main :: IO()
main = do
  quickCheck (factorial_property iterative)
  quickCheck (factorial_property naïveRecursive)
  quickCheck (factorial_property tailRecursive)

Listing 3

Earlier I mentioned currying and partial evaluation. In Listing 3, we are seeing this in action. The argument to each quickCheck call is an expression that partially evaluates factorial_property, binding a particular implementation of Factorial, and returning a function that takes only a Natural value. This sort of partial evaluation is a typical and idiomatic technique of functional programming, and increasingly any language that supports functions as first class entities.

By default QuickCheck selects 100 values from the domain, so Listing 3 is actually 300 tests. In the case we have here there are no fails, all 300 tests pass. Somewhat splendidly, if there is a failure of a proposition, QuickCheck sets about ‘shrinking’ which means searching for the smallest value in the domain for which the proposition fails to hold. Many people are implementing some form of proposition testing in many languages. Any not having shrinking are generally seen as being not production ready. Shrinking is such a boon to taking the results of the tests and deducing (or more usually inferring) the cause of the problem, that it is now seen as essential for any property-based testing framework.

Figure 1 shows the result of running the two test programs: first the HUnit example based testing – 18 hand picked tests for each of the three implementations; and second the QuickCheck property-based testing – 100 tests for each case, all passing so no need for shrinking.

$ ./factorial_test_hunit
Cases: 18  Tried: 18  Errors: 0  Failures: 0
Cases: 18  Tried: 18  Errors: 0  Failures: 0
Cases: 18  Tried: 18  Errors: 0  Failures: 0

$ ./factorial_test_quickcheck
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.

Figure 1

But who uses Haskell?

Well, quite a lot of people. However, one of the major goals of Haskell is to ‘Avoid success at all costs’¹⁴. The point here is not un-sensible. Haskell is a language for exploring and extending ideas and principles of functional programming. The Haskell committee therefore needs to avoid having to worry about backward compatibility. This puts it a bit at odds with many commercial and industrial operations who feel that, once written, a line of code should compile (if that is appropriate) and execute exactly the same for all time without any change. Clearly this can be achieved easily in any language by never upgrading the toolchain. However, the organizations that demand code works for all time usually demand that toolchains are regularly updated. (Otherwise the language is considered dead and unusable. There is irony in here somewhere I believe.) There is no pleasing some people. Successful languages in the sense of having many users clearly have to deal with backward compatibility. Haskell doesn’t. Thus Haskell, whilst being a very important language, doesn’t really have much market traction.

Frege makes an entry

Frege [Frege] though is actually likely to get more traction than Haskell. Despite the potential for having to update codebases, using ‘Haskell on the JVM’ is an excellent way of creating JVM-based systems. And because the JVM is a polyglot platform, bits of systems can be in Java, Frege, Kotlin [Kotlin], Ceylon [Ceylon], Scala [Scala], Apache Groovy [Groovy], etc. For anyone out there using the Java Platform, I can strongly recommend at least trying Frege. To give you a taste, look at Listing 4, which shows three Frege implementations of the Factorial function, and that Frege really is Haskell. The tests (see Listing 5) are slightly different from the Haskell ones not because the languages are different but because the context is: instead of creating a standalone executable as happens with Haskell, Frege create a JVM class to be managed by a test runner. So instead of a main function calling the test executor, we just declare property instances for running using the property function, and assume the test runner will do the right thing when invoked. The three examples here show a different way of constraining the test domain to non-negative integers than we saw with Haskell. Function composition ( . operator, must have spaces either side to distinguish it from member selection) of the property function (using partial evaluation) with a test data generator (NonNegative.getNonNegative; dot as selector not function composition) shows how easy all this can be. Instead of just using the default generator (which would be Integer for this property function factorial_property, we are providing an explicit generator so as to condition the values from the domain that get generated.

module Factorial where

exceptionErrorMessage = "Factorial not defined for negative integers."

iterative :: Integer -> Integer
iterative n
    | n < 0 = error exceptionErrorMessage
    | otherwise = product [1..n]

naïveRecursive :: Integer -> Integer
naïveRecursive n
    | n < 0 = error exceptionErrorMessage
    | n == 0 = 1
    | otherwise = n * naïveRecursive (n - 1)

tailRecursive :: Integer -> Integer
tailRecursive n
    | n < 0 = error exceptionErrorMessage
    | otherwise = iteration n  1
    where
      iteration 0 result = result
      iteration i result = iteration (i - 1)
        (result * i)

Listing 4

module Factorial_Test where

import Test.QuickCheck(quickCheck, property)
import Test.QuickCheckModifiers(NonNegative)

import Factorial(iterative, naïveRecursive,
  tailRecursive)

factorial_property :: (Integer -> Integer) 
  -> Integer -> Bool
factorial_property f n
    | n == 0 = f n == 1
    | otherwise = f n == n * f (n - 1)

factorial_iterative_property = 
  property ((factorial_property iterative) 
  . NonNegative.getNonNegative)
factorial_naïveRecursive_property = 
  property ((factorial_property naïveRecursive) 
  . NonNegative.getNonNegative)
factorial_tailRecursive_property = 
  property ((factorial_property tailRecursive) 
  . NonNegative.getNonNegative)

Listing 5

The result of executing the Frege QuickCheck property-based tests are seen in Figure 2. As with the Haskell, 100 samples for each test with no fails and so no shrinking.

Factorial_Test.factorial_tailRecursive_property:
  +++ OK, passed 100 tests.
Factorial_Test.factorial_iterative_property:
  +++ OK, passed 100 tests.
Factorial_Test.factorial_naïveRecursive_property:
  +++ OK, passed 100 tests.
Properties passed: 3, failed: 0

Figure 2

But…

With Haskell trying not to have a user base reliant on backward compatibility, and Frege not yet having quite enough traction as yet to be deemed popular, it behoves us to consider the proposition of proposition testing in one or more languages that have already gained real traction.

First off let us consider…Python.

Let’s hypothesize Python

Python [Python_1] has been around since the late 1980s and early 1990s. During the 2000s it rapidly gained in popularity. And then there was the ‘Python 2 / Python 3 Schism’.¹⁵ After Python 3.3 was released, there were no excuses for staying with Python 2. (Well, except two – and I leave it as an exercise for the reader to ascertain what these two genuine reasons are for not immediately moving your Python 2 code to Python 3.) For myself, I use Python 3.5 because Python now has function signature type checking [Python_2]¹⁶.

Listing 6 shows four implementations of the Factorial function. Note that the function signatures are advisory not strong type checking. Using the MyPy [MyPy] program the types will be checked, but on execution it is just standard Python as people have known for decades.

from functools import reduce
from operator import mul

def _validate(x: int) -> None:
  if not isinstance(x, int):
    raise TypeError('Argument must be an integer.')
    if x < 0:
      raise ValueError('Argument must be a non-negative integer.')

def iterative(x: int) ->int:
  _validate(x)
  if x < 2:
    return 1
  total = 1
  for i in range(2, x + 1):
    total *= i
  return total

def recursive(x: int) -> int:
  _validate(x)
  return 1 if x < 2 else x * recursive(x - 1)

def tail_recursive(x: int) -> int:
  _validate(x)
  if x < 2:
    return 1
  else:
    def iterate(i: int, result: int=1):
      return result if i < 2 else iterate(i - 1,
        result * i)
    return iterate(x)

def using_reduce(x: int) -> int:
    _validate(x)
    return 1 if x < 2 else reduce(mul, 
      range(2, x + 1))

Listing 6

I suspect the Python code here is sufficiently straightforward that almost all programmers¹⁷ will be able to deduce or infer any meanings that are not immediately clear in the code. But a few comments to help: the range function generates a range ‘from up to but not including’. The if expression is of the form:

  <true-value> if <boolean-expression>
    else <false-value>

The nested function iterate in tail_recursive is scoped to the else block.

But are these implementations ‘correct’? To test them let’s use PyTest [Pytest]. The test framework that comes as standard with Python (unittest, aka PyUnit) could do the job, but PyTest is just better¹⁸. PyTest provides an excellent base for testing but it does not have property-based testing. For this we will use Hypothesis [Hypothesis] (which can be used with PyUnit as easily as with PyTest, but PyTest is just better).

Listing 7 shows a fairly comprehensive test – not only are we testing non-negative and negative integers, we also test other forms of error that are possible in Python. Tests are functions with the first four characters of the name being t, e, s, t. Very JUnit3, and yet these are module-level functions. There are no classes or inheritance in sight: that would be the PyUnit way. The PyTest way is to dispense with the classes as necessary infrastructure, swapping them for needing some infrastructure to be imported in some way or other. (This is all handled behind the scenes when pytest.main executes.) PyTest is in so many ways more Pythonic¹⁹ than PyUnit.

from pytest import mark, raises

from hypothesis import given
from hypothesis.strategies import (integers,
  floats, text)

from factorial import (iterative, recursive,
  tail_recursive, using_reduce)

algorithms = (iterative, using_reduce, recursive,
  tail_recursive)

@mark.parametrize('a', algorithms)
@given(integers(min_value=0, max_value=900))
def test_with_non_negative_integer (a, x):
  assert a(x) == (1 if x == 0 else x * a(x - 1))

@mark.parametrize('a', algorithms)
@given(integers(max_value=-1))
def test_negative_integer_causes_ValueError(a, x):
  with raises(ValueError):
    a(x)

@mark.parametrize('a', algorithms)
@given(floats())
def test_float_causes_TypeError(a, x):
  with raises(TypeError):
    a(x)

@mark.parametrize('a', algorithms)
def test_none_causes_TypeError(a):
  with raises(TypeError):
    a(None)

@mark.parametrize('a', algorithms)
@given(text())
def test_string_causes_TypeError(a, x):
    with raises(TypeError):
        a(x)

if __name__ == '__main__':
  from pytest import main
  main()

Listing 7

PyTest has the @mark.parametrize decorator that rewrites your code so as to have one test per item of data in an iterable. In all the cases here, it is being used to generate tests for each algorithm²⁰.

The @given decorator, which comes from Hypothesis, does not rewrite functions to create new test functions. Instead it generates code to run the function it decorates with a number (the default is 100) of randomly chosen values using the generator given as argument to the decorator, recording the results to report back. This automated data generation is at the heart of property-based testing, and Hypothesis, via the supporting functions such as integers, floats, and text (for generating integers, floats, and string respectively), does this very well. Notice how it is so easy to generate just negative integers or just non-negative integers. Also note the use of the ‘with statement’²¹ and the raises function for testing that code does, in fact, raise an exception.

All the test functions have a parameter a that gets bound by the action of the @mark.parametrize decorator, and a parameter x that gets bound by the action of the @given decorator. This is all very different from the partial evaluation used in Haskell and Frege: different language features lead to different idioms to achieve the same goal. What is Pythonic is not Haskellic/Fregic, and vice versa. At least not necessarily.

The pytest.main function, when executed, causes all the decorators to undertake their work and executes the result. The output from an execution will look very much as in Figure 3. You may find when you try this that the last line is green.²²

============================= test session starts ==============================
platform linux -- Python 3.5.1, pytest-2.9.2, py-1.4.31, pluggy-0.3.1
rootdir: /home/users/russel/Docs/Papers/ACCU/Draft/TestingPropositions/SourceCode/Python, inifile:
plugins: hypothesis-3.4.0, cov-2.2.1
collected 20 items

test_factorial.py ....................

========================== 20 passed in 2.32 seconds ===========================

Figure 3

Doing the C++ thing

There are many other example languages we could present here to show the almost complete coverage of property-based testing in the world: Kotlin [Kotlin], Ceylon [Ceylon], Scala [Scala], Apache Groovy [Groovy], Rust [Rust], D [D], Go [Go],… However, given this is an August²³ ACCU journal and, historically at least, ACCU members have had a strong interest in C++, we should perhaps look at C++. Clearly people could just use Haskell and QuickCheck to test their C++ code, but let’s be realistic here, that isn’t going to happen²⁴. So what about QuickCheck in C++? There are a number of implementations, for example CppQuickCheck [QuickCheck_2] and QuickCheck++ [QuickCheck_3]. I am, though, going to use RapidCheck [RapidCheck] here because it seems like the most sophisticated and simplest to use of the ones I have looked at to date²⁵.

There is one thing we have to note straight away: Factorial values are big²⁶. Factorial of 30 is a number bigger than can be stored in a 64-bit integer. So all the implementations of Factorial used in books and first year student exercises are a bit of a farce because they are shown using hardware integers: the implementations work for arguments [0..20] and then things get worrisome. “But this is true for all languages and we didn’t raise this issue for Haskell, Frege and Python.” you say. Well for Haskell (and Frege, since Frege is just Haskell on the JVM) the Int type is a hardware number but Integer, the type used in the Haskell and Frege code, is an integer type the values of which can be effectively arbitrary size. There is a limit, but then in the end even the universe is finite²⁷. What about Python? The Python²⁸ int type uses hardware when it can or an unbounded (albeit finite²⁷) integer when it cannot. What about C++? Well the language and standard library have only hardware-based types, which could be taken as rather restricting. GNU has however conveniently created a C library for unbounded (albeit finite²⁷) integers, and it has a rather splendid C++ binding [GNU].

So using the GMP C++ API, we can construct implementations of the Factorial function that are not restricted to arguments in the range [0..20] but are more generally useful. Listing 8 shows the functions being exported by the Factorial namespace. We could dispense with the long overloads, but it seems more programmer friendly to offer them.

#include <gmpxx.h>

namespace Factorial {

mpz_class iterative(mpz_class const n);
mpz_class iterative(long const n);
mpz_class reductive(mpz_class const n);
mpz_class reductive(long const n);
mpz_class naive_recursive(mpz_class const n);
mpz_class naive_recursive(long const n);
mpz_class tail_recursive(mpz_class const n);
mpz_class tail_recursive(long const n);

} // namespace Factorial

Listing 8

Listing 9 presents the implementations. I suspect that unless you already know C++ (this code is C++14) you have already moved on. So any form of explanatory note is effectively useless here.²⁹ We will note though that there is a class defined in there as well as implementations of the Factorial function.

#include "factorial.hpp"

#include <functional>
#include <iterator>
#include <numeric>

namespace Factorial {
static void validate(mpz_class const n) {
  if (n < 0) {
    throw std::invalid_argument("Parameter must be a non-negative integer."); }
}
auto const one = mpz_class(1);
auto const two = mpz_class(2);

mpz_class iterative(mpz_class const n) {
  validate(n);
  mpz_class total {1};
  for (unsigned int i = 2; i <= n; ++i) {
    total *= i; }
  return total;
}
mpz_class iterative(long const n) { 
  return iterative(mpz_class(n)); }

class mpz_class_iterator:
  std::iterator<std::input_iterator_tag,
  mpz_class> {
  private:
    mpz_class value;
  public:
    mpz_class_iterator(mpz_class const v) :
      value(v) { }
    mpz_class_iterator& operator++() {
      value += 1; return *this; }
    mpz_class_iterator operator++(int) {
      mpz_class_iterator tmp {
        *this}; ++*this; return tmp; }
      bool operator==(mpz_class_iterator const &
        other) const {
        return value == other.value; }
      bool operator!=(mpz_class_iterator const &
        other) const {
        return value != other.value; }
    mpz_class operator*() const { return  value; }
    mpz_class const * operator->() const {
      return &value; }
};

mpz_class reductive(mpz_class const n) {
  validate(n);
  return (n < 2)
  ? one
  : std::accumulate(mpz_class_iterator(two),
    mpz_class_iterator(n + 1), one,
      std::multiplies<>());}
mpz_class reductive(long const n) { 
  return reductive(mpz_class(n)); }
mpz_class naive_recursive(mpz_class const n) {
  validate(n);
  return (n < 2) ? one :
    n * naive_recursive(n - 1);
}
mpz_class naive_recursive(long const n) { 
  return naive_recursive(mpz_class(n)); }

static mpz_class tail_recursive_iterate
  (mpz_class const n, mpz_class const result) {
  return (n < 2) ? result :
    tail_recursive_iterate(n - 1, result * n);
}

mpz_class tail_recursive(mpz_class const n) {
  validate(n);
  return (n < 2) ? one : tail_recursive_iterate(n,
    one);
}
mpz_class tail_recursive(long const n) {
  return tail_recursive(mpz_class(n)); }
} // namespace Factorial

Listing 9

Listing 10 presents the RapidCheck-based test code for the Factorial functions. There is a vector of function pointers³⁰ so that we can easily iterate over the different implementations. Within the loop we have a sequence of the propositions. Each check has a descriptive string and a lambda function. The type of variables to the lambda function will cause (by default 100) values of that type to be created and the lambda executed for each of them. You can have any number of parameters – zero has been chosen here, which might seem a bit strange at first, but think generating random integers. Some of them are negative and some non-negative and we have to be careful to separate these cases as the propositions are so very different. Also some of the calculation for non-negative integers will result in big values. The factorial of a big number is stonkingly big. Evaluation will take a while… a long while… a very long while… so long we will have read War and Peace… a large number of times. So we restrict the integers of the domain sample by using an explicit generator. In this case for the non-negative integers we sample from [0..900]. For the negative integers we sample from a wider range as there should only ever be a very rapid exception raised, there should never actually be a calculation.

#include "rapidcheck.h"

#include <string>
#include <utility>

#include "factorial.hpp"

std::vector<std::pair<mpz_class (*)(long const), std::string>> const algorithms {
  {Factorial::iterative, "iterative"},
  {Factorial::reductive, "reductive"},
  {Factorial::naive_recursive, "naïve recursive"},
  {Factorial::tail_recursive, "tail recursive"}
};

int main() {
  for (auto && a: algorithms) {
    auto f = a.first;

    rc::check(a.second + " applied to non-negative
      integer argument obeys the recurrence
      relation.", [f]() {
      auto i = *rc::gen::inRange(0, 900);
      RC_ASSERT(f(i) == ((i == 0) ? mpz_class(1) :
        i * f(i - 1)));
    });

    rc::check(a.second + " applied to negative
       integer raises an exception.", [f]() {
      auto i = *rc::gen::inRange(-100000, -1);
      RC_ASSERT_THROWS_AS(f(i),
        std::invalid_argument);
    });
  }
  return 0;
}

Listing 10

So that is the Factorial functions themselves tested. I trust you agree that what we have here is a very quick, easy, and providing good coverage test. But, you ask, what about that class? Should we test the class? An interesting question. Many would say “No” because it is internal stuff, not exposed as part of the API. This works for me: why test anything that is not observable from outside. Others will say “Yes” mostly because it cannot hurt. For this article I say “Yes” because it provides another example of proposition-based testing. We do not test any examples, we test only properties of the class and its member functions. See Listing 11. By testing the properties, we are getting as close to proving the implementation not wrong as it is possible to get in an easily maintainable way. QED.

#include "rapidcheck.h"
#include "factorial.cpp"

int main() {
  using namespace Factorial;

  rc::check("value of operator delivers the right
    value", [](int i) {
      RC_ASSERT(*mpz_class_iterator{i} == i);
    });

  rc::check("pointer operator delivers the right
    value", [](int i) {
      RC_ASSERT(mpz_class_iterator{i}->get_si() 
        == i);
    });

  rc::check("equality is value not identity.",
    [](int i) {
      RC_ASSERT(mpz_class_iterator{i} 
        == mpz_class_iterator{i});
    });

  rc::check("inequality is value not identity.",
    [](int i, int j) {
      RC_PRE(j != 0);
      RC_ASSERT(mpz_class_iterator{i} 
        != mpz_class_iterator{i + j});
    });
  rc::check("preincrement does in fact increment",
    [](int i) {
      RC_ASSERT(++mpz_class_iterator{i} 
        == mpz_class_iterator{i + 1});
    });

  rc::check("postincrement does in fact
    increment", [](int i) {
      RC_ASSERT(mpz_class_iterator{i}++ 
        == mpz_class_iterator{i});
    });

  rc::check("value of preincrement returns correct
    value",  [](int i) {
      RC_ASSERT(*++mpz_class_iterator{i} 
        == i + 1);
  });

  rc::check("value of postincrement returns
    correct value",  [](int i) {
      RC_ASSERT(*mpz_class_iterator{i}++ == i);
  });
}

Listing 11

And to prove that point, see Figure 4, which shows the Factorial tests and class test executed. So many useful (passing) tests, so little effort.

$ ./test_factorial
Using configuration: seed=10731500115167123548

- iterative applied to non-negative integer argument obeys the recurrence relation.
OK, passed 100 tests

- iterative applied to negative integer raises an exception.
OK, passed 100 tests

- reductive applied to non-negative integer argument obeys the recurrence relation.
OK, passed 100 tests

- reductive applied to negative integer raises an exception.
OK, passed 100 tests

- naïve recursive applied to non-negative integer argument obeys the recurrence relation.
OK, passed 100 tests

- naïve recursive applied to negative integer raises an exception.
OK, passed 100 tests

- tail recursive applied to non-negative integer argument obeys the recurrence relation.
OK, passed 100 tests

- tail recursive applied to negative integer raises an exception.
OK, passed 100 tests

$ ./test_mpz_class_iterator
Using configuration: seed=9168594634932513587

- value of operator delivers the right value
OK, passed 100 tests

- pointer operator delivers the right value
OK, passed 100 tests

- equality is value not identity.
OK, passed 100 tests

- inequality is value not identity.
OK, passed 100 tests

- preincrement does in fact increment
OK, passed 100 tests

Figure 4

The message

Example-based testing of a sample from the domain tells us we are calculating the correct value(s). Proposition-based testing tells us that our code realizes the relationships that should exist between different values from the domain. They actually tell us slightly different things and so arguably good tests do both, not one or the other. However if we have chosen the properties to test correctly then zero, one, or two examples are likely to be sufficient to ‘prove’ the code not incorrect. Hypothesis, for example, provides an @example decorator for adding those few examples. For other frameworks in other languages we can just add one or two example-based tests to the property-based tests.

But, some will say, don’t (example-based) tests provide examples of use? Well yes, sort of. I suggest that these examples of use should be in the documentation, that users should not have to descend to reading the tests. So for me property-based testing (with as few examples as needed) is the future of testing. Examples and exemplars should be in the documentation. You do write documentation, don’t you…

An apology

Having just ranted about documentation, you may think I am being hypocritical since the code presented here has no comments. A priori, code without comments, at least documentation comments³¹, is a Bad Thing™ – all code should be properly documentation commented. All the code in the GitHub repository that holds the originals from which the code presented here were extracted is. So if you want to see the properly commented versions, feel free to visit https://github.com/russel/Factorial. If you find any improperly commented code, please feel free to nudge me about it and I will fix it post haste³².

Acknowledgements

Thanks to Fran Buontempo for being the editor of this august³³ journal, especially this August august journal³⁴, and letting me submit a wee bit late.

Thanks to Jonathan Wakely for not laughing too much when I showed him the original C++ code, and for making suggestions that made the code far more sensible.

Thanks to the unnamed reviewers who pointed out some infelicities of presentation as well as syntax. Almost all the syntactic changes have been made – I disagreed with a few. Hopefully the changes made to the content has fully addressed the presentation issues that were raised.

Thanks to all those people working on programming languages and test frameworks, and especially for those working on property-based testing features, without whom this article would have been a very great deal shorter.

References

[Catch] https://github.com/philsquared/Catch

[Ceylon] http://ceylon-lang.org/

[D] http://dlang.org/

[FitNesse] http://www.fitnesse.org/

[Frege] http://www.frege-lang.org or https://github.com/Frege/frege

[GNU] https://gmplib.org/, https://gmplib.org/manual/C_002b_002b-Interface-General.html

[Go] https://golang.org/

[Groovy] http://www.groovy-lang.org/

[Hackage] https://hackage.haskell.org/package/QuickCheck

[Haskell] https://www.haskell.org/

[HUnit] https://github.com/hspec/HUnit

[Hypothesis] http://hypothesis.works/,https://hypothesis.readthedocs.io/en/latest/,https://github.com/HypothesisWorks/hypothesis-python

[Kotlin] http://kotlinlang.org/

[MyPy] http://www.mypy-lang.org/

[Pytest] http://pytest.org/latest/

[Python_1] https://www.python.org/

[Python_2] https://www.python.org/dev/peps/pep-3107/, https://www.python.org/dev/peps/pep-0484/

[QuickCheck] https://en.wikipedia.org/wiki/QuickCheck

[QuickCheck_2] https://github.com/grogers0/CppQuickCheck

[QuickCheck_3] http://software.legiasoft.com/quickcheck/

[RapidCheck] https://github.com/emil-e/rapidcheck

[Robot] http://robotframework.org/

[Rust] https://www.rust-lang.org/

[Scala] http://www.scala-lang.org/

[Winder] The full Haskell example can be found at https://github.com/russel/Factorial/tree/master/Haskell.

[Winder16] Overload, 24(131):26–32, February 2016. There are PDF (https://accu.org/journals/overload/24/131/overload131.pdf#page=27) or HTML (https://accu.org/journals/overload/24/131/winder_2203/) versions available.

[Web Editor's Note: Some URLs have been changed from the original article.]

Footnotes

Into the culture of cultured developers, anyway.
Yes there is integration testing and system testing as well as unit testing and acceptance testing, and all this has been around in software, in principle at least, for decades, but only acceptance testing and unit testing had frameworks to support them. OK, technically FitNesse is an integration testing framework, but that wasn’t how it was being used, and not how it is now advertised and used.
Without the footnotes, so if you want those you’ll have to check the original. We should note though that unlike that article of this august journal, this is an August august journal issue, so very august.
OK so in this case this is unit testing, but we are creating APIs which are just units so unit testing is acceptance testing for all intents and purposes.
If you need a tutorial introduction to the Haskell programming language then http://learnyouahaskell.com/ and http://book.realworldhaskell.org/ are recommended.
If you work with the JVM and want to use Haskell, there is Frege; see http://www.frege-lang.org or https://github.com/Frege/frege Frege is a realization of Haskell on the JVM that allows a few extensions to Haskell so as to work harmoniously with the Java Platform.
The name really does give the game away that the framework was for unit testing.
Initially called JUnit, then when JUnit4 came out JUnit was renamed JUnit3 as by then it was at major version 3. Now of course we have JUnit5.
Yes it’s a monad. Apparently monads are difficult to understand, and when you do understand them, they are impossible to explain. This is perhaps an indicator of why there are so many tutorials about monads on the Web.
Involving Monads. Did I mention about how once you understand monads, you cannot explain them?
Pages such as https://en.wikipedia.org/wiki/Domain_of_a_function and https://en.wikipedia.org/wiki/Image_(mathematics) may be handy if you are unused to the terminology used here.
And also non-integral types, do not forget this in real testing.
If you are thinking we should be setting up a property to check that all negative integers result in an error, you are thinking on the right lines.
A phrase initially spoken by Simon Peyton Jones a number of years ago that caught on in the Haskell community.
We will leave any form of description and commentary on the schism to historians. As Python programmers, we use Python 3 and get on with programming.
This isn’t actually correct: Python allows function signatures as of 3.5 but doesn’t check them. You have to have to have a separate parser-type-checker such as MyPy. This is annoying, Python should be doing the checking.
We will resist the temptation to make some facetious, and likely offensive, comment about some programmers who use only one programming language and refuse to look at any others. “Resistance is futile.” Seven of Nine.
For reasons that may, or may not, become apparent in this article, but relate to PyUnit following JUnit3 architecture – remember the fading tradition – and PyTest being Pythonic.
See http://docs.python-guide.org/en/latest/writing/style/
There are ways of parameterizing tests in PyUnit (aka unittest), but it is left as an exercise for the reader to look for these. PyTest and @pytest.mark.parametrize are the way this author chooses to do parameterized tests in Python.
Context managers and the ‘with statement’ are Python’s way of doing RAII (resource acquisition is initialization, https://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization) amongst other great things.
Whilst this is an August august journal (and so very august), it is monochrome. So you will have to imagine the greenness of the test output. Either that or actually try the code out for yourself and observe the greenness first hand.
Or should that be august. Well actually it has to be both.
Not least because Haskell’s avowed aim is never to be successful.
Also it uses Catch [Catch] for its tests.
Factorials are big like space is big, think big in Hitchhiker’s Guide to the Galaxy terms: “Space,” it says, “is big. Really big. You just won’t believe how vastly, hugely, mindbogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. Listen…?” https://en.wikiquote.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy
Space may be big (see above) but the universe (space being the same thing as the universe as far as we know) is finite – assuming the current theories are correct.
Python 3 anyway. Python 2 has effectively the same behaviour, but with more types. It is left as an exercise for the reader whether to worry about this.
There was some thought of introducing the acronym RTFC (read the fine code), but this temptation was resisted. “Resistance is futile.” Seven of Nine.
Well, actually pairs, with the first being the function pointer and the second being a descriptive string.
Debating the usefulness or otherwise of non-documentation comments is left as an exercise for the readership.
And request Doctor Who or someone to perform appropriate time travel with the corrections so that the situation has never been the case.
And, indeed, August.
“This joke is getting silly, stop this joke immediately.” The Colonel.

Russel was a long-standing ACCU member who passed away earlier this year. We miss him and have reprinted this article in his memory, which was first published in Overload 134, August 2016.

Russel Winder Ex-theoretical physicist, ex-UNIX system programmer, ex-academic. Now an ex-independent consultant, ex-analyst, ex-author, ex-expert witness and ex-trainer. Was interested in all things parallel and concurrent. And build. Was actively involved with Groovy, GPars, GroovyFX, SCons, and Gant. Also Gradle, Ceylon, Kotlin, D and bit of Rust. And lots of Python especially Python-CSP..