ACCU Home page ACCU Conference Page
Search Contact us ACCU at Flickr ACCU at GitHib ACCU at Facebook ACCU at Linked-in ACCU at Twitter Skip Navigation

pinQuick Modular Calculations (Part 1)

Overload Journal #154 - December 2019 + Programming Topics   Author: Cassio Neri
Compilers are good at optimising modular calculations. Can we they do better? Cassio Neri shows they can.

Searching for ‘fast divisibility’ on Stack Overflow indicates that the question of whether hand-written C/C++ code can be faster than using the % operator fosters the curiosity of a few developers. Some answers basically state “Don’t do it. Trust your compiler.” I could not agree more with the principle. Compilers do a great job of optimising code and programmers should favour clarity. Nevertheless, this series shows ways to get improved performance when evaluating modular expressions. The intention here is not to ‘beat’ the compiler. On the contrary, this series is an open letter addressed to compiler writers presenting some algorithms that, potentially, could be incorporated into their product for the benefit of all programmers. Performance analysis shows that the alternatives discussed in this series are often faster than built-in implementations. With the exception of one algorithm which (to the best of my knowledge) is original, the others are not. Indeed, they have been covered by the classic Hacker’s Delight [Warren] for more than a couple of years. Nevertheless, major compilers still do not implemented them.

Warm up

We start looking at an example of the improvement that can be achieved. Figure 11 graphs the time it takes to check whether each element of an array of 65536 uniformly distributed unsigned 32-bits dividends in the interval [0, 1000000] leaves a remainder 3 when divided by 14.

built_in<14>                       minverse<14>                                 Noop
ratio (CPU time / Noop time)
Lower is faster
Figure 1

This is performed by two in-line functions: the first one, labelled built_in, contains a single return statement:

  return n % 14 == 3;

The one labelled minverse implements the modular inverse algorithm (minverse for short) which is this article’s subject. Both measurements include the time to scan the array of dividends without performing any modular calculation. This time is also plotted against the label Noop and it is used as unit of time. The times2 taken by minverse and the built-in algorithms are, respectively, 2.74 and 3.70. Adjusted times (obtained by subtracting the array scanning time) are then 2.74 - 1.00 = 1.74 and 3.70 - 1.00 = 2.70, which implies a ratio of 1.74 / 2.70 = 0.64. This represents a respectable 36% performance improvement with respect to the built-in algorithm. This ratio largely depends on the divisor, as will be made clear later on in this article.

Listings 1 and 2 contrast the code generated by GCC 8.2.1 (optimisation level at -O3) for the two algorithms. Although code size is not a perfect indication of speed, these two aspects are highly correlated.

built_in
   0:   89 fa             mov    %edi,%edx
   2:   b9 93 24 49 92    mov    $0x92492493,%ecx
   7:   d1 ea             shr    %edx
   9:   89 d0             mov    %edx,%eax
   b:   f7 e1             mul    %ecx
   d:   c1 ea 02          shr    $0x2,%edx
  10:   6b d2 0e          imul   $0xe,%edx,%edx
  13:   29 d7             sub    %edx,%edi
  15:   83 ff 03          cmp    $0x3,%edi
  18:   0f 94 c0          sete   %al
  1b:   c3                retq  
			
Listing 1
minverse
 0: 83 ef 03          sub   $0x3,%edi
 3: 69 ff b7 6d db b6 imul  $0xb6db6db7,%edi,%edi
 9: d1 cf             ror   %edi
 b: 81 ff 92 24 49 12 cmp   $0x12492492,%edi
11: 0f 96 c0          setbe %al
14: c3                retq
			
Listing 2

Preliminaries

This series of articles concerns the evaluation of modular expressions where the divisor is a compile time constant and the dividend is a runtime variable. Dividend and divisor have the same unsigned integer type.

The algorithms were implemented in a C++ library called qmodular [qmodular], short for quick modular. They can be implemented in standard C/C++ but qmodular uses a few GCC extensions including inline assembly. Our discussion focuses on GCC 8.2.1 for an x86_64 target but it is applicable to other compilers and architectures.

A fundamental reason for the currently built-in algorithm not being the most performant is that often it needlessly calculates remainders. For instance, n % d == m % d is, roughly speaking (more on this later), equivalent to (n - m) % d == 0. In this approach, there is no need to compute either n % d or m % d. However, the compiler evaluates both and ends up doing two modular calculations rather than one.

The built-in algorithm

For any given divisor, the compiler selects the best algorithm it knows for this particular value. Undoubtedly, the best case is where d is a power of two and n % d is evaluated as n & (d – 1). This is a classic bitwise trick [Warren]. For other divisors, n % d is evaluated by the equivalent expression n - (n / d) * d. (Recall that / is integer division.) Hence, n % d == r becomes n – (n / d) * d == r.

The most important and widely implemented optimisation replaces the expensive division with a multiplication (instruction mul in Listing 1) by a ‘magic’ number (the constant 0x92492493) and other cheap operations. Additional more trivial micro-optimisations are also applied. Two examples of such micro-optimisations follow.

When r is known to be zero at compile time, the subtraction is avoided and the expression is evaluated as n == (n / d) * d.

As mentioned earlier, when evaluating n % d == m % d, the compiler computes both remainders and, for each of them, performs a multiplication by the ‘magic’ number. Rather than twice loading this number into a register, this is done only once.

The modular inverse algorithm

This section covers the basics of the minverse algorithm by means of an example. A more detailed exhibition, including a mathematical proof of correctness, is provided in [qmodular]. We are interested in the case where n, d and r are unsigned integer numbers usually of 32 or 64 bits. For easy of exposition, we assume the number of bits is just 4 and hence, n ϵ Ω4 = {0, …, 15}, which is a set small enough to allow manual inspection of all possible dividends. For this example we set the divisor d = 6.

Any strictly positive integer is a product of a positive odd number and a power of two. The algorithm starts by finding this decomposition for the divisor. In our example, d = 6 = 3∙21 and we set h = 3 and k = 1 (the exponent of 2). Since h is odd, by well known arithmetical results, there exists a unique g ϵ Ω4 such that gh ≡ 1 (mod 24). Indeed, 11∙3 = 33 = 2∙24 + 1 and thus g = 11. More generally, the modulus can be any power of two, 2w. (Again, the cases of practical interest are w = 32 and w = 64 but here we take w = 4.) Numbers g and h are said to be modular inverse (modulo 2w) of one another and hence the algorithm’s name.

For any n ϵ Ωw = {0, …, 2w - 1}, let rorw(n, k) ϵ Ωw be obtained by rotating the bits of n by k positions to the right. For instance, ror4(3, 2) = 12 since 3 = (0011)2 and 12 = (1100)2.

Given n ϵ Ωw, the algorithm proceeds with the calculation of rorw(gn, k). Table 1 shows detailed steps to get ror4(11∙n, 1) for every n ϵ Ω4. (Rows are ordered by the last column.) One can easily inspect two crucial properties of the map nrorw(g∙n, k): it is bijective and for multiples of d it matches division by d, that is, rorw(gdi, k) = i. (See first three rows.) This map has other interesting3 properties but the two cited here are enough for the algorithm to work.

n 11∙n ( mod 24 ) ( 11∙n )2 ( ror( 11∙n, 1 ) )2 ror(11∙n, 1 )
0 0 0000 0000 0
6 2 0010 0001 1
12 4 0100 0010 2
2 6 0110 0011 3
8 8 1000 0100 4
14 10 1010 0101 5
4 12 1100 0110 6
10 14 1110 0111 7
3 1 0001 1000 8
9 3 0011 1001 9
15 5 0101 1010 10
5 7 0111 1011 11
11 9 1001 1100 12
1 11 1011 1101 13
7 13 1101 1110 14
13 15 1111 1111 15
Table 1

Let Nr be the number of elements of Ωw that are equivalent to r (mod d) and n = di + r, with 0 ≤ i < Nr, be one of them. Then n - r = id and we obtain rorw(g∙(n - r), k) = i < Nr. Reciprocally, if n ϵ Ωw is not equivalent to r (mod d), then n - r is not multiple of d and rorw(g∙(n - r), k) ≥ Nr.

In our example d = 6. Take, for instance, r = 2 and note the numbers in Ω4 that leave this remainder when divided by d = 6, namely, 2, 8 and 14. From this we get Nr = 3. Subtracting r = 2 from these three numbers yields 0, 6 and 12, respectively. They are mapped by nrorw(gn, k) into 0, 1 and 2, which are exactly the elements of Ω4 smaller than Nr = 3.

In summary, given d = h∙2k, with h odd and 0 ≤ r < d, we have nr (mod d), if and only if rorw(g∙(nr), k) < Nr, where g is the modular inverse of h (mod 2w) and Nr is the number of elements of {0, …, 2w - 1} that are equivalent to r (mod d). Notice that g and k depend on d only and can be computed at compile time. However, Nr depends on d and r. It can be known at compile time provided that r is.

At this point, Listing 2 becomes much less cryptic as the code generated for

  unsigned k = 1, g = 0xb6db6db7, r = 3,
    N = 0x12492493;
  return ror(g * (n – r), k) <= N - 1;

The snippet above calls the function ror, defined in qmodular, which is translated to the assembly instruction with the same name.

As in the built-in case, micro-optimisations can be applied. For instance, when r == 0 the subtraction from n can be elided. The same holds for the multiplication, when g == 1, and for ror, when k == 0. These optimisations are performed by the compiler and we should not worry about doing them ourselves. However, there is another trick that the compiler cannot figure out by itself. It is based on a property of our map that we have not yet used. To each divisor, there corresponds a special remainder s ≠ 0 for which the subtraction can be eliminated provided the comparison is reversed. More precisely, ns (mod d), if and only if rorw(gn, k) ≥ 2w - Nr. For instance, in Table 1 we see the three numbers leaving remainder 1, namely, 1, 7 and 13 are mapped, respectively, into 13, 14 and 15. Hence, the remainder is 1, if and only if the result is greater than or equal to 13 = 16 - 3 = 2w - Nr.

For evaluation of n % d == m % d, we use the fact that nm (mod d), if and only if n - m ≡ 0 (mod d). Because unsigned types implement modulo 2w arithmetic (not modulo d) extra care must be taken for the subtraction not to underflow. Hence, we apply the test to either n - m or m - n depending on whether nm or not, that is, n % d == m % d can be evaluated as follows:

  ror(g * (n >= m ? n - m : m - n), k) < N;

In reality, digressing from the modular inverse algorithm, this last reasoning equally applies to the built-in algorithm:

  (n >= m ? n - m : m - n) % d == 0;

This way, the compiler calculates just one remainder instead of two as we have mentioned earlier. For the sake of completeness, the performance analysis includes the expression above.

The unbounded case

An implicit assumption of the previous section is that remainders are bounded, r < d, otherwise the modular inverse algorithm does not work. Since this condition does not always hold, qmodular implements two functions similar to those in Listing 3. The first one is as in the previous section and assumes r < d. The second function tests the condition and delegates the call to the first one.

bool
modular_inverse_bounded(unsigned n, unsigned r)
{
  /* ... */ 
}

bool
modular_inverse_unbounded(unsigned n, unsigned r)
{ 
  return r < d & modular_inverse_bounded(n, r); 
}
			
Listing 3

When r is known at compile time the compiler can figure out by itself whether r < d or not. In the affirmative case, the test is removed altogether and the second function simply calls and returns the result of the first one. Similarly, the negative case is reduced to return false.

Some readers might be thinking the & (bitwise and) in Listing 3 is a typo and && (logical and) was intended. In reality, this is intentional. Due to short circuiting, && yields a branch and performance suffers accordingly. Since computations in modular_inverse_bounded are cheap, it is much more efficient to carry them out rather then aborting evaluation when the first condition is false.

Performance analysis

As in the warm up, all measurements shown in this section concern the evaluation of modular expressions for 65536 uniformly distributed unsigned 32-bits dividends in the interval [0, 1000000]. Remainders can be either fixed at compile time or variable at runtime. Charts show divisors are on the x-axis and time measurements, in nanoseconds, on the y-axis. Timings are adjusted to account for the time of array scanning.

For clarity we restrict divisors to [1, 50] which suffices to spot trends. (Results for divisors up to 1000 are available in [qmodular].) In addition, we filter out divisors that are powers of two since the bitwise trick is undoubtedly the best algorithm for them. The timings were obtained with the help of Google Benchmark [Google] running on an AMD Ryzen 7 1800X Eight-Core Processor @ 3600Mhz; caches: L1 Data 32K (x8), L1 Instruction 64K (x8), L2 Unified 512K (x8), L3 Unified 8192K (x2).

Figure 2 concerns the evaluation of n % d == 0 and plots times taken by the built-in and minverse algorithms. The latter is clearly faster than the former. Minverse’s pretty regular zigzag is due to the microoptimisation that removes ror, if and only if k = 0. That is, ror is used for all even divisors and only for them.

Figure 2

Figure 3 covers n % d == r where r is variable and uniformly distributed in [0, d). Despite the validity of precondition r < d, the compiler cannot know this and the boundness check is kept in the assembly code. Compared to the case of a fixed remainder, the built-in’s performance barely changes whereas minverse becomes visibly slower. This is due to the computation of Nr now being performed at runtime.

Figure 3

For most divisors the built-in algorithm is faster than minverse. There are exceptions, though. Such divisors, up to 50, can be clearly seen in the chart and are 7, 19, 21, 27, 31, 35, 37, 39 and 45. The blame lies in the ‘magic’ number. For these divisors, corresponding ‘magic’ numbers do not fit in 32-bits registers. Consequently, they are truncated and more instructions are needed to correct the result. (I would like to point out to an interesting work [ridiculous_fish] which suggests alternative ways of dealing with such divisors. An idea very much worth exploring.)

Finally, Figure 4 considers the expression n % d == m % d where both n and m are variable. As announced earlier, we also consider the variation that uses the absolute difference of n and m:

Figure 4
  (n >= m ? n – m : m – n) % d == 0;

This variation is labelled built_in_distance. Comparing this to the plain built-in algorithm we can see how effective this simple optimisation is. Nevertheless, minverse is still the fastest (it also uses the trick above).

Good and bad news regarding GCC 9.1

Starting at version 9.1, GCC implements the modular inverse algorithm. This welcome fact can be easily verified in [godbolt] by comparing the code generated for the following snippet with the code shown in Listing 2:

  bool f(unsigned n) {
    return n % 14 == 3;
  }

With GCC 8.2 for a x86-64 machine we obtain the code in Listing 4, and using GCC 9.1, we obtain the code in Listing 5.

movl %edi, %edx
movl $-1840700269, %ecx
shrl %edx
movl %edx, %eax
mull %ecx
shrl $2, %edx
imull $14, %edx, %edx
subl %edx, %edi
cmpl $3, %edi
sete %al
ret
			
Listing 4
imull $-1227133513, %edi, %edi
subl $613566757, %edi
rorl %edi
cmpl $306783378, %edi
setbe %al
ret
			
Listing 5

A small difference between qmodular’s implementation described here and GCC’s is worth mentioning because of its performance implications: while qmodular computes g∙(n - r), GCC computes gn - gr. It would be very costly to compute gr at runtime and, therefore, GCC only selects the minverse algorithm when the remainder is a compile time constant. (Even though and as we have seen, for some divisors minverse can beat the classic built-in algorithm.) Second, when r is a compile time constant it appears as an immediate value in the assembly code seen in Listing 2. In GCC’s implementation we see gr instead. Now, for (not so) small divisors and dividends (which is probably the most common case in practice), r is also small but gr is large. As it turns out, there are limitations on 64-bit immediate values which might force GCC’s implementation to use more instructions and registers.

Unfortunately, there seems to be yet another other issue with GCC’s implementation of minverse. For instance, for reasons unknown to me, changing the remainder from 3 to 4 is enough for the generated assembly to fall back to the old algorithm. Finally, regarding n % d == m % d, GCC does not use minverse at all and, instead, it computes the two remainders and compares them.

Conclusion

This article presents the modular inverse algorithm that can be used to evaluate expressions n % d == r and n % d == m % d. This algorithm is not new [Warren] but has not been widely implemented by compilers. Version 9.1 of GCC does implement it but there is room for further improvements. This work is intended to help compiler writers consider when they should use the modular inverse algorithm rather than what they currently use.

The greatest drawback regarding the minverse algorithm is that, to the best of my knowledge, it can not be efficiently used for other types of expressions like n % d < r or n % d ≤ r. These expressions and alternative algorithms for their evaluations are covered in Part II and Part III of this series.

Acknowledgements

I am deeply thankful to Fabio Fernandes for the incredible advice he provided during the research phase of this project. I am equally grateful to the Overload team for helping improve the manuscript.

References

[godbolt] https://godbolt.org/z/a7wr3U

[Google] https://github.com/google/benchmark

[qb] http://quick-bench.com/mhIaqB1ZvsBVROTGaO9opgx5ZTE

[qmodular] https://github.com/cassioneri/qmodular

[ridiculous_fish] ridiculous_fish, Labor of Division (Episode III): Faster Unsigned Division by Constants, October 19th, 2011, available at:http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html

[Warren] Henry S. Warren, Jr., Hacker's Delight, Second Edition, Addison Wesley, 2013

  1. Powered by quick-bench.com [qb]. For readers who are C++ programmers and do not know this site, I strongly recommend to check it out. In addition, I politely ask all readers to consider contributing to the site to keep it running. (Disclaimer: apart from being a regular user and donor, I have no other affiliation with this site.)
  2. YMMV, reported numbers were obtained by a single run in quick-bench.com using GCC 8.2 with -O3 and -std=c++17.
  3. More generally, for any r (not necessarily zero), numbers leaving remainder r are strictly increasingly mapped into (disjoint) intervals. For instance, 1, 7 and 13 leave remainder 1 and are mapped to {13, 14, 15}, whereas 4 and 10 leave remainder 4 and are mapped into {6, 7}. Boundaries from shaded to unshaded rows mark remainder changes.

Cassio Neri has a PhD in Applied Mathematics from Université de Paris Dauphine. He worked as a lecturer in Mathematics before moving to the financial industry.

Overload Journal #154 - December 2019 + Programming Topics