Adding bit vectors - Branchless Comparisons

Posted on February 22, 2019 by John Ky

In a previous post, I described how addition can be used to flip runs of bits, but I left the method of adding large bit-vectors unexplained.

This post will fill this gap by explaining how to add large bit-vectors and how to do so efficiently.

For the sake of concise explanation, the examples in this post will use vectors of Word8, whereas the real implementation uses vectors of Word64 for efficiency.

As per usual, words expressed in binary form are written in little-endian order.

Overflows

CPUs perform addition on registers of finite size. We shall describe the register in question as having n bits indexed from 0 to n - 1. When adding large integers, there inevitably comes a point where the resulting integer requires more than n bits to store, so it cannot fit in the register and an overflow occurs.

During an overflow, the least significant n bits of the result are stored in the target register, resulting in a number that is smaller than at least one of the addends. The n^{th} bit, called the carry bit, would have been set to 1, but due to the lack of register storage it is lost instead.
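
To make this concrete with 8-bit registers (n = 8): adding 200 and 100 gives 300, which needs 9 bits to store. The register keeps only the least significant 8 bits, 300 - 256 = 44, which is smaller than both addends, and the ninth bit, the carry, is lost.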

In order to perform additions on bit-vectors larger than n bits, we will need to perform repeated additions over register-sized slices of the bit-vector, recovering the carry bits and propagating them.

Recovering the carry

As mentioned earlier, the carry bit is notionally set when there is an overflow and an overflow also results in a truncated value that is smaller than at least one of the addends.

We can therefore determine whether the carry bit should be set by testing for the latter condition: total < a \lor total < b

This allows us to write a function that returns both the total and the carry:
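
A minimal sketch of such a function, assuming Word64 words as in the real implementation:

    import Data.Word (Word64)

    -- Returns the (possibly truncated) total together with a carry,
    -- recovered with the overflow test described above.
    add :: Word64 -> Word64 -> (Word64, Bool)
    add a b = (total, carry)
      where total = a + b
            carry = total < a || total < b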

This is fine for adding the first two words in our bit-vector, but the addition of subsequent pairs of words will need to incorporate the carry. This ends up being a three-way addition that includes an input carry.

We can write another function addCarry to perform this three-way addition (full source).

addCarry is implemented by calling add to add the first two numbers, then converting the input carry to a 1 or 0 and adding that in as well.
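
A minimal sketch of addCarry along these lines (the exact argument order and naming are assumptions rather than the verbatim source):

    -- Three-way addition: add the first two words, convert the input
    -- carry to 1 or 0 and add that in, then combine the two carries.
    addCarry :: Word64 -> Word64 -> Bool -> (Word64, Bool)
    addCarry a b c = (total1, carry0 || carry1)
      where (total0, carry0) = add a b
            (total1, carry1) = add total0 (if c then 1 else 0)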

Running this code shows that it takes 3.1 seconds, which is fairly slow:

First optimisation: Avoiding Bool

We can do better by avoiding the use of Bool, which is a sum type.

Using such types is fairly risky in high-performance code because the complexity of representing them in memory makes them slow. Ideally, we would like to be able to use them and depend on the compiler to fully optimise them away, but this doesn’t always happen.

We can see that GHC has failed to optimise away these constants by looking at GHC core. GHC core is an intermediate representation used by the compiler for things such as optimisation.

We can instruct GHC to emit GHC core by invoking it with additional flags:
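
The precise flag set isn’t important; an invocation along the following lines (the flag choice here is one common combination, not necessarily the one used for the post) writes the simplified core for the Branchiest module to a dump file:

    ghc -O2 -fforce-recomp -ddump-simpl -dsuppress-all -ddump-to-file Branchiest.hs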

The GHC core for addCarry is reproduced here:

We can see from the above dump that the addCarry function has nicely inlined away all the calls to add. We can also observe the use of the True and False values on lines 16, 17, 38, 39, and 40.

Moreover, we can count the number of branch instructions in the core by looking at case statements that have at least two branches.

Such case statements can be identified on lines 15, 28, 31, and 34, adding up to 4 branches in addCarry.

We can avoid the use of the inefficient data type by replacing Bool with Word64, True with 1 and False with 0.

This simple change also allows us to avoid the if expression previously used to compute the total, so we can expect to avoid one of the branches as well (full source):
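
A sketch of the adjusted functions, approximating rather than reproducing the full source:

    -- The carry is now a Word64 that is always 0 or 1.
    add :: Word64 -> Word64 -> (Word64, Word64)
    add a b = (total, carry)
      where total = a + b
            carry = if total < a || total < b then 1 else 0

    -- The input carry no longer needs converting, so the if expression
    -- previously used to compute the total disappears.
    addCarry :: Word64 -> Word64 -> Word64 -> (Word64, Word64)
    addCarry a b c = (total1, carry0 + carry1)
      where (total0, carry0) = add a b
            (total1, carry1) = add total0 c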

Notice in particular that the carry0 || carry1 became carry0 + carry1 rather than carry0 .|. carry1.

This is to ensure that addCarry behaves correctly for all possible inputs even though we never use a value of c other than 1 or 0, under which circumstances either would work.

This results in the following core:

The core now shows three branching case expressions, on lines 31, 34 and 39.

Let’s see how this performs:

That works out to be a performance boost of about 63.7%.

Branchless tests

We can do a little bit better by further trying to reduce the number of branches in our code.

Let’s zoom in to a smaller section of the core we saw before:

On lines 9, 12 and 20 are mysterious calls to a function called ltWord#.

The ltWord# function is actually what we call a primop: A special function that is implemented internally by GHC. The ltWord# primop behaves like (<) except that it will return 1# instead of True and 0# instead of False.

We can find this function in the docs as having the type Word# -> Word# -> Int#.

Notice that the primop uses the types Word# and Int#. These are unboxed values. These types have less overhead than the boxed types like Word64 or Bool, and GHC will optimise boxed types to unboxed types when it believes it can safely do so.

Our performance problem lies in the fact that whenever our compiled code calls ltWord#, it tests the return value against DEFAULT (ie. 0#) and 1#, whereas if we recognised that the result is an integer and used it in arithmetic directly, we could avoid the branch.

But because ltWord# has unboxed types in its signature, calling it directly would be inconvenient so we write a wrapper that uses boxed types and depend on GHC to optimise away our pessimisation:
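
One possible sketch of such a wrapper (the way a Word64 is unpacked differs between GHC versions, so this version goes via the machine-sized Word, which is 64 bits wide on the platforms we care about, and relies on GHC to optimise the conversions away):

    {-# LANGUAGE MagicHash #-}

    import Data.Word (Word64)
    import GHC.Exts  (Int (I#), Word (W#), ltWord#)

    -- Branchless less-than: evaluates to 1 when a < b and 0 otherwise,
    -- without producing a Bool along the way.
    ltWord :: Word64 -> Word64 -> Word64
    ltWord a b =
      case fromIntegral a of
        W# a# -> case fromIntegral b of
          W# b# -> fromIntegral (I# (ltWord# a# b#))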

Notice the type signature we’ve chosen to use is Word64 -> Word64 -> Word64 instead of the Word64 -> Word64 -> Bool of the (<) operator.

Such tests are called branchless comparisons because a branch is not required to use the result of the comparison.

We can then define a version of add without an if statement:
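
A sketch of the branchless add, using the ltWord wrapper above in place of the comparison and if expression:

    import Data.Bits ((.|.))

    -- The two overflow comparisons now yield 0 or 1 directly, and are
    -- combined with a bitwise OR, so no branch is needed for the carry.
    add :: Word64 -> Word64 -> (Word64, Word64)
    add a b = (total, carry)
      where total = a + b
            carry = (total `ltWord` a) .|. (total `ltWord` b)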

Looking at the resulting core we see:

And we are happy to find that all the branches are gone.

Running this version of the code shows that it runs in less time again:

This optimisation shaves another modest 9.0% off the runtime, bringing the savings to a total of 72.6%.

A little refactoring

Whilst we’re here, I would like to mention there is another possible implementation of the add function.

Until now, we’ve detected when an overflow has happened by using the following test: total < a \lor total < b.

We might have tried the alternative, more intuitive test total < a + b instead, but sadly that does not work: as mentioned earlier, the carry bit is lost, so within the register a + b evaluates to total itself and the comparison is never true.

We can, however, observe that when unconstrained by word size the test is valid, and we can use this fact to derive an alternative test that also works.

Jumping back to our first implementation, the expression total < a \lor total < b can be rewritten as total < \max(a, b), where \max is a function that returns the larger of its arguments. That is to say, the total is less than at least one of the addends exactly when it is less than the larger of them.

We now have two different expressions that test for overflow:

  • total < \max(a, b)
  • total < a + b

We can then make the observation that the following inequality holds:

\max(a, b) \leq a + b

Despite both being valid tests for overflow, the two expressions are not necessarily equal: whenever both a and b are non-zero, \max(a, b) is strictly less than a + b.

We can then devise a simpler expression that provably evaluates to a result lying between these two extremes. Being bounded by those two expressions means that it also constitutes a valid test for overflow: during an overflow the total falls below the smaller bound and therefore below the middle expression, whereas without an overflow the total equals a + b and so cannot fall below it.

The bitwise OR, a \lor b, is one such expression.

This means the following inequality should hold:

\max(a, b) \leq a \lor b \leq a + b

We know the left inequality is true because the largest addend ORed with any other number will be at least the value of the largest addend.

We know the right inequality to be true because we can consider two cases. If there is no overlap in the bit patterns of the addends then the LHS and the RHS end up being equal. If there is overlap, then the LHS is conceptually like a sum that has lost those bits where there is overlap in one of the registers, resulting in a value smaller than that given by an actual addition.
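
As a concrete check, take a = 5 and b = 3, whose bit patterns overlap in the lowest bit: \max(a, b) = 5, a \lor b = 7 and a + b = 8, so 5 \leq 7 \leq 8 as expected.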

We can therefore substitute the following test, which requires fewer instructions: total < (a \lor b)

This results in the following implementation:
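
A sketch of that implementation, again using ltWord:

    -- The single comparison against (a .|. b) replaces the two
    -- comparisons used previously.
    add :: Word64 -> Word64 -> (Word64, Word64)
    add a b = (total, total `ltWord` (a .|. b))
      where total = a + b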

In practice, the impact of saving these few instructions is negligible, but the latter technique is used in the hw-dsv library, so it needed some explanation.

Closing Remarks

In this blog post we implemented the addCarry function, explored different implementations, and evaluated various optimisation strategies, such as avoiding the boxed type Bool and using branchless comparisons to reduce the number of branches in the optimised code.

All that remains in order to perform a bit-vector addition is to invoke the addCarry function over all the words in the input bit-vectors, ensuring that carries are propagated properly between adjacent additions.

A future blog post will look at how we can write such a function in a pure functional setting where we require mutation to populate the result bit-vector, but still want to retain the safety that immutability affords.

Appendix

Summary of Benchmarks

The source code for the above benchmarks can be found in the Branchiest, Branchy, and Branchless modules.

References