In a previous post, I described how addition can be used to flip runs of bits, but I left the method of adding large bit-vectors unexplained.
This post fills that gap by explaining how to add large bit-vectors and how to do so efficiently.
The examples in this post use vectors of Word8 in order to keep the explanation concise, whereas the real implementation uses vectors of Word64 for efficiency.
As per usual, Word values expressed in binary form are written in Little Endian order.
Overflows
CPUs perform addition on registers of finite size. We shall describe the register in question as having n bits, indexed from 0 to n - 1. When adding large integers, there inevitably comes a point where the resulting integer requires more than n bits to store, so it will not fit in the register and an overflow occurs.
During an overflow, the least significant n bits of the result are stored in the target register, resulting in a number that is smaller than at least one of the addends. The n-th bit, called the carry bit, would have been set to 1, but due to lack of register storage it is lost instead.
In order to perform additions on bit-vectors larger than n bits, we will need to somehow perform repeated additions using register-sized slices of the bit-vector, recover the carry bits, and propagate them.
Recovering the carry
As mentioned earlier, the carry bit is notionally set when there is an overflow, and an overflow also results in a truncated value that is smaller than at least one of the addends.
We can therefore determine whether the carry bit should be set by testing for the latter condition: total < a || total < b.
This allows us to write a function that returns both the total and the carry:
add :: Word64 -> Word64 -> (Word64, Bool)
add a b = (total, carry)
  where total = a + b
        carry = total < a || total < b
This is fine for adding the first two words in our bit-vector, but the addition of the following pairs of words will need to incorporate the carry. This ends up being a three-way addition that includes an input carry.
We can write another function addCarry to perform this three-way addition (full source).
addCarry :: Word64 -> Word64 -> Bool -> (Word64, Bool)
addCarry a b c = (t, carry0 || carry1)
  where (s, carry0) = add a b
        (t, carry1) = add s (if c then 1 else 0)
addCarry is implemented by calling add to add the first two numbers and then converting the carry to a 1 or 0 and adding that in as well.
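As a quick sanity check, a hypothetical GHCi session (assuming the definitions above are in scope) might look like this:
λ> addCarry maxBound 1 False
(0,True)
λ> addCarry 1 2 True
(4,False)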
Running this code shows that it takes about 3.7 seconds, which is fairly slow:
$ time ex-vector sum-bit-vectors -i ../hw-json/corpus/bench/78mb.json -i ../hw-json/corpus/bench/78mb.json --branchiness branchiest
3.674
First optimisation: Avoiding Bool
We can do better by avoiding the use of Bool, which is a sum type.
Using such types is fairly risky in high-performance code because the complexity of representing them in memory makes them slow. Ideally, we would like to be able to use them and depend on the compiler to fully optimise them away, but this doesn’t always happen.
We can see that GHC has failed to optimise away these constants by looking at GHC core. GHC core is an intermediate representation used by the compiler for things such as optimisation.
We can instruct GHC to emit GHC core by invoking it with additional flags:
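One possible way to do this (a sketch of an invocation; the exact flags behind the listings below may differ) is to dump the simplified Core during compilation, or to inspect the unfoldings recorded in the interface file, where Module stands in for the module containing addCarry:
$ ghc -O2 -fforce-recomp -ddump-simpl Module.hs
$ ghc --show-iface Module.hi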
The GHC core for addCarry is reproduced here:
$waddCarry :: GHC.Word.Word64 -> GHC.Word.Word64 -> GHC.Types.Bool -> (# GHC.Word.Word64, GHC.Types.Bool #)
{- Arity: 3, HasNoCafRefs, Strictness: <L,U(U)><L,U(U)><L,1*U>,
Inline: [0],
Unfolding: (\ (w :: GHC.Word.Word64)
(w1 :: GHC.Word.Word64)
(w2 :: GHC.Types.Bool) ->
let {
total :: GHC.Word.Word64
= case w of wild { GHC.Word.W64# x# ->
case w1 of wild1 { GHC.Word.W64# y# ->
GHC.Word.W64# (GHC.Prim.plusWord# x# y#) } }
} in
let {
b :: GHC.Word.Word64
= case w2 of wild {
GHC.Types.False -> Ops.SumBitVectors.Word64.Branchiest.addCarry2
GHC.Types.True -> Ops.SumBitVectors.Word64.Branchiest.addCarry1 }
} in
let {
total1 :: GHC.Word.Word64
= case total of wild { GHC.Word.W64# x# ->
case b of wild1 { GHC.Word.W64# y# ->
GHC.Word.W64# (GHC.Prim.plusWord# x# y#) } }
} in
(# total1,
case total of wild { GHC.Word.W64# x ->
case w of wild1 { GHC.Word.W64# y ->
case GHC.Prim.ltWord# x y of lwild {
DEFAULT
-> case w1 of wild2 { GHC.Word.W64# y1 ->
case GHC.Prim.ltWord# x y1 of lwild1 {
DEFAULT
-> case total1 of wild3 { GHC.Word.W64# x1 ->
case GHC.Prim.ltWord# x1 x of lwild2 {
DEFAULT
-> case b of wild4 { GHC.Word.W64# y2 ->
GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.ltWord# x1 y2) }
1# -> GHC.Types.True } }
1# -> GHC.Types.True } }
1# -> GHC.Types.True } } } #)) -}
We can see from the above dump that the addCarry function has nicely inlined away all the calls to add. We can also observe the use of the True and False values on lines 16, 17, 38, 39, and 40.
Moreover, we can count the number of branch instructions in the core by looking at case statements that have at least two branches.
Such case statements can be identified on lines 15, 28, 31, and 34, adding up to 4 branches in addCarry.
We can avoid the use of this inefficient data type by replacing Bool with Word64, True with 1 and False with 0.
This simple change also allows us to avoid the if expression previously used to convert the incoming carry into a number, so we can expect to avoid one of the branches as well (full source):
add :: Word64 -> Word64 -> (Word64, Word64)
add a b = (total, newCarry)
  where total = a + b
        newCarry = if total < a || total < b then 1 else 0

addCarry :: Word64 -> Word64 -> Word64 -> (Word64, Word64)
addCarry a b c = (t, carry0 + carry1)
  where (s, carry0) = add a b
        (t, carry1) = add s c
Notice in particular that carry0 || carry1 became carry0 + carry1 rather than carry0 .|. carry1.
This is to ensure that addCarry behaves correctly for all possible inputs, even though we never use a value of c other than 1 or 0, under which circumstances either would work.
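Again, a quick check in a hypothetical GHCi session:
λ> addCarry maxBound 1 0
(0,1)
λ> addCarry maxBound maxBound 1
(18446744073709551615,1)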
This results in the following core:
$waddCarry ::
GHC.Word.Word64
-> GHC.Word.Word64
-> GHC.Word.Word64
-> (# GHC.Word.Word64, GHC.Word.Word64 #)
{- Arity: 3, HasNoCafRefs, Strictness: <L,U(U)><L,U(U)><L,U(U)>,
Inline: [0],
Unfolding: (\ (w :: GHC.Word.Word64)
(w1 :: GHC.Word.Word64)
(w2 :: GHC.Word.Word64) ->
let {
total :: GHC.Word.Word64
= case w of wild { GHC.Word.W64# x# ->
case w1 of wild1 { GHC.Word.W64# y# ->
GHC.Word.W64# (GHC.Prim.plusWord# x# y#) } }
} in
let {
total1 :: GHC.Word.Word64
= case total of wild { GHC.Word.W64# x# ->
case w2 of wild1 { GHC.Word.W64# y# ->
GHC.Word.W64# (GHC.Prim.plusWord# x# y#) } }
} in
(# total1,
case total of wild { GHC.Word.W64# x ->
case w of wild1 { GHC.Word.W64# y ->
let {
$j :: GHC.Prim.Word# -> GHC.Word.Word64
<join 1> {- Arity: 1, Strictness: <S,U>m -}
= \ (x# :: GHC.Prim.Word#)[OneShot] ->
case total1 of wild2 { GHC.Word.W64# x1 ->
case GHC.Prim.ltWord# x1 x of lwild {
DEFAULT
-> case w2 of wild3 { GHC.Word.W64# y1 ->
case GHC.Prim.ltWord# x1 y1 of lwild1 {
DEFAULT -> GHC.Word.W64# x#
1# -> GHC.Word.W64# (GHC.Prim.or# x# 1##) } }
1# -> GHC.Word.W64# (GHC.Prim.or# x# 1##) } }
} in
case GHC.Prim.ltWord# x y of lwild {
DEFAULT
-> case w1 of wild2 { GHC.Word.W64# y1 ->
case GHC.Prim.ltWord# x y1 of lwild1 {
DEFAULT -> $j 0##
1# -> $j 1## } }
1# -> $j 1## } } } #)) -}
The core now shows three branching case expressions, on lines 31, 34 and 39.
Let’s see how this performs:
$ time ex-vector sum-bit-vectors -i ../hw-json/corpus/bench/78mb.json -i ../hw-json/corpus/bench/78mb.json --branchiness branchy
1.334
That works out to be a reduction in runtime of about 63.7%.
Branchless tests
We can do a little bit better by further trying to reduce the number of branches in our code.
Let’s zoom in to a smaller section of the core we saw before:
(# total1,
case total of wild { GHC.Word.W64# x ->
case w of wild1 { GHC.Word.W64# y ->
let {
$j :: GHC.Prim.Word# -> GHC.Word.Word64
<join 1> {- Arity: 1, Strictness: <S,U>m -}
= \ (x# :: GHC.Prim.Word#)[OneShot] ->
case total1 of wild2 { GHC.Word.W64# x1 ->
case GHC.Prim.ltWord# x1 x of lwild {
DEFAULT
-> case w2 of wild3 { GHC.Word.W64# y1 ->
case GHC.Prim.ltWord# x1 y1 of lwild1 {
DEFAULT -> GHC.Word.W64# x#
1# -> GHC.Word.W64# (GHC.Prim.or# x# 1##) } }
1# -> GHC.Word.W64# (GHC.Prim.or# x# 1##) } }
} in
case GHC.Prim.ltWord# x y of lwild {
DEFAULT
-> case w1 of wild2 { GHC.Word.W64# y1 ->
case GHC.Prim.ltWord# x y1 of lwild1 {
DEFAULT -> $j 0##
1# -> $j 1## } }
1# -> $j 1## } } } #)) -}
On lines 9, 12 and 20 are mysterious calls to a function called ltWord#.
The ltWord# function is actually what we call a primop: a special function that is implemented internally by GHC. The ltWord# primop behaves like (<) except that it returns 1# instead of True and 0# instead of False.
We can find this function in the docs as having the type Word# -> Word# -> Int#.
Notice that the primop uses the types Word# and Int#. These are unboxed types, which have less overhead than boxed types like Word64 or Bool, and GHC will optimise boxed types to unboxed types when it believes it can safely do so.
Our performance problem lies in the fact that whenever our compiled code calls ltWord#, it tests the return value against DEFAULT (i.e. 0#) and 1#, whereas if we recognised that the result is just an integer and used it in arithmetic directly, we could avoid the branch.
But because ltWord# has unboxed types in its signature, calling it directly would be inconvenient, so we write a wrapper that uses boxed types and depend on GHC to optimise away our pessimisation:
{-# LANGUAGE MagicHash #-}

import Data.Bits ((.|.))
import GHC.Int
import GHC.Prim
import GHC.Word hiding (ltWord)

-- Branchless less-than: evaluates to 1 when a < b and to 0 otherwise.
ltWord :: Word64 -> Word64 -> Word64
ltWord (W64# a#) (W64# b#) = fromIntegral (I64# (ltWord# a# b#))
Notice the type signature we’ve chosen to use is Word64 -> Word64 -> Word64 instead of the Word64 -> Word64 -> Bool of the (<) operator.
Such tests are called branchless comparisons because a branch is not required to use the result of the comparison.
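For example, in a hypothetical GHCi session with the module above loaded:
λ> 3 `ltWord` 5
1
λ> 5 `ltWord` 3
0
λ> 5 `ltWord` 5
0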
We can then define a version of add without an if statement:
add :: Word64 -> Word64 -> (Word64, Word64)
add a b = (total, newCarry)
  where total = a + b
        newCarry = (total `ltWord` a) .|. (total `ltWord` b)
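A hypothetical GHCi check shows the branchless add producing the expected carries:
λ> add maxBound 2
(1,1)
λ> add 1 2
(3,0)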
Looking at the resulting core we see:
$waddCarry :: GHC.Word.Word64 -> GHC.Word.Word64 -> GHC.Word.Word64 -> (# GHC.Word.Word64, GHC.Word.Word64 #)
{- Arity: 3, HasNoCafRefs, Strictness: <L,U(U)><L,U(U)><L,U(U)>,
Inline: [0],
Unfolding: (\ (w :: GHC.Word.Word64)
(w1 :: GHC.Word.Word64)
(w2 :: GHC.Word.Word64) ->
let {
total :: GHC.Word.Word64
= case w of wild { GHC.Word.W64# x# ->
case w1 of wild1 { GHC.Word.W64# y# ->
GHC.Word.W64# (GHC.Prim.plusWord# x# y#) } }
} in
let {
total1 :: GHC.Word.Word64
= case total of wild { GHC.Word.W64# x# ->
case w2 of wild1 { GHC.Word.W64# y# ->
GHC.Word.W64# (GHC.Prim.plusWord# x# y#) } }
} in
(# total1,
case total of wild { GHC.Word.W64# a# ->
case w of wild1 { GHC.Word.W64# x# ->
case w1 of wild2 { GHC.Word.W64# y# ->
case total1 of wild3 { GHC.Word.W64# a#1 ->
case w2 of wild4 { GHC.Word.W64# y#1 ->
GHC.Word.W64#
(GHC.Prim.or#
(GHC.Prim.int2Word# (GHC.Prim.ltWord# a# (GHC.Prim.or# x# y#)))
(GHC.Prim.int2Word#
(GHC.Prim.ltWord# a#1 (GHC.Prim.or# a# y#1)))) } } } } } #)) -}
And we are happy to find that all the branches are gone.
Running this version of the code yields an even shorter runtime:
$ time ex-vector sum-bit-vectors -i ../hw-json/corpus/bench/78mb.json -i ../hw-json/corpus/bench/78mb.json --branchiness branchless
1.005
This optimisation shaves another modest 9.0% off the original runtime, bringing the total savings to 72.6%.
A little refactoring
Whilst we’re here, I would like to mention that there is another possible implementation of the add function.
Until now, we’ve detected that an overflow has happened by using the test total < a || total < b.
We might have tried the alternative, more intuitive test total < a + b instead, but sadly that does not work because, as mentioned earlier, we lose the carry bit: inside the register, a + b is the truncated total itself.
We can however make the observation that, unconstrained by word sizes, the test total < a + b is valid, and we can use this to derive an alternative test that also works.
Jumping back to our first implementation, the expression total < a || total < b can be rewritten as total < max a b, where max is a function that returns the largest of its arguments. That is to say, the total is less than one of the addends exactly when the total is less than the largest of them.
We now have two different expressions that test for overflow: total < max a b and, unconstrained by word size, total < a + b.
We can then make the observation that the following inequality holds: max a b <= a + b.
Despite them both being valid tests for overflow, the two right-hand sides are not necessarily equal. For example, when both of the values a and b differ from 0, the expressions are not equal, but in this case, the left is less than the right.
We can then devise a simpler expression that we can prove evaluates to a result that lies between these two extremes; being bounded by those other expressions means that it also constitutes a valid test for overflow.
a .|. b is one such expression.
This means the following inequality should hold: max a b <= a .|. b <= a + b.
We know the left inequality is true because the largest addend ORed with any other number will be at least the value of the largest addend.
We know the right inequality to be true because we can consider two cases. If there is no overlap in the bit patterns of the addends, then the LHS and the RHS end up being equal. If there is overlap, then the LHS is conceptually like a sum that has lost the overlapping bits from one of the addends, resulting in a value smaller than that given by an actual addition.
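We could also gain some empirical confidence in this inequality with a QuickCheck property. The following is a sketch of our own (not part of the original code); the sums are computed over Integer to avoid the very overflow we are reasoning about:
import Data.Bits ((.|.))
import Data.Word (Word64)
import Test.QuickCheck (quickCheck)

prop_orIsBetween :: Word64 -> Word64 -> Bool
prop_orIsBetween a b =
  toInteger (max a b) <= toInteger (a .|. b) &&
  toInteger (a .|. b) <= toInteger a + toInteger b

-- quickCheck prop_orIsBetween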
We can therefore substitute the test total < (a .|. b) instead, which has fewer instructions.
This results in the following implementation:
add :: Word64 -> Word64 -> (Word64, Word64)
add a b = (total, newCarry)
  where total = a + b
        newCarry = total `ltWord` (a .|. b)
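To convince ourselves that this refactoring does not change behaviour, we could sketch another QuickCheck property of our own (assuming ltWord from earlier is in scope) asserting that both carry computations agree for every pair of addends:
prop_carryAgrees :: Word64 -> Word64 -> Bool
prop_carryAgrees a b =
  ((total `ltWord` a) .|. (total `ltWord` b)) == total `ltWord` (a .|. b)
  where total = a + b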
In practice, the impact of reducing the implementation by this many instructions is negligible, but the latter technique is used in the hw-dsv library, so it needed some explanation.
Closing Remarks
In this blog post, we implemented the addCarry function, explored different implementations, and evaluated various optimisation strategies, such as avoiding the boxed type Bool and using branchless comparisons, to reduce the number of branches in the optimised code.
All that remains in order to perform a bit-vector addition is to invoke the addCarry function over all the words in the input bit-vectors, ensuring that carries are propagated properly between adjacent additions.
A future blog post will look at how we can write such a function in a pure functional setting where we require mutation to populate the result bit-vector, but still want to retain the safety that immutability affords.
Appendix
Summary of Benchmarks
$ time ex-vector sum-bit-vectors -i ../hw-json/corpus/bench/78mb.json -i ../hw-json/corpus/bench/78mb.json --branchiness branchiest
3.674
$ time ex-vector sum-bit-vectors -i ../hw-json/corpus/bench/78mb.json -i ../hw-json/corpus/bench/78mb.json --branchiness branchy
1.334
$ time ex-vector sum-bit-vectors -i ../hw-json/corpus/bench/78mb.json -i ../hw-json/corpus/bench/78mb.json --branchiness branchless
1.005
The source code for the above benchmarks can be found in the Branchiest, Branchy, and Branchless modules.