In last week’s post I described how to produce rank-select bit-strings for an RFC-compliant CSV format without detailing how to perform efficiently the operation that splits odd and even bits of a bit-string into separate bit-strings.
In an earlier post I discussed rank
and select
operations, but also omitted to describe how they can be implemented efficiently.
In this blog post I will properly introduce the popcnt
, pext
, tzcnt
and pdep
operations and how they relate to the performance of our conceptual succinct data-structure based CSV parser.
Bit-manipulation instructions
Pop Count
The popcnt
operation is quite straight-forward. It takes an integer as its input argument, counts the number of 1-bits and produces an integer containing the count.
In the following example, I count the number of 1-bits in my bit-string and find there are 5:
~/wrk/haskell-works/hw-rankselect-base $
$ stack repl
λ> import Data.Word
λ> let bs = fromJust $ bitRead "01110101" :: Word8
λ> popCount1 bs
5
Trailing Zeros Count
The tzcnt
operation is also quite straight-forward. It counts the number of “trailing zeros” in the bit-string.
Unfortunately, because we are on the topic of succinct data-structures, all our bits are expressed in Little-Endian, so visually we are looking at the leading zeros.
In the following example, I count the number of trailing zeros in my bit-string and find there are 3 (as indicated by the ***
annotation):
~/wrk/haskell-works/hw-rankselect-base $
$ stack repl
λ> import Data.Word ***
λ> let bs = fromJust $ bitRead "00010101" :: Word8
λ> countTrailingZeros bs
3
Parallel Extract
The pext
or “parallel extract” operation , is an operation that takes a bit-mask and extracts the bits from the source
word corresponding to 1-bits in the bit-mask and packs them into least-significant side of the word.
* All words in Little-Endian *
mask 0011001100110011
source 0110101110101001
bits 10 11 10 01
┌─┘│ ││ ││ ││
│┌─┘ ││ ││ ││
││┌───┘│ ││ ││
│││┌───┘ ││ ││
││││┌─────┘│ ││
│││││┌─────┘ ││
││││││┌───────┘│
│││││││┌───────┘
result 1011100100000000
In the following example, I extract the high nibble of each byte in my Word32
by using a mask where all the high-nibble bits are set to 1
:
~/wrk/haskell-works/hw-rankselect-base $
$ stack repl
λ> import Data.Word
λ> import Data.Bits.Pext
λ> let source = fromJust $ bitRead "01110101 11001010 00001001 01011000" :: Word32
λ> let mask = fromJust $ bitRead "00001111 00001111 00001111 00001111" :: Word32
λ> bitShow $ pext source mask
"01011010 10011000 00000000 00000000"
Parallel Deposit
The pdep
or “parallel deposit” operation, is an operation that takes a bit-mask and deposits the least-significant n-bits from the source
word where n
is the number of bits in the bit-mask and deposits them at the positions marked by 1-bits in the bit-mask.
* All words in Little-Endian *
mask 0011001100110011
source 1011100100000000
│││││││└───────┐
││││││└───────┐│
│││││└─────┐ ││
││││└─────┐│ ││
│││└───┐ ││ ││
││└───┐│ ││ ││
│└─┐ ││ ││ ││
└─┐│ ││ ││ ││
bits 10 11 10 01
bits 0010001100100001
In the following example, I deposit the first eight bits in my bits-string into the high nibble of each byte in my Word32
, again by using a mask where all the high-nibble bits are set to 1
:
~/wrk/haskell-works/hw-rankselect-base $
$ stack repl
λ> import Data.Word
λ> import Data.Bits.Pdep
λ> let source = fromJust $ bitRead "01011010 10011000 00000000 00000000" :: Word32
λ> let mask = fromJust $ bitRead "00001111 00001111 00001111 00001111" :: Word32
λ> bitShow $ pdep source mask
"00000101 00001010 00001001 00001000"
Availability on GHC
In Haskell, the popCount
operation is available via the popCount
function in Data.Bits
module of the base package.
On most modern x86 systems, the popCount
operation is available on 64-bit integers in the form of the POPCNT
CPU instruction, so it is very fast.
It is also available to C programs by way of the _popcnt64
CPU intrinsic, but what about Haskell?
To confirm that the function does in fact compile down to an instruction we first check the definition here.
and notice the use of the primop popCnt#
.
Digging further into the GHC source code:
sse4_2 <- sse4_2Enabled
let platform = targetPlatform dflags
if sse4_2
then
....
unitOL (MOVZxL II8 (OpReg src_r) (OpReg src_r)) `appOL`
unitOL (POPCNT II16 (OpReg src_r) dst_r)
We find that this primop generates the POPCNT
instruction provided that sse4_2Enabled
yields True
, which according to the GHC User Guide can be switched on with the -msse4.2
ghc
flag since GHC 7.4.1 on x86 CPU architectures.
A similar search can be done for the tzcnt
operation, but unfortunately the equivalent primop expands to a whole bunch of instructions:
([ MOV II32 (OpReg rhi) (OpReg tmp_r)
, OR II32 (OpReg rlo) (OpReg tmp_r)
, MOV II32 (OpImm (ImmInt 64)) (OpReg dst_r)
, JXX EQQ lbl2
, JXX ALWAYS lbl1
, NEWBLOCK lbl1
, BSF II32 (OpReg rhi) dst_r
, ADD II32 (OpImm (ImmInt 32)) (OpReg dst_r)
, BSF II32 (OpReg rlo) tmp_r
, CMOV NE II32 (OpReg tmp_r) dst_r
, JXX ALWAYS lbl2
, NEWBLOCK lbl2
])
The operation is emulated with BSF
or Bit-Scan-Forward, which does something similar, but only works on 32-bit integers.
Unfortunately for me, I needed optimised pdep
and pext
primops from ghc
for my succinct data-structure libraries, but sadly they weren’t available at the time I sought them, which was about this time last year.
I created a trac issue, and one thing led to another and I ended up implementing these primops in ghc
over the space of several months with the kind help of Ben Gamari and Moritz Angermann.
The functionality is not exposed in base
, but they can be accessed from the Haskell-Works bits-extra library. Benchmark results can be found there-in.
If anyone is looking to jump into GHC development, a good first project that can improve the performance of succinct data-structures on Haskell code would be to add native support for ctz#
primop in GHC.
GHC needs more contributions and it would be great to see GHC become a better platform for writing high performance code.
Applications
Very fast implementation of Rank and Select
About this time last year, Ed Kmett, sent me very interesting paper on A General-Purpose Counting Filter: Making Every Bit Count.
To really miss the amazing contribution of the paper and zoom in on the section relevant to this discussion go to section 3.2 “Fast x86 rank and select”, which describes how the PDEP
and TZCNT
instructions can be used to implement fast rank and select.
The formula for rank is given in the paper as:
The rank i
of a bit-string is the prefix-sum of the bit-string of the given length i
.
The diagram below shows five examples of rank operations on the same bit-string (b).
For the operation rank i
, the expression 2ⁱ-1
computes a bit-mask (a) consisting of a prefix of 1-bits of length i
that can be used to zero in our bit-string (b) all the bits beyond the prefix length to produce (c).
A pop count of (c) counts all the 1-bits annotated by *
(d) and becomes our rank (e).
rank 2 rank 4 rank 6 rank 8 rank 10
(a) 1100000000000000 1111000000000000 1111110000000000 1111111100000000 1111111111000000
(b) 0100100010011000 0100100010011000 0100100010011000 0100100010011000 0100100010011000
(c) 0100000000000000 0100000000000000 0100100000000000 0100100000000000 0100100010000000
(d) * * * * * * * * *
(e) 1 1 2 2 3
The select i
of a bit-string is length of the smallest prefix that includes i
1-bits in the bit-string. Unfortunately, I feel like there is an off-by-one error in the formula and that it should actually be:
The diagram below shows five examples of select operations on the same bit-string (b).
For the operation select i
, the expression 2
i-1
computes a bit-string containing exactly one 1-bit at the i-th position in the bit-string (1-based).
The pdep
operation is called to deposit this bit at the position corresponding to the n-th 1-bit in (b) to produce the result (c).
The number of trailing zeros (indicated by the *
) when added to one yields the value of select i
.
select 1 select 2 select 3 select 4 select 5
(a) 1000000000000000 0100000000000000 0010000000000000 0010000000000000 0010000000000000
││││└───────┐ ││││└───────┐ ││││└───────┐ ││││└───────┐ ││││└───────┐
│││└────┐ │ │││└────┐ │ │││└────┐ │ │││└────┐ │ │││└────┐ │
││└───┐ │ │ ││└───┐ │ │ ││└───┐ │ │ ││└───┐ │ │ ││└───┐ │ │
│└─┐ │ │ │ │└─┐ │ │ │ │└─┐ │ │ │ │└─┐ │ │ │ │└─┐ │ │ │
(b) 1001001010001000 1001001010001000 1001001010001000 1001001010001000 1001001010001000
(c) 1000000000000000 0001000000000000 0000001000000000 0000000010000000 0000000000001000
(d) 1 *** 4 ****** 7 ******** 9 ************ 13
Splitting a bit-string to odd and even bits
In last week’s post, I explained that in order to create the rank-select bit-strings for an RFC compliant CSV format, I needed an operation that can take a bit-string and collect all the odd bits into one bit-string and all the even-bits into another.
The example I gave is reproduced here:
text: aaa,bbb,ccc␍␊"a""aa","b␍␊bb","c,cc"
quotes: 00000000000001011001010000010100001
parens: 0000000000000(0)(00)0(00000)0(0000)
enters: 00000000000001001000010000000100000
exits: 00000000000000010001000000010000001
In this post I will be using odds
in place of enters
because these are the odd
bits and evens
in place of exits
because these are the even
bits in our bit-string.
The opening parentheses (
represents all the 1-bits that opening quotes of a quoted string and the closing parentheses )
represents all the 1-bits that represents for the closing quotes.
We need to build odds
which has all the odd 1-bits (ie. the opening quotes) and leaves
which has all the even 1-bits (ie. the closing quotes).
We must somehow be able to produce the bit-strings odds
and evens
from the bit-string quotes
very efficiently.
This is actually very easy with a single application of the pdep
operation for each bit-string we produce:
~/wrk/haskell-works/hw-rankselect-base $
$ stack repl
λ> import Data.Bits.Pdep
λ> let bs = fromJust $ bitRead "00000000000001011001010000010100001000000000" :: Word64
λ> let odds = fromJust $ bitRead "10101010101010101010101010101010101010101010" :: Word64
λ> let evens = fromJust $ bitRead "01010101010101010101010101010101010101010101" :: Word64
λ> bitShow bs
"00000000 00000101 10010100 00010100 00100000 00000000 00000000 00000000"
λ> bitShow $ pdep odds bs
"00000000 00000100 10000100 00000100 00000000 00000000 00000000 00000000"
λ> bitShow $ pdep evens bs
"00000000 00000001 00010000 00010000 00100000 00000000 00000000 00000000"
I will leave it to the reader to work out why this works. 😉
Next steps
We now have very fast rank-select operations for short bit-vectors of 64-bits, which is sufficient for CSV streaming because it allows us to process 64 bytes of CSV text at a time.
We also have the ability to split the odds and even bits out of our bit-string into separate bit-strings.
All the conceptual pieces needed to produce the necessary rank-select bit-strings for our high-performance RFC compliant CSV parser and pieces need to subsequently traverse the CSV text and extract interesting data have been described.
In my next post I will talk about how SIMD instructions can be used to make our parser go even faster!
Stay tuned!