Introduction to SIMD with linecount

Posted on September 3, 2018 by John Ky

In a previous post I talked about using broadword techniques to create a rank-select bit-string from text.

This post will explore using Single Instruction, Multiple Data (SIMD) instructions to achieve the same thing.

The exploration will benchmark the SIMD, broadword and naive implementations of line count to illustrate why SIMD instructions are important to improving parsing performance.

Creating a rank-select bit-string with SIMD

Recall from the earlier post that we can build a rank-select bit-string from a piece of text like this:

"name","age","profession"␤John,30,Code Monkey␤Kyle,40,Data Scrubber
0000000000000000000000000100000000000000000001000000000000000000000

Here, I mark every bit that corresponds to a newline character at the corresponding position in the text.

From this bit-string, we can use the popCount operation to add up the number of 1-bits in the string to arrive at the line count for our text.

Recall that we were able to use broadword programming techniques to do this conversion:

This approach allowed us to process 8 bytes at a time at the cost of three OR (.|.), three SHIFT (.>.) and one pext operation, all of which are very cheap.

Whilst broadword techniques allowed us to process 8 bytes at a time, the SIMD instructions available on some generations of CPUs can help us build our rank-select bit-string in a way that lets us process 16, 32 or even 64 bytes at a time.

On my Macbook, I can run a command to determine what features my CPU has:

$ sysctl -a | grep cpu | grep features:
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI

From this list, I know that my CPU supports the AVX2 instruction set, which makes available CPU intrinsics such as _mm256_cmpeq_epi8 and _mm256_movemask_epi8.

With 256-bit registers, we’re in a position to improve our parallelism from 8 to 32 bytes at a time.

Now let’s start indexing our text.

In C, we first need to initialise a SIMD register to contain 32 copies of our delimiter byte (in this case newline characters):

then we loop over the bytes 32-bytes at a time and perform a parallel comparison on each chunk:

The above code compares each byte in ws_newlines with the corresponding byte in (__m256i*)text.

matches_bytes will contain 32 bytes, each of which will hold one of two values: 0x00 or 0xff, depending on whether the corresponding byte in (__m256i*)text was equal to the corresponding byte in ws_newlines.

The _mm256_movemask_epi8 function is then used to compress this result by taking the high bit of each byte in matches_bytes and packing them into matches_bits.

Compared to the broadword implementation, this reduces the number of load/stores by a factor of 8 and the number of register instructions for a 32-byte chunk from 56 to 2.

Benchmarking

To look at the potential performance gains from using SIMD versus other alternatives, I’ve written three versions of a tool that counts the number of lines in a text file: naive, broadword and simd.

Naive

The naive version does nothing more than traverse the text byte-by-byte, comparing each byte against the newline character and incrementing a count on each match:

Broadword

The broadword version uses the XOR operation to perform parallel comparison of all 8 bytes in the word with one instruction and relies on a small number of SHIFTS (>>) and ANDS (&) to convert the result into the bit-string we need:

The function falls back to byte-by-byte comparison if there are any bytes left that cannot fill a 64-bit word.

SIMD

The SIMD version requires a register to be initialised with our delimiter replicated to all bytes of the SIMD register, passed in via the ws_newlines argument, using the _mm256_set1_epi8 intrinsic.

This value is then compared to the text 32 bytes at a time with the _mm256_cmpeq_epi8 intrinsic and summarised into a bit-string with the _mm256_movemask_epi8 intrinsic.

The function falls back to byte-by-byte comparison if there are any bytes left that cannot fill a 256-bit SIMD register.

Initial benchmark results

I benchmark the three versions of the code in C as well as wc -l as a baseline.

The source code for the benchmarks can be found here and the results here.

The code is compiled with -mavx2 -mbmi2 to enable access to the SIMD and bit manipulation instructions I need.

wc -l                   0m5.458s
./naive.auto.out        0m2.654s
./broadword.auto.out    0m2.963s
./simd.auto.out         0m1.759s

I find that wc -l is the slowest, the simd implementation the fastest, and the broadword implementation somewhere in between.

That much is as expected.

The surprise here is that the naive version runs faster than the broadword version.

What is going on here?

Before jumping to conclusions, let’s take a look at the assembly code generated for the naive version:

_process_data:                          ## @process_data
    .cfi_startproc
...
## BB#8:
    leaq            -1(%r8), %rdx
    subq            %rax, %rdx
    vpxor           %ymm8, %ymm8, %ymm8
    xorl            %eax, %eax
    vpbroadcastd    LCPI0_0(%rip), %xmm4 ## xmm4 = [10,10,10,10]
    vpbroadcastq    LCPI0_1(%rip), %ymm5 ## ymm5 = [1,1,1,1]
    vpxor           %ymm9, %ymm9, %ymm9
    vpxor           %ymm2, %ymm2, %ymm2
    vpxor           %ymm3, %ymm3, %ymm3
    .p2align 4, 0x90
...
    .cfi_endproc

If we look up the Intel documentation for the vpbroadcastq and vpxor instructions, we find that they are SIMD instructions and that gcc has auto-vectorised the naive implementation.

The naive implementation was in fact so elaborately auto-vectorised that a 14-line C function became 152 lines of assembly code.

If we deny gcc the freedom to auto-vectorise the code with the -fno-tree-vectorize flag, we get 71 lines of assembly instead and suffer far worse results:

./naive.out           0m4.659s
./broadword.out       0m2.960s
./simd.out            0m1.703s
wc -l                 0m5.458s

GHC currently does not perform any auto-vectorisation, so I’d expect the naive version, when written in Haskell, to perform no better than the naive C implementation without auto-vectorisation.

Closing remarks

The benchmarks make a compelling case for using SIMD instructions wherever possible, and broadword techniques where the target CPU architecture does not support the SIMD instructions we need.

Unfortunately, GHC does not have native support for the SIMD instructions we need.

In a future post I’ll look at using these SIMD instructions from GHC via the Foreign Function Interface (FFI), and at addressing some of the challenges of using SIMD with Haskell’s lazy IO.