/fast-crc32

Fast CRC32 implementations

Primary LanguageCMIT LicenseMIT

This repository:

  • Establishes a terse syntax for describing a broad range of CRC32 implementations.
  • Contains a code generator for turning terse descriptions into executable C code.
  • Contains benchmarking apparatus for CRC32 implementations, along with a range of benchmark results.
  • Presents some novel CRC32C implementations which are faster than the previous state of the art.

Running make will generate and compile and benchmark a handful of different implementations. Alternatively, run make generate and then use ./generate to generate a particular implementation.

Terse syntax for describing CRC32 implementations

Step 1: Instruction set (-i)

Terse syntax gcc equivalent Example CPUs
-i neon aarch64 -march=armv8-a+crypto+crc Apple M1, GCP Tau T2A (Ampere Altra Arm)
-i neon_eor3 aarch64 -march=armv8.2-a+crypto+sha3 Apple M1
-i sse x86_64 -msse4.2 -mpclmul Any Intel or AMD CPU from the last 10 years
-i avx512 x86_64 -mavx512f -mavx512vl Intel Skylake X and newer, AMD Zen 4
-i avx512_vpclmulqdq x86_64 -mavx512f -mavx512vl -mvpclmulqdq Intel Ice Lake and newer, AMD Zen 4

For CRC32, the key feature required of any instruction set is a vector carryless multiplication instruction, e.g. pclmulqdq on x86_64 or pmull on aarch64. Three-way exclusive-or is also useful, e.g. vpternlogq on x86_64 or eor3 on aarch64.

Step 2: Choice of polynomial (-p)

Terse syntax Polynomial Hardware acceleration Example applications
-p crc32 0x04C11DB7 aarch64 Ethernet, SATA, zlib, png
-p crc32c 0x1EDC6F41 aarch64, x86_64 iSCSI, Btrfs, ext4
-p crc32k 0x741B8CD7 no
-p crc32q 0x814141AB no AIXM
-p 0x87654321 0x87654321 no none

Most new applications should probably use -p crc32c. A large number of existing applications use -p crc32. If you really care about this, then you can also specify an arbitrary polynomial.

Some instruction sets contain scalar instructions for accelerating particular CRC32 polynomials. Notably, aarch64 has scalar acceleration for -p crc32 and -p crc32c, whereas x86_64 only has scalar acceleration for -p crc32c.

Step 3: Algorithm string (-a)

Six different parameters control the generated algorithm:

  1. Number of vector accumulators to use (16 or 64 bytes each).
  2. Number of vector loads to perform per iteration (defaults to number of vector accumulators, and must be a multiple of the number of vector accumulators).
  3. Number of scalar accumulators to use (4 bytes each).
  4. Number of scalar 8-byte loads to perform per iteration (defaults to number of scalar accumulators, and must be a multiple of the number of scalar accumulators).
  5. The outer loop size, if any.
  6. Whether inner loop termination is length-based or pointer-based.

The number of vector accumulators is specified by the letter v, followed by a decimal number. If the number of vector loads per iteration is non-default, this is followed by the letter x, followed by the ratio of loads to accumulators as a decimal number. For example, v4 denotes four vector accumulators and four vector loads per iteration, and v3x2 denotes three vector accumulators and six vector loads per iteration.

The number of scalar accumulators is specified in a similar way, but with s instead of v. For example, s3 denotes three scalar accumulators and three scalar loads per iteration, and s5x3 denotes five scalar accumulators and 15 scalar loads per iteration.

An outer loop size (in bytes) can be specified by the letter k, followed by a decimal number. Using an outer loop slightly decreases the complexity of the inner loop (because the inner loop trip count becomes constant), but makes the implementation less well suited to input sizes that aren't a multiple of the outer loop size.

The inner loop condition can be based on the remaining length (the default), or based on pointer comparisons (specify the letter e). Length-based is easier for humans to comprehend. Pointer-based can make the inner loop slightly simpler, at the expense of slightly more logic outside the loop.

The various parameters are concatenated to give the algorithm string, for example:

  • -a v4 to use four vector registers.
  • -a v4x2s3x5k4096e to use four vector registers (8 vector loads per iteration), three scalar registers (15 scalar loads per iteration), outer block size 4096 bytes, and pointer-based loop termination.

Multiple sets of parameters can be separated by an _ character, for example -a v4_v1 uses a v4 algorithm for as long as possible, then switches to v1 for any remaining bytes. If there are still any remaining bytes, then an implicit _s1 is appended to deal with them.

Benchmark results

Apple M1 performance (single core)

Implementation Speed Our equivalent Speed
Chromium scalar 25.03 GB/s -i neon -p crc32 -a s3k95760_s3 25.50 GB/s
Chromium vector 26.57 GB/s -i neon -p crc32 -a v4_v1 34.03 GB/s
dougallj Faster CRC32 77.55 GB/s -i neon -p crc32 -a v12e_v1 77.69 GB/s
fusion-inspired N/A -i neon_eor3 -p crc32 -a v9s3x2e_s3 85.05 GB/s

To reproduce these measurements:

$ make bench autobench
$ make -C third_party chromium_neon_crc32_s3k95760_s3.dylib chromium_neon_crc32_v4_v1.dylib dougallj_neon_crc32_v12e_v1.dylib
$ ./bench -r 40 third_party/chromium_neon_crc32_s3k95760_s3.dylib third_party/chromium_neon_crc32_v4_v1.dylib third_party/dougallj_neon_crc32_v12e_v1.dylib
$ ./autobench -r 40 -i neon -p crc32 -a s3k95760_s3,v4_v1,v12e_v1 -i neon_eor3 -a v9s3x2e_s3

Rich microarchitecture details are available, though the key details for CRC32 distill down to the following table:

Primitive Uops Load Scalar Vector Bytes Score Latency
ldr, pmull+eor, pmull2+eor 5/8 1/3 or 0.5/3 2/4 16 25.6 6
ldr, pmull, pmull2, eor3 4/8 1/3 or 0.5/3 3/4 16 21.3 5
ldr, pmull, pmull2, eor, eor 5/8 1/3 or 0.5/3 4/4 16 16.0 7
ldr, crc32x 2/8 1/3 or 0.5/3 1/1 8 8.0 3

Score is computed as Bytes / max(Uops, Load, Scalar, Vector), and gives the optimal bytes per cycle that we can hope to achieve with the corresponding primitive. The compiler might fuse two consecutive ldr instructions into a single ldp instruction, which reduces Load from 1/3 to 0.5/3, but does not decrease Uops. The CPU can fuse eor into an immediately preceding pmull or pmull2, which decreases the demand on the vector execution units, but such fused instructions still count as two for Uops. Chromium's scalar implementation is a decent application of the ldr, crc32x primitive, with three scalar accumulators sufficient to match the latency of 3. Chromium's vector implementation is a poor application of the ldr, pmull, pmull2, eor, eor primitive, as the four vector accumulators are insufficient to match the latency of 7. Our v4_v1 does better by using the ldr, pmull+eor, pmull2+eor primitive, but the four vector accumulators are still insufficient to match the latency of 6. Dougallj does a good application of the ldr, pmull+eor, pmull2+eor primitive, using 12 vector accumulators to match the latency. The fun twist is the ldr, pmull, pmull2, eor3 primitive; it scores worse than ldr, pmull+eor, pmull2+eor, so if just one primitive is used, ldr, pmull, pmull2, eor3 will lose out to ldr, pmull+eor, pmull2+eor, but ldr, pmull, pmull2, eor3 is less demanding on Uops, so a blend of ldr, pmull, pmull2, eor3 with ldr, crc32x can come out ahead. The v9s3x2e_s3 implementation blends nine copies of ldr, pmull, pmull2, eor3 with six copies of ldr, crc32x, which gives Uops = 48/8, Scalar = 6/1, Vector = 27/4, Bytes = 192, and thus a score of 28.4, which puts it just ahead of ldr, pmull+eor, pmull2+eor.

GCP Tau T2A (Ampere Altra Arm) performance (single core)

Implementation Speed Our equivalent Speed
Chromium scalar 22.68 GB/s -i neon -p crc32 -a s3k95760_s3 23.57 GB/s
Chromium vector 16.91 GB/s -i neon -p crc32 -a v4_v1 20.94 GB/s
dougallj Faster CRC32 17.41 GB/s -i neon -p crc32 -a v12e_v1 21.81 GB/s
fusion-inspired N/A -i neon -p crc32 -a v3s4x2e_v2 35.87 GB/s

To reproduce these measurements:

$ make bench autobench
$ make -C third_party chromium_neon_crc32_s3k95760_s3.so chromium_neon_crc32_v4_v1.so dougallj_neon_crc32_v12e_v1.so
$ ./bench -r 40 third_party/chromium_neon_crc32_s3k95760_s3.so third_party/chromium_neon_crc32_v4_v1.so third_party/dougallj_neon_crc32_v12e_v1.so
$ ./autobench -r 40 -i neon -p crc32 -a s3k95760_s3,v4_v1,v12e_v1,v3s4x2e_v2

The CPU datasheet says:

Four-wide superscalar aggressive out-of-order execution CPU

Dual full-width (128-bit wide) SIMD execution pipes

Meanwhile, microarchitecture measurements suggest that:

  • crc32x / crc32cx have latency 2, throughput 1
  • pmull has latency 2, throughput 1
  • vector eor has latency 2, throughput 0.5

These latency and throughput numbers are consistent with Chromium scalar being faster than Chromium vector. The optimal fusion also involves more bytes being processed by the scalar pipes (64 bytes per iteration) than the vector pipes (48 bytes per iteration).

x86_64 Cascade Lake performance (single core)

Relevant /proc/cpuinfo of test machine:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
stepping        : 7
microcode       : 0x5003006
cpu MHz         : 2199.998
cache size      : 16384 KB
Implementation Speed Our equivalent Speed
Chromium vector 15.87 GB/s -i sse -p crc32 -a v4_v1 16.48 GB/s
crc32_4k_three_way 15.04 GB/s -i sse -p crc32c -a s3k4096e 15.11 GB/s
crc32_4k_pclmulqdq 15.78 GB/s -i sse -p crc32c -a v4k4096e 15.63 GB/s
retuned crc32_4k_pclmulqdq N/A -i sse -p crc32c -a v4e 16.03 GB/s
AVX512 crc32_4k_pclmulqdq N/A -i avx512 -p crc32c -a v4e 17.36 GB/s
crc32_4k_fusion 25.32 GB/s -i sse -p crc32c -a v4s3x3k4096e 25.42 GB/s
retuned crc32_4k_fusion N/A -i sse -p crc32c -a v8s3x3 28.55 GB/s
AVX512 crc32_4k_fusion N/A -i avx512 -p crc32c -a v9s3x4e 31.55 GB/s

To reproduce these measurements:

$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,v4k4096e,v4e -i avx512 -a v4e -i sse -a v4s3x3k4096e,v8s3x3 -i avx512 -a v9s3x4e

Note that Cascade Lake lacks VPCLMULQDQ, so the only useful part of AVX512 is VPTERNLOGQ, hence the only very limited performance increase from AVX512 here.

x86_64 Ice Lake performance (single core)

Relevant /proc/cpuinfo of test machine:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 106
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
stepping        : 6
microcode       : 0xffffffff
cpu MHz         : 2600.028
cache size      : 55296 KB
Implementation Speed Our equivalent Speed
Chromium vector 24.08 GB/s -i sse -p crc32 -a v4_v1 24.02 GB/s
Chromium vector AVX512 41.00 GB/s -i avx512_vpclmulqdq -p crc32 -a v4_v1 47.31 GB/s
crc32_4k_three_way 19.68 GB/s -i sse -p crc32c -a s3k4096e 19.80 GB/s
retuned crc32_4k_three_way N/A -i sse -p crc32c -a s3 22.50 GB/s
crc32_4k_pclmulqdq 22.80 GB/s -i sse -p crc32c -a v4k4096e 22.80 GB/s
retuned crc32_4k_pclmulqdq N/A -i sse -p crc32c -a v4e 24.05 GB/s
AVX512 crc32_4k_pclmulqdq N/A -i avx512_vpclmulqdq -p crc32c -a v4x2e 47.50 GB/s
crc32_4k_fusion 35.12 GB/s -i sse -p crc32c -a v4s3x3k4096e 34.72 GB/s
retuned crc32_4k_fusion N/A -i sse -p crc32c -a v7s3x3 42.58 GB/s
AVX512 crc32_4k_fusion N/A -i avx512_vpclmulqdq -p crc32c -a v4s5x3 63.98 GB/s

To reproduce these measurements:

$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e,v4e -i avx512_vpclmulqdq -a v4e -i sse -a v4s3x3k4096e,v7s3x3 -i avx512_vpclmulqdq -a v4s5x3

Ice Lake introduces VPCLMULQDQ, but VPCLMULQDQ on 64-byte registers is 1*p05+2*p5 (versus regular PCLMULQDQ on 16-byte registers being 1*p5), so going from 16-byte to 64-byte registers only gives a ~2x speedup. Chromium vector AVX512 leaves some performance on the table due to not using VPTERNLOGQ.

x86_64 Sapphire Rapids performance (single core)

Relevant /proc/cpuinfo of test machine:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 143
model name      : Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
stepping        : 8
microcode       : 0xffffffff
cpu MHz         : 2699.998
cache size      : 107520 KB
Implementation Speed Our equivalent Speed
Chromium vector 23.42 GB/s -i sse -p crc32 -a v4_v1 23.31 GB/s
Chromium vector AVX512 65.20 GB/s -i avx512_vpclmulqdq -p crc32 -a v4_v1 94.00 GB/s
(with aligned inputs) 81.68 GB/s -i avx512_vpclmulqdq -p crc32 -a v4_v1 94.00 GB/s
crc32_4k_three_way 17.45 GB/s -i sse -p crc32c -a s3k4096e 17.59 GB/s
retuned crc32_4k_three_way N/A -i sse -p crc32c -a s3 23.84 GB/s
crc32_4k_pclmulqdq 22.79 GB/s -i sse -p crc32c -a v4k4096e 22.79 GB/s
retuned crc32_4k_pclmulqdq N/A -i sse -p crc32c -a v4e 23.29 GB/s
AVX512 crc32_4k_pclmulqdq N/A -i avx512_vpclmulqdq -p crc32c -a v4e 94.84 GB/s
crc32_4k_fusion 33.87 GB/s -i sse -p crc32c -a v4s3x3k4096e 34.68 GB/s
retuned crc32_4k_fusion N/A -i sse -p crc32c -a v8s3x3 36.98 GB/s
AVX512 crc32_4k_fusion N/A -i avx512_vpclmulqdq -p crc32c -a v3s1_s3 97.30 GB/s

To reproduce these measurements:

$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e,v4e -i avx512_vpclmulqdq -a v4e -i sse -a v4s3x3k4096e,v8s3x3 -i avx512_vpclmulqdq -a v3s1_s3

Sapphire Rapids improves VPCLMULQDQ on 64-byte registers to 1*p5, meaning that AVX512 implementations can approach 4x the performance of implementations using 16-byte vectors (and slightly exceed 4x when also gaining VPTERNLOGQ). One catch is that we now observe a 20% performance penalty for unaligned 64-byte loads, which the Chromium code doesn't take care to handle.

x86_64 AMD Rome performance (single core)

Relevant /proc/cpuinfo of test machine:

vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7B12
stepping        : 0
microcode       : 0xffffffff
cpu MHz         : 2249.998
cache size      : 512 KB
Implementation Speed Our equivalent Speed
Chromium vector 12.89 GB/s -i sse -p crc32 -a v4_v1 12.84 GB/s
crc32_4k_three_way 23.52 GB/s -i sse -p crc32c -a s3k4096e 23.08 GB/s
crc32_4k_pclmulqdq 12.86 GB/s -i sse -p crc32c -a v4k4096e 12.89 GB/s
crc32_4k_fusion 23.84 GB/s -i sse -p crc32c -a v4s3x3k4096e 22.62 GB/s
retuned crc32_4k_fusion N/A -i sse -p crc32c -a v1s3x3 31.16 GB/s

To reproduce these measurements:

$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -p crc32c -a s3k4096e,v4k4096e,v4s3x3k4096e,v1s3x3

x86_64 AMD Milan performance (single core)

Relevant /proc/cpuinfo of test machine:

vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7B13
stepping        : 0
microcode       : 0xffffffff
cpu MHz         : 2450.000
cache size      : 512 KB
Implementation Speed Our equivalent Speed
Chromium vector 12.82 GB/s -i sse -p crc32 -a v4_v1 12.85 GB/s
crc32_4k_three_way 22.75 GB/s -i sse -p crc32c -a s3k4096e 23.55 GB/s
crc32_4k_pclmulqdq 12.76 GB/s -i sse -p crc32c -a v4k4096e 12.86 GB/s
crc32_4k_fusion 23.41 GB/s -i sse -p crc32c -a v4s3x3k4096e 22.64 GB/s
retuned crc32_4k_fusion N/A -i sse -p crc32c -a v1s4x2 31.76 GB/s

To reproduce these measurements:

$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -p crc32c -a s3k4096e,v4k4096e,v4s3x3k4096e,v1s4x2

x86_64 AMD Genoa performance (single core)

Relevant /proc/cpuinfo of test machine:

vendor_id       : AuthenticAMD
cpu family      : 25
model           : 17
model name      : AMD EPYC 9B14
stepping        : 1
microcode       : 0xffffffff
cpu MHz         : 2599.998
cache size      : 1024 KB
Implementation Speed Our equivalent Speed
Chromium vector 13.73 GB/s -i sse -p crc32 -a v4_v1 13.72 GB/s
Chromium vector AVX512 51.70 GB/s -i avx512_vpclmulqdq -p crc32 -a v4_v1 54.00 GB/s
crc32_4k_three_way 26.27 GB/s -i sse -p crc32c -a s3k4096e 26.56 GB/s
retuned crc32_4k_three_way N/A -i sse -p crc32c -a s3 27.32 GB/s
crc32_4k_pclmulqdq 13.73 GB/s -i sse -p crc32c -a v4k4096e 13.74 GB/s
AVX512 crc32_4k_pclmulqdq N/A -i avx512_vpclmulqdq -p crc32c -a v3x2 54.77 GB/s
crc32_4k_fusion 28.47 GB/s -i sse -p crc32c -a v4s3x3k4096e 28.56 GB/s
retuned crc32_4k_fusion N/A -i sse -p crc32c -a v1s3x2 36.20 GB/s
AVX512 crc32_4k_fusion N/A -i avx512_vpclmulqdq -p crc32c -a v3s2x4 71.95 GB/s

To reproduce these measurements:

$ make bench autobench
$ make -C third_party chromium_sse_crc32_v4_v1.so chromium_avx512_vpclmulqdq_crc32_v4_v1.so corsix4k_sse_crc32c_s3k4096e.so corsix4k_sse_crc32c_v4k4096e.so corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./bench -r 40 third_party/chromium_sse_crc32_v4_v1.so third_party/chromium_avx512_vpclmulqdq_crc32_v4_v1.so third_party/corsix4k_sse_crc32c_s3k4096e.so third_party/corsix4k_sse_crc32c_v4k4096e.so third_party/corsix4k_sse_crc32c_v4s3x3k4096e.so
$ ./autobench -r 40 -i sse -p crc32 -a v4_v1 -i avx512_vpclmulqdq -p crc32 -a v4_v1 -i sse -p crc32c -a s3k4096e,s3,v4k4096e -i avx512_vpclmulqdq -a v3x2 -i sse -a v4s3x3k4096e,v1s3x2 -i avx512_vpclmulqdq -a v3s2x4

Related reading

Background mathematics:

CRC implementations: