performance of avx512 bit ops and popcounts

Question

performance of avx512 bit ops and popcounts

brentp opened this issue 5 years ago · 4 comments

as requested, I am opening an issue.
somalier calculates relatedness between pairs of samples using bitwise operations and popcounts here

where genotypes is effectively:

type genotypes* = tuple[hom_ref:seq[uint64], het:seq[uint64], hom_alt:seq[uint64]]

and currently, those seqs have a len of about 300.
so for 10K samples, doing relatedness for all-vs-all, it will do ~50 million calls to the IBS function linked above.

the "simple" avx512 version is here which is identical in speed to the version currently in somalier

the unrolled version is here. this gives ~10-15% speedup.

this problem is embarrassingly parallel and it's currently single-threaded, so I could also improve speed that way, but I was interested to explore the simd stuff.

I'd be interested to hear your thoughts, maybe the "parallelization" scheme should be changed to load 8 samples at once instead of current 8 (*64) sites at once.

Answer 1 · 2019-11-07T17:28:39.000Z

Since its integers the compiler has the liberty to reorder lots of things while in floating point it can't because addition is not asociative so the usual suspects that slow down reductions do not exist.

In conclusion like you, I think fast popcnt might be the bottleneck.

Note on multithreading:
Another potential bottleneck is memory because your algorithm needs to iterate
on 6 sequences at the same time. But 300x6 is probably to small to benefits from multithreading see: https://github.com/zy97140/omp-benchmark-for-pytorch as I don't think popcnt is more CPU intensive than exp or sin.

Bench on my CPU

1.355095338 AVX512 got ibs0:21482500000 ibs2:28522500000 hets:36842500000 N:37740000000
1.918461567 got ibs0:21482500000 ibs2:28522500000 hets:36842500000 N:37740000000

Frequencies: scalar code 4.1 GHz, AVX512 3.5 Ghz (liquid cooling, I think for air-cooled CPU you are probably looking at a 1GHz downclocking).

It's important to know if your CPU is a Skylake-X, Xeon-W or Xeon Gold which have 2 AVX512 units per core or a Xeon Bronze / Silver which only have 1.

Now regarding popcnt, I did a bit of research and found the following:

paper: https://arxiv.org/pdf/1611.07612.pdf
bench: http://0x80.pl/articles/sse-popcount.html
repo: https://github.com/WojciechMula/sse-popcount

I've forked the repo so that on AVX512 it doesn't use software emulation: https://github.com/mratsim/sse-popcount/tree/dont-use-avx512-emulation

You can run make run_avx512 after cloning to bench the speed on your computer. Here is mine:

./speed_avx512bw_g++ 1000000 100
lookup-8                              ... time = 0.032190 s
lookup-64                             ... time = 0.025103 s (speedup: 1.28)
bit-parallel                          ... time = 0.025244 s (speedup: 1.28)
bit-parallel-optimized                ... time = 0.016745 s (speedup: 1.92)
bit-parallel-mul                      ... time = 0.015128 s (speedup: 2.13)
bit-parallel32                        ... time = 0.041638 s (speedup: 0.77)
bit-parallel-optimized32              ... time = 0.030790 s (speedup: 1.05)
harley-seal                           ... time = 0.006844 s (speedup: 4.70)
sse-bit-parallel                      ... time = 0.008196 s (speedup: 3.93)
sse-bit-parallel-original             ... time = 0.008840 s (speedup: 3.64)
sse-bit-parallel-better               ... time = 0.006941 s (speedup: 4.64)
sse-harley-seal                       ... time = 0.003334 s (speedup: 9.65)
sse-lookup                            ... time = 0.004037 s (speedup: 7.97)
sse-lookup-original                   ... time = 0.005569 s (speedup: 5.78)
avx2-lookup                           ... time = 0.002572 s (speedup: 12.52)
avx2-lookup-original                  ... time = 0.003401 s (speedup: 9.46)
avx2-harley-seal                      ... time = 0.001867 s (speedup: 17.24)
cpu                                   ... time = 0.003142 s (speedup: 10.25)
sse-cpu                               ... time = 0.004497 s (speedup: 7.16)
avx2-cpu                              ... time = 0.004806 s (speedup: 6.70)
avx512-harley-seal                    ... time = 0.001112 s (speedup: 28.94)
avx512bw-shuf                         ... time = 0.001707 s (speedup: 18.86)
builtin-popcnt                        ... time = 0.004645 s (speedup: 6.93)
builtin-popcnt32                      ... time = 0.012765 s (speedup: 2.52)
builtin-popcnt-unrolled               ... time = 0.003129 s (speedup: 10.29)
builtin-popcnt-unrolled32             ... time = 0.007560 s (speedup: 4.26)
builtin-popcnt-unrolled-errata        ... time = 0.003150 s (speedup: 10.22)
builtin-popcnt-unrolled-errata-manual ... time = 0.003153 s (speedup: 10.21)
builtin-popcnt-movdq                  ... time = 0.004647 s (speedup: 6.93)
builtin-popcnt-movdq-unrolled         ... time = 0.003644 s (speedup: 8.83)
builtin-popcnt-movdq-unrolled_manual  ... time = 0.004313 s (speedup: 7.46)

Then in terms of implementations, given that there are many variable to track I expect that for your need AVX512 has a lot of advantages over AVX2 even with the downclocking due to 32 AVX512 registers versus only 16 for AVX2. The fastest popcnt on my machine, Harley Seal, needs 12 registers per popcnt https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L39-L45

Details of the implementations

Assuming Harley seal is also the fastest on your machine here are some details:

The wrapping function

The popcnt is done 64 by 64
https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L88

And the remainder byte by byte: https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L90-L91

The vectorized loop

Here we stay in the m512i domain including for total, you want to have your for loop here:

https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L38-L69

This is unrolled by 16

So you need 4 AVX512 running totals.

Sum reduction of the running total
The running total is reduced (following some algorithm probably in the paper), happening here:
https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L71-L75
Then for the up to 15 remaining items we do the byte by byte popcount
https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L77-L78
Then the reduction of all the totals

https://github.com/WojciechMula/sse-popcount/blob/201fadc4580f785e394575a4af7d475135d4f7b1/popcnt-avx512-harley-seal.cpp#L81

Answer 2 · 2019-11-07T21:09:01.000Z

wow. thanks for taking a look. here is what I see:

lookup-8                              ... time = 0.067491 s
lookup-64                             ... time = 0.056205 s (speedup: 1.20)
bit-parallel                          ... time = 0.050526 s (speedup: 1.34)
bit-parallel-optimized                ... time = 0.032638 s (speedup: 2.07)
bit-parallel-mul                      ... time = 0.030241 s (speedup: 2.23)
bit-parallel32                        ... time = 0.081122 s (speedup: 0.83)
bit-parallel-optimized32              ... time = 0.060367 s (speedup: 1.12)
harley-seal                           ... time = 0.013470 s (speedup: 5.01)
sse-bit-parallel                      ... time = 0.015314 s (speedup: 4.41)
sse-bit-parallel-original             ... time = 0.018955 s (speedup: 3.56)
sse-bit-parallel-better               ... time = 0.013494 s (speedup: 5.00)
sse-harley-seal                       ... time = 0.006740 s (speedup: 10.01)
sse-lookup                            ... time = 0.008216 s (speedup: 8.21)
sse-lookup-original                   ... time = 0.012974 s (speedup: 5.20)
avx2-lookup                           ... time = 0.005621 s (speedup: 12.01)
avx2-lookup-original                  ... time = 0.007585 s (speedup: 8.90)
avx2-harley-seal                      ... time = 0.004249 s (speedup: 15.89)
cpu                                   ... time = 0.006446 s (speedup: 10.47)
sse-cpu                               ... time = 0.008706 s (speedup: 7.75)
avx2-cpu                              ... time = 0.010263 s (speedup: 6.58)
avx512-harley-seal                    ... time = 0.002780 s (speedup: 24.28)
avx512bw-shuf                         ... time = 0.004230 s (speedup: 15.95)
builtin-popcnt                        ... time = 0.009166 s (speedup: 7.36)
builtin-popcnt32                      ... time = 0.024966 s (speedup: 2.70)
builtin-popcnt-unrolled               ... time = 0.006413 s (speedup: 10.52)
builtin-popcnt-unrolled32             ... time = 0.014971 s (speedup: 4.51)
builtin-popcnt-unrolled-errata        ... time = 0.006482 s (speedup: 10.41)
builtin-popcnt-unrolled-errata-manual ... time = 0.006547 s (speedup: 10.31)
builtin-popcnt-movdq                  ... time = 0.009178 s (speedup: 7.35)
builtin-popcnt-movdq-unrolled         ... time = 0.007436 s (speedup: 9.08)
builtin-popcnt-movdq-unrolled_manual  ... time = 0.008399 s (speedup: 8.04)

As I understand it, the harley-seal will require a substantial change to the code. And I haven't figure out how to keep an m512i accumulating into an array (I think that's what you mean). I guess because the size is unknown... I'll revisit next week. Thanks again.

Answer 3 · 2019-11-08T10:21:58.000Z

You don't need to invent anything, the algorithm accumulate in one total already. You just need to have 4 variables vibs0, vibs2, vnhets, vnN instead of a single total.

Answer 4 · 2019-11-10T09:31:55.000Z

Another biologist exploring the popcount rabbit hole: http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html