WojciechMula/sse-popcount

AVX2 build is now broken on GCC 4.9.2

WojciechMula opened this issue · 8 comments

GCC 5.3.1 works fine.

@kimwalisch

I have added a shorter AVX2 HS...

CountOnes/hamming_weight@2ab07ac

Predictably, the result is slower for long arrays. Of course, if you have short arrays, it might be preferable.

Awesome, thanks!

@kimwalisch We also tested a 5th iteration, but with AVX512F. It's faster for longer inputs.

@WojciechMula I cannot find any popcount benchmark results for AVX512. Have you benchmarked it? Is AVX512 popcount faster than AVX2?

@kimwalisch

> Is AVX512 popcount faster than AVX2?

This question is not well posed. Currently, the only available hardware where AVX512 runs is Knights Landing, and it is a system optimized for AVX-512 execution.

@kimwalisch We haven't published any benchmark yet, but for sure AVX512 is faster than AVX2. Will post numbers when I'm back home.

And as Daniel said, AVX512 is the main instruction set on KNL. Many AVX2 instructions that are really fast on Skylake and other popular desktop CPUs are incredibly slow on KNL. Take a look at the latest documents from Agner Fog http://agner.org/optimize/#manuals and compare the instruction timings. For example, on Skylake PSHUFB has both latency and throughput of 1 cycle; on KNL it is 11 cycles (and VPSHUFB is two times slower).

@kimwalisch The metric we're using in the project Daniel linked is CPU cycles per 64-bit word. Going to the numbers: for a popcount of 8192 words, the fastest AVX2 procedure runs at 1.12 cycles per word, while AVX512F runs at 0.33 cycles per word. That is more than 3 times faster.
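To make the metric concrete, here is a minimal sketch of how cycles per 64-bit word and the resulting speedup can be computed. It is not the benchmark code from either project: `popcount_words`, `cycles_per_word` and `speedup` are hypothetical helper names, the popcount is a portable scalar baseline rather than the AVX2 or AVX512F procedures, and the cycle count is assumed to come from whatever timer the benchmark uses.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable scalar baseline: total number of set bits in an array of 64-bit words.
// This is NOT the AVX2/AVX512F code discussed above, only a reference workload.
std::uint64_t popcount_words(const std::vector<std::uint64_t>& words) {
    std::uint64_t total = 0;
    for (std::uint64_t w : words) {
        total += std::bitset<64>(w).count();
    }
    return total;
}

// Hypothetical helper: the metric itself. `total_cycles` is assumed to be
// measured elsewhere (for example with a cycle counter around the call).
double cycles_per_word(double total_cycles, std::size_t word_count) {
    return total_cycles / static_cast<double>(word_count);
}

// Hypothetical helper: how many times faster `candidate` is than `baseline`,
// both expressed in cycles per word (lower is better).
double speedup(double baseline_cpw, double candidate_cpw) {
    return baseline_cpw / candidate_cpw;
}
```

With the numbers quoted above, `speedup(1.12, 0.33)` comes out at roughly 3.4, which is where "more than 3 times faster" comes from.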