AVX2 build is now broken on GCC 4.9.2

Question

WojciechMula opened this issue 9 years ago · 8 comments

GCC 5.3.1 works fine.

Answer 1 · 2016-11-28T20:20:52.000Z

Hi Wojciech,

I want to test your AVX2 popcount algorithm in my primecount algorithm. For this purpose I have created a new libpopcnt GitHub repository, which I will work on over the next few days.

While looking at your popcnt-avx2-harley-seal.cpp I noticed that you are using the 4th iteration of the Harley-Seal algorithm. Judging by your benchmark results, this algorithm only beats an unrolled POPCNT algorithm for inputs larger than 4096 bytes. It might be worth benchmarking the 3rd iteration of the AVX2 Harley-Seal algorithm: with some luck it would become the faster option starting at 2048 bytes, without deteriorating past 4096 bytes compared to the 4th iteration (because I think the algorithm hits the memory/cache bottleneck there).

Here is an implementation of the 3rd iteration of the Harley-Seal algorithm: https://github.com/kimwalisch/primecount/blob/master/include/popcount.hpp#L161

Best regards,
Kim
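For readers unfamiliar with the technique, here is a minimal scalar sketch of a Harley-Seal popcount built from carry-save adders. It assumes that "3rd iteration" means a three-level adder tree (twos, fours, eights) consuming 8 words per loop pass; that reading is an assumption, and the function name `popcount_harley_seal8` is hypothetical. It is not the code from either repository, and it uses plain 64-bit words rather than AVX2 registers.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Carry-save adder: adds three bit-vectors bitwise.
// l receives the sum bits, h receives the carry bits.
static inline void csa(std::uint64_t& h, std::uint64_t& l,
                       std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    const std::uint64_t u = a ^ b;
    h = (a & b) | (u & c);
    l = u ^ c;
}

// Portable per-word popcount, used only on the carry-save outputs and the tail.
static inline std::uint64_t popcnt64(std::uint64_t x) {
    return std::bitset<64>(x).count();
}

// Hypothetical name: three-level Harley-Seal, 8 words per loop pass.
std::uint64_t popcount_harley_seal8(const std::uint64_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::uint64_t total = 0;
    std::uint64_t ones = 0, twos = 0, fours = 0;
    std::uint64_t eights = 0, twosA = 0, twosB = 0, foursA = 0, foursB = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        csa(twosA, ones, ones, data[i + 0], data[i + 1]);
        csa(twosB, ones, ones, data[i + 2], data[i + 3]);
        csa(foursA, twos, twos, twosA, twosB);
        csa(twosA, ones, ones, data[i + 4], data[i + 5]);
        csa(twosB, ones, ones, data[i + 6], data[i + 7]);
        csa(foursB, twos, twos, twosA, twosB);
        csa(eights, fours, fours, foursA, foursB);
        total += popcnt64(eights);  // one real popcount per 8 input words
    }
    // Each bit left in a counter carries the weight of that counter.
    total = 8 * total + 4 * popcnt64(fours) + 2 * popcnt64(twos) + popcnt64(ones);
    // Leftover words that did not fill a block of 8.
    for (; i < n; ++i) total += popcnt64(data[i]);
    return total;
}
```

A deeper tree (adding a sixteens level) halves the number of real popcounts again but needs 16 words per pass, which is the trade-off being discussed for short versus long inputs.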

On Fri, Oct 14, 2016 at 11:50 PM, Wojciech Muła wrote: Closed #14 via 687a2f8.

Answer 2 · 2016-11-28T21:34:32.000Z

@kimwalisch

I have added a shorter AVX2 HS...

CountOnes/hamming_weight@2ab07ac

Predictably, the result is slower for long arrays. Of course, if you have short arrays, it might be preferable.

Answer 3 · 2016-11-29T07:45:39.000Z

Awesome, thanks!

Answer 4 · 2016-11-29T18:51:52.000Z

@kimwalisch We also tested the 5th iteration, but with AVX512F. It's faster for longer inputs.

Answer 5 · 2016-11-29T22:04:04.000Z

@WojciechMula I cannot find any popcount benchmark results for AVX512. Have you benchmarked it? Is AVX512 popcount faster than AVX2?

Answer 6 · 2016-11-29T22:09:21.000Z

@kimwalisch

> Is AVX512 popcount faster than AVX2?

This question is not well posed. Currently, the only available hardware where AVX512 runs is Knights Landing, and it is a system optimized for AVX-512 execution.

Answer 7 · 2016-11-30T06:45:05.000Z

@kimwalisch We haven't published any benchmarks yet, but AVX512 is certainly faster than AVX2. I will post numbers when I'm back home.

And as Daniel said, AVX512 is the main instruction set on KNL. Many AVX2 instructions that are really fast on Skylake and other popular desktop CPUs are incredibly slow on KNL. Take a look at the latest documents from Agner Fog (http://agner.org/optimize/#manuals) and compare the instruction timings. For example, on Skylake PSHUFB has both a latency and a throughput of 1 cycle, while on KNL it is 11 cycles (and VPSHUFB is two times slower).

Answer 8 · 2016-11-30T19:44:32.000Z

@kimwalisch The metric we're using in the project Daniel linked is CPU cycles per 64-bit word. As for the numbers: for a popcount of 8192 words, the fastest AVX2 procedure runs at 1.12 cycles per word, while AVX512F runs at 0.33. That is more than 3 times faster.
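As a rough illustration of how the metric and the speedup relate, the arithmetic can be sketched as follows. The helper names `cycles_per_word` and `speedup` are hypothetical and are not taken from the benchmark project; only the figures 1.12, 0.33 and 8192 come from the comment above.

```cpp
#include <cassert>
#include <cstddef>

// Cycles per 64-bit word: total measured cycles divided by the words processed.
inline double cycles_per_word(double total_cycles, std::size_t words) {
    assert(words > 0);
    return total_cycles / static_cast<double>(words);
}

// Speedup of a candidate over a baseline, both given in cycles per word
// (lower is better, so the baseline goes in the numerator).
constexpr double speedup(double baseline_cpw, double candidate_cpw) {
    return baseline_cpw / candidate_cpw;
}

// Figures quoted in the thread: AVX2 at 1.12 and AVX512F at 0.33 cycles per word.
static_assert(speedup(1.12, 0.33) > 3.0, "AVX512F is more than 3x faster");
```

With these numbers the ratio is 1.12 / 0.33, roughly 3.4, which matches the "more than 3 times faster" statement.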