Request: target AVX512

Question

HJLebbink opened this issue 7 years ago · 2 comments

Maybe a strange request: could you make a target for AVX512; that is, could you generate something like:

    // code=0x02, function=(C and (B nor A)), lowered=((B or A) notand C), set=intel
    template<> inline __m512i ternary<0x02>(const __m512i A, const __m512i B, const __m512i C) {
        const __m512i t0 = _mm512_or_si512(B, A);
        const __m512i t1 = _mm512_andnot_si512(t0, C);
        return t1;
    }

I would have done it myself, but I can't seem to understand the Python code...

Why would one need this: vpternlog has lower throughput than 'ands', 'ors' and 'xors' on my Skylake X. If one optimizes for speed (and throughput), vpternlog may not always be the best choice.

Answer 1 · 2018-01-05T18:21:02.000Z

Done, please check it out.

That's interesting, would you share some benchmarks?

Answer 2 · 2018-01-08T09:36:33.000Z

It is more complex to measure speed differences correctly than I hoped for.

The (minor) speed increase I experienced in my application is probably due to the destination register in vpternlog also being used as an input. For example, I assumed that BF3[0xA] = (C and not (A)) is not faster with _mm512_andnot_si512(A, C) than with _mm512_ternarylogic_epi32(A, B, C, 0xA), but I suspect the vpternlog version has lower throughput; I cannot measure it correctly yet. Once I have figured out how to determine which is better, I'll let you know.
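
For anyone who wants to sanity-check the two lowerings mentioned in this thread without AVX512 hardware, here is a minimal scalar sketch. It assumes the usual vpternlog convention in which, for each bit position, the input bits select bit (A<<2 | B<<1 | C) of the 8-bit immediate. The names `ternlog_model`, `lowered_0x02`, `lowered_0x0A` and `lowered_forms_match` are hypothetical and are not part of the generator. It only checks logical equivalence and says nothing about throughput.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a ternary-logic instruction (assumed convention):
// per bit position, index = (A<<2 | B<<1 | C) selects one bit of imm.
inline std::uint64_t ternlog_model(std::uint64_t a, std::uint64_t b,
                                   std::uint64_t c, std::uint8_t imm) {
    std::uint64_t r = 0;
    for (int i = 0; i < 64; ++i) {
        const unsigned idx = static_cast<unsigned>(
            (((a >> i) & 1u) << 2) | (((b >> i) & 1u) << 1) | ((c >> i) & 1u));
        r |= static_cast<std::uint64_t>((imm >> idx) & 1u) << i;
    }
    return r;
}

// code=0x02: (B or A) notand C, i.e. what or + andnot would compute.
inline std::uint64_t lowered_0x02(std::uint64_t a, std::uint64_t b,
                                  std::uint64_t c) {
    return ~(b | a) & c;
}

// code=0x0A: C and not A (B is ignored), i.e. a single andnot.
inline std::uint64_t lowered_0x0A(std::uint64_t a, std::uint64_t /*b*/,
                                  std::uint64_t c) {
    return ~a & c;
}

// Compares the model against both lowered forms on a few inputs.
// 0xF0 / 0xCC / 0xAA enumerate all eight (A, B, C) combinations in
// the low byte; the remaining values are arbitrary.
inline bool lowered_forms_match() {
    const std::uint64_t v[] = {0xF0u, 0xCCu, 0xAAu, 0u, ~0ull,
                               0x0123456789ABCDEFull, 0xDEADBEEFCAFEF00Dull};
    for (std::uint64_t a : v)
        for (std::uint64_t b : v)
            for (std::uint64_t c : v) {
                if (ternlog_model(a, b, c, 0x02) != lowered_0x02(a, b, c)) return false;
                if (ternlog_model(a, b, c, 0x0A) != lowered_0x0A(a, b, c)) return false;
            }
    return true;
}
```

Calling `lowered_forms_match()` from a small `main` is enough to confirm that both forms agree with the immediate; comparing speed would still need a separate benchmark on the target CPU.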