Request: target AVX512
HJLebbink opened this issue · 2 comments
Maybe a strange request: could you make a target for AVX512; that is, could you generate something like:
// code=0x02, function=(C and (B nor A)), lowered=((B or A) notand C), set=intel
template<> inline __m512i ternary<0x02>(const __m512i A, const __m512i B, const __m512i C) {
    const __m512i t0 = _mm512_or_si512(B, A);
    const __m512i t1 = _mm512_andnot_si512(t0, C);
    return t1;
}
I would have done it myself, but I can't seem to understand the Python code...
Why would one need this: vpternlog has lower throughput compared to 'ands', 'ors' and 'xors' on my Skylake X. If one optimizes for speed (and throughput), vpternlog may not always be the best choice.
Done, please check it out.
That's interesting, would you share some benchmarks?
Measuring speed differences correctly is more complex than I had hoped.
The (minor) speed increase I experienced in my application is probably due to the destination register in vpternlog also being used as an input. I assumed, e.g., that BF3[0xA] = (C and not (A)) is not faster with _mm512_andnot_si512(A, C)
compared to _mm512_ternarylogic_epi32(A, B, C, 0xA),
but I suspect it has lower throughput, yet I cannot measure it correctly. Once I have figured out how to determine which is better, I'll let you know.
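Separately from the timing question, the logical equivalence of the two forms can be checked without AVX512 hardware. Below is a minimal scalar sketch, not part of the generator and not a benchmark. The helper names `ternary_model`, `andnot_model` and `check_lowerings` are hypothetical, and the model assumes the usual vpternlog convention that, per bit position, A supplies the most significant bit of the 3-bit index into the immediate, followed by B, then C.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of vpternlog on a 64-bit lane: for every bit
// position, the bits of A, B and C form an index (A is the high bit) that
// selects one bit of the 8-bit immediate.
inline std::uint64_t ternary_model(std::uint64_t a, std::uint64_t b,
                                   std::uint64_t c, std::uint8_t imm8) {
    std::uint64_t result = 0;
    for (int bit = 0; bit < 64; ++bit) {
        const unsigned idx =
            static_cast<unsigned>((((a >> bit) & 1u) << 2) |
                                  (((b >> bit) & 1u) << 1) |
                                  ((c >> bit) & 1u));
        result |= static_cast<std::uint64_t>((imm8 >> idx) & 1u) << bit;
    }
    return result;
}

// Hypothetical scalar model of _mm512_andnot_si512(x, y): (NOT x) AND y.
inline std::uint64_t andnot_model(std::uint64_t x, std::uint64_t y) {
    return ~x & y;
}

// Checks the two lowerings mentioned in this thread. The inputs
// 0xF0 / 0xCC / 0xAA cover all eight (A, B, C) bit combinations.
inline void check_lowerings() {
    const std::uint64_t a = 0xF0, b = 0xCC, c = 0xAA;
    // code=0x02: (B or A) notand C
    assert(ternary_model(a, b, c, 0x02) == andnot_model(b | a, c));
    // code=0x0A: C and not A
    assert(ternary_model(a, b, c, 0x0A) == andnot_model(a, c));
}
```

Calling `check_lowerings()` from a test or from `main` only confirms that both forms compute the same bits; it says nothing about which one is faster on a given CPU.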