Add SSE fused multiply-add

Question

Closed this issue 5 years ago · 2 comments

Hi Nick,
Thanks for the great library. I noticed that it does not yet support the x86 fused multiply-add (FMA) intrinsics for vectors:

_mm_fmadd_pd(), _mm256_fmadd_pd()
_mm_fmadd_ps(), _mm256_fmadd_ps()
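
For reference, here is a minimal sketch of the requested fused operation next to the separate multiply/add it would replace. The helper names (mul_add_sse, mul_add_fma, mul_add4) are hypothetical and not part of the library. The target attribute and __builtin_cpu_supports are GCC/Clang features assumed here so the snippet builds without -mfma; MSVC would need a different mechanism.

```cpp
#include <immintrin.h>

// Separate multiply and add (SSE): two instructions, two roundings per lane.
inline __m128 mul_add_sse(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

// Fused multiply-add (FMA3): one instruction, one rounding per lane.
// Assumption: GCC/Clang target attribute, so no -mfma flag is needed.
__attribute__((target("fma")))
inline __m128 mul_add_fma(__m128 a, __m128 b, __m128 c) {
    return _mm_fmadd_ps(a, b, c);
}

// Hypothetical dispatcher: out[i] = a[i] * b[i] + c[i] for four floats.
// The FMA path is taken only when the running CPU reports FMA support.
inline void mul_add4(const float* a, const float* b, const float* c, float* out) {
    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    const __m128 vc = _mm_loadu_ps(c);
    const __m128 r = __builtin_cpu_supports("fma")
                         ? mul_add_fma(va, vb, vc)
                         : mul_add_sse(va, vb, vc);
    _mm_storeu_ps(out, r);
}
```

Because the fused form rounds once per lane instead of twice, its result can differ in the last bit from the separate form when the intermediate product is not exactly representable.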

Answer 1 · 2019-07-09T13:10:12.000Z

Thank you for creating this issue. I don't use x64 FMA in my own code yet, but I will get around to it eventually, especially once newer consoles start supporting it. They most likely will, since CPUs have supported it for quite a few years now, roughly since the current generation first came out.

Answer 2 · 2019-11-01T02:40:20.000Z

I added support for this and, sadly, FMA was slower than the equivalent SSE code in every test I ran. See the benchmarks added in the commit.

Run on a Ryzen 2950X (32 X 3500 MHz CPUs):

Benchmark (AVX2 build)	Time	CPU	Iterations
bm_quat_mul_scalar	7.48 ns	7.50 ns	89600000
bm_quat_mul_fma_mul	7.50 ns	7.50 ns	89600000
bm_quat_mul_fma_xor	6.38 ns	6.42 ns	112000000
bm_quat_mul_sse_mul	6.43 ns	6.56 ns	112000000
bm_quat_mul_sse_xor	6.04 ns	6.00 ns	112000000
bm_quat_mul_vector3_ref	9.90 ns	10.0 ns	74666667
bm_quat_mul_vector3_fma	11.2 ns	11.2 ns	56000000
bm_quat_mul_vector3_sse2	9.23 ns	9.21 ns	74666667

I also measured on my MacBook Pro (4 X 2600 MHz Haswell CPUs), and the numbers are similarly bad when FMA intrinsics are used instead of separate MUL/ADD. While FMA reduces the number of registers and instructions used, it remains significantly slower. You can easily run the benchmark yourself to validate this with a command line as follows:

python make.py -compiler vs2017 -avx2 -unit_test -bench -build -clean