nfrechette/rtm

Add SSE fused multiply-add


Hi Nick,
Thanks for the great library. I noticed that it does not yet support the SSE/AVX intrinsics for fused vector multiply-add:

_mm_fmadd_pd(), _mm256_fmadd_pd()
_mm_fmadd_ps(), _mm256_fmadd_ps()

Thank you for creating this issue. I don't use x64 FMA in my own code yet, but I will get around to it eventually, especially once newer consoles start supporting it. They most likely will, since CPUs have supported it for quite a few years now, roughly since the current console generation first came out.

I added support for this and, sadly, FMA was always slower in the tests I ran. See the benchmarks added in the commit.

Run on a Ryzen 2950X (32 x 3500 MHz CPUs):

Benchmark (AVX2)             Time       CPU        Iterations
bm_quat_mul_scalar           7.48 ns    7.50 ns      89600000
bm_quat_mul_fma_mul          7.50 ns    7.50 ns      89600000
bm_quat_mul_fma_xor          6.38 ns    6.42 ns     112000000
bm_quat_mul_sse_mul          6.43 ns    6.56 ns     112000000
bm_quat_mul_sse_xor          6.04 ns    6.00 ns     112000000
bm_quat_mul_vector3_ref      9.90 ns    10.0 ns      74666667
bm_quat_mul_vector3_fma      11.2 ns    11.2 ns      56000000
bm_quat_mul_vector3_sse2     9.23 ns    9.21 ns      74666667

I also measured on my MacBook Pro (4 x 2600 MHz Haswell CPUs) and the numbers are similarly bad when FMA intrinsics are used instead of separate MUL/ADD. While FMA reduces the number of registers and instructions used, it remains significantly slower. You can easily run the benchmark yourself to validate this with the following command line:

python make.py -compiler vs2017 -avx2 -unit_test -bench -build -clean
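
For readers who want to see what "FMA intrinsics instead of separate MUL/ADD" means in code, here is a minimal sketch of the two paths. It is not rtm's implementation: the names `SKETCH_HAS_SSE2`, `sketch_mul_add`, and `mul_add4` are hypothetical and exist only for illustration. It assumes the compiler defines `__FMA__` when FMA code generation is enabled (as GCC and Clang do with `-mfma`), and it falls back to plain scalar code on non-x86 targets.

```cpp
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <immintrin.h>
#define SKETCH_HAS_SSE2 1   // hypothetical macro, used only in this sketch
#endif

#if defined(SKETCH_HAS_SSE2)
// Hypothetical helper (not part of rtm): returns (a * b) + c for each of the 4 lanes.
inline __m128 sketch_mul_add(__m128 a, __m128 b, __m128 c)
{
#if defined(__FMA__)
    // Fused path: one instruction, the product is not rounded before the add.
    return _mm_fmadd_ps(a, b, c);
#else
    // Separate path: multiply, then add, each as its own instruction.
    return _mm_add_ps(_mm_mul_ps(a, b), c);
#endif
}
#endif

// Hypothetical wrapper: writes (a[i] * b[i]) + c[i] into out[i] for i in [0, 4).
inline void mul_add4(const float* a, const float* b, const float* c, float* out)
{
#if defined(SKETCH_HAS_SSE2)
    // Unaligned loads and stores keep the sketch safe for any float pointer.
    const __m128 result = sketch_mul_add(_mm_loadu_ps(a), _mm_loadu_ps(b), _mm_loadu_ps(c));
    _mm_storeu_ps(out, result);
#else
    // Scalar fallback for targets without SSE2.
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] * b[i] + c[i];
#endif
}
```

Both paths compute the same expression, so which one is faster depends on how the surrounding code is scheduled on a given CPU. That is why the numbers above come from measuring each variant rather than from counting instructions.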