Add SSE fused multiply-add
Hi Nick,
Thanks for the great library. I noticed that it does not yet support the SSE/AVX intrinsics for fused vector multiply-add:
_mm_fmadd_pd(), _mm256_fmadd_pd()
_mm_fmadd_ps(), _mm256_fmadd_ps()
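For illustration, here is a minimal sketch of how a multiply-add helper could pick between the fused intrinsic and separate MUL/ADD at compile time. The name `mul_add4` and the raw-pointer signature are hypothetical and are not the library's API. The sketch also assumes the compiler defines `__FMA__` when FMA code generation is enabled (GCC and Clang do with `-mfma`; other compilers may need a different check) and falls back to plain SSE or scalar code otherwise.

```cpp
#if defined(__FMA__)
    #include <immintrin.h>   // _mm_fmadd_ps (FMA3)
#elif defined(__SSE2__)
    #include <emmintrin.h>   // also pulls in the SSE header for _mm_mul_ps / _mm_add_ps
#endif

// Hypothetical helper (not part of the library): computes
// out[i] = a[i] * b[i] + c[i] for four float lanes.
inline void mul_add4(const float* a, const float* b, const float* c, float* out)
{
#if defined(__FMA__)
    // Fused path: a single instruction with one rounding step per lane.
    const __m128 r = _mm_fmadd_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), _mm_loadu_ps(c));
    _mm_storeu_ps(out, r);
#elif defined(__SSE2__)
    // Separate path: multiply, round, then add and round again.
    const __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(out, _mm_add_ps(prod, _mm_loadu_ps(c)));
#else
    // Scalar fallback for targets without the x86 SIMD macros above.
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] * b[i] + c[i];
#endif
}
```

Because the fused path rounds once instead of twice, its results can differ from the separate path in the last bit for some inputs, so code that depends on bit-exact output may want to keep the choice explicit rather than automatic.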
Thank you for creating this issue. I don't use x64 FMA in my own code yet, but I will get around to it eventually, especially once newer consoles start supporting it. They most likely will, since CPUs have supported it for quite a few years now, roughly since the current console generation first came out.
I added support for this and, sadly, FMA is slower than the equivalent separate MUL/ADD code in every test I ran. See the benchmarks added in the commit.
Run on a Ryzen 2950X (32 x 3500 MHz CPUs):
Benchmark (AVX2) | Time | CPU | Iterations |
---|---|---|---|
bm_quat_mul_scalar | 7.48 ns | 7.50 ns | 89600000 |
bm_quat_mul_fma_mul | 7.50 ns | 7.50 ns | 89600000 |
bm_quat_mul_fma_xor | 6.38 ns | 6.42 ns | 112000000 |
bm_quat_mul_sse_mul | 6.43 ns | 6.56 ns | 112000000 |
bm_quat_mul_sse_xor | 6.04 ns | 6.00 ns | 112000000 |
bm_quat_mul_vector3_ref | 9.90 ns | 10.0 ns | 74666667 |
bm_quat_mul_vector3_fma | 11.2 ns | 11.2 ns | 56000000 |
bm_quat_mul_vector3_sse2 | 9.23 ns | 9.21 ns | 74666667 |
I also measured on my MacBook Pro (4 x 2600 MHz Haswell CPUs) and the numbers are similarly bad when FMA intrinsics are used instead of separate MUL/ADD. While FMA reduces the number of registers and instructions used, it remains significantly slower. You can easily run the benchmark yourself to validate this with the following command line:
python make.py -compiler vs2017 -avx2 -unit_test -bench -build -clean