Inaccuracy for f32 erf on non-FMA instruction set
bnprks opened this issue · 0 comments
Thanks for writing and maintaining such a fast and useful library!
I've noticed some precision issues where the erf function with f32 inputs may exceed the documented 1.0 ULP error bound when FMA is not available.
I ran my testing on x86, where scalar and AVX2 return consistently accurate values, but SSE4 and SSE2 sometimes return inaccurate values.
For example, the input 0x1.16945ap-126
(1.279174322e-38
) has a 1.3 ULP error, and the input -0x1.d1b6c8p-127
(-1.069226882e-38
) has 1.8 ULP error.
Here is some example code to reproduce these results, tested on a fresh pull of the main branch (a99491a):
#include <stdio.h>
#include <math.h>
#include "sleef.h"
//cmake -B build -S . -G Ninja
//cmake --build build
//gcc -o sleef_erf_example sleef_erf_example.c -lsleef -lm -L./build/lib -I ./build/include -march=native
int main() {
float example_input = 0x1.16945ap-126;
printf("Problem input:\t%1$.10g (%1$a)\n", example_input);
printf("Sleef scalar:\t%1$.10g (%1$a)\n", Sleef_erff1_u10(example_input));
printf("Sleef avx2:\t%1$.10g (%1$a)\n", _mm256_cvtss_f32(Sleef_erff8_u10avx2(_mm256_set1_ps(example_input))));
printf("Sleef sse4:\t%1$.10g (%1$a)\n", _mm_cvtss_f32(Sleef_erff4_u10sse4(_mm_set1_ps(example_input))));
printf("Sleef sse2:\t%1$.10g (%1$a)\n", _mm_cvtss_f32(Sleef_erff4_u10sse2(_mm_set1_ps(example_input))));
printf("Correct result:\t%1$.20Lg (%1$La)\n", erfl((long double) example_input));
}
When run, this outputs:
Problem input: 1.279174322e-38 (0x1.16945ap-126)
Sleef scalar: 1.44339361e-38 (0x1.3a57e2p-126)
Sleef avx2: 1.44339361e-38 (0x1.3a57e2p-126)
Sleef sse4: 1.44339347e-38 (0x1.3a57ep-126)
Sleef sse2: 1.44339347e-38 (0x1.3a57ep-126)
Correct result: 1.4433936563104040097e-38 (0x9.d2bf154036c0db7p-129)
It appears the differences arise in the call to dfmul_vf2_vf2_vf
on this line 3524 of sleefsimdsp.c
Line 3524 in a99491a
An algorithm update to improve non-FMA precision would of course be appreciated. For my use-case, however, I don't need the full 1.0 ULP precision. It would be fine to just update the docs to state that the error bound is 2.0 ULP when FMA isn't available.