Inaccuracy for f32 erf on non-FMA instruction set

Thanks for writing and maintaining such a fast and useful library!

I've noticed some precision issues where the erf function with f32 inputs may exceed the documented 1.0 ULP error bound when FMA is not available.

I ran my testing on x86, where scalar and AVX2 return consistently accurate values, but SSE4 and SSE2 sometimes return inaccurate values.

For example, the input 0x1.16945ap-126 (1.279174322e-38) has a 1.3 ULP error, and the input -0x1.d1b6c8p-127 (-1.069226882e-38) has 1.8 ULP error.

Here is some example code to reproduce these results, tested on a fresh pull of the main branch (a99491a):

#include <stdio.h>
#include <math.h>

#include "sleef.h"

//cmake -B build -S . -G Ninja
//cmake --build build
//gcc -o sleef_erf_example sleef_erf_example.c -lsleef -lm -L./build/lib -I ./build/include -march=native

int main() {
    float example_input = 0x1.16945ap-126;
    printf("Problem input:\t%1$.10g (%1$a)\n", example_input);
    printf("Sleef scalar:\t%1$.10g (%1$a)\n", Sleef_erff1_u10(example_input));
    printf("Sleef avx2:\t%1$.10g (%1$a)\n", _mm256_cvtss_f32(Sleef_erff8_u10avx2(_mm256_set1_ps(example_input))));
    printf("Sleef sse4:\t%1$.10g (%1$a)\n", _mm_cvtss_f32(Sleef_erff4_u10sse4(_mm_set1_ps(example_input))));
    printf("Sleef sse2:\t%1$.10g (%1$a)\n", _mm_cvtss_f32(Sleef_erff4_u10sse2(_mm_set1_ps(example_input))));
    printf("Correct result:\t%1$.20Lg (%1$La)\n", erfl((long double) example_input));
}

When run, this outputs:

Problem input:  1.279174322e-38 (0x1.16945ap-126)
Sleef scalar:   1.44339361e-38 (0x1.3a57e2p-126)
Sleef avx2:     1.44339361e-38 (0x1.3a57e2p-126)
Sleef sse4:     1.44339347e-38 (0x1.3a57ep-126)
Sleef sse2:     1.44339347e-38 (0x1.3a57ep-126)
Correct result: 1.4433936563104040097e-38 (0x9.d2bf154036c0db7p-129)

It appears the differences arise in the call to dfmul_vf2_vf2_vf on this line 3524 of sleefsimdsp.c

sleef/src/libm/sleefsimdsp.c

Line 3524 in a99491a

    
           t2 = vsel_vf2_vo_vf2_vf2(vlt_vo_vf_vf(x, vcast_vf_f(1e-4)), dfmul_vf2_vf2_vf(vcast_vf2_f_f(-1.1283792257308959961, 5.8635383422197591097e-08), x), t2);

An algorithm update to improve non-FMA precision would of course be appreciated. For my use-case, however, I don't need the full 1.0 ULP precision. It would be fine to just update the docs to state that the error bound is 2.0 ULP when FMA isn't available.