LewisGaul / zig-f128math

GCC calculates exp2() using exp() (base e). I'm not sure why it's done this way round, as I would have thought base 2 would be the more natural one to implement... See exp() section below.

Musl approach

Using a different method to others - based on https://github.com/ARM-software/optimized-routines/tree/master/math (since commit 3f94c648, 2017). Claims to achieve 1.8x with -O3 for expf() (not as good for exp2()).

Also based on the above (since commit e16f7b3c0, 2018). Claims to achieve similar (slightly better) improvements to single precision.

The implementation of quadruple precision comes from FreeBSD (lib/msun/ld80/s_exp2l.c and lib/msun/ld128/s_exp2l.c) - implementation depends on choice of format for long double via a #ifdef.

FreeBSD approach

Accuracy: Peak error < 0.501 ulp; location of peak: -0.030110927.

Method: (equally-spaced tables)

Reduce x: x = 2**k + y, for integer k and |y| <= 1/2. Thus we have exp2(x) = 2**k * exp2(y).

Reduce y: y = i/TBLSIZE + z for integer i near y * TBLSIZE. Thus we have exp2(y) = exp2(i/TBLSIZE) * exp2(z), with |z| <= 2**-(TBLSIZE+1).

We compute exp2(i/TBLSIZE) via table lookup and exp2(z) via a degree-4 minimax polynomial with maximum error under 1.4 * 2**-33. Using double precision for everything except the reduction makes roundoff error insignificant and simplifies the scaling step.

This method is due to Tang, but I do not use his suggested parameters:

Tang, P. Table-driven Implementation of the Exponential Function in IEEE Floating-Point Arithmetic. TOMS 15(2), 144-157 (1989).

Accuracy: Peak error < 0.503 ulp for normalized results.

Method: (accurate tables)

Reduce x: x = 2**k + y, for integer k and |y| <= 1/2. Thus we have exp2(x) = 2**k * exp2(y).

Reduce y: y = i/TBLSIZE + z - eps[i] for integer i near y * TBLSIZE. Thus we have exp2(y) = exp2(i/TBLSIZE) * exp2(z - eps[i]), with |z - eps[i]| <= 2**-9 + 2**-39 for the table used.

We compute exp2(i/TBLSIZE) via table lookup and exp2(z - eps[i]) via a degree-5 minimax polynomial with maximum error under 1.3 * 2**-61. The values in exp2t[] and eps[] are chosen such that exp2t[i] = exp2(i/TBLSIZE + eps[i]), and eps[i] is a small offset such that exp2t[i] is accurate to 2**-64.

Note that the range of i is +-TBLSIZE/2, so we actually index the tables by i0 = i + TBLSIZE/2. For cache efficiency, exp2t[] and eps[] are virtual tables, interleaved in the real table tbl[].

This method is due to Gal, with many details due to Gal and Bachelis:

Gal, S. and Bachelis, B. An Accurate Elementary Mathematical Library for the IEEE Floating Point Standard. TOMS 17(1), 26-46 (1991).

https://github.com/golang/go/blob/master/src/math/exp.go

Accuracy: Peak error < 0.502 ulp.

Method: (accurate tables)

Reduce x: x = 2**k + y, for integer k and |y| <= 1/2. Thus we have exp2(x) = 2**k * exp2(y).

Reduce y: y = i/TBLSIZE + z - eps[i] for integer i near y * TBLSIZE. Thus we have exp2(y) = exp2(i/TBLSIZE) * exp2(z - eps[i]), with |z - eps[i]| <= 2**-8 + 2**-98 for the table used.

We compute exp2(i/TBLSIZE) via table lookup and exp2(z - eps[i]) via a degree-10 minimax polynomial with maximum error under 2**-120. The values in exp2t[] and eps[] are chosen such that exp2t[i] = exp2(i/TBLSIZE + eps[i]), and eps[i] is a small offset such that exp2t[i] is accurate to 2**-122.

Note that the range of i is +-TBLSIZE/2, so we actually index the tables by i0 = i + TBLSIZE/2.

This method is due to Gal, with many details due to Gal and Bachelis:

Gal, S. and Bachelis, B. An Accurate Elementary Mathematical Library for the IEEE Floating Point Standard. TOMS 17(1), 26-46 (1991).

OpenLibm approach

The same as FreeBSD.

Go approach

Appears to be based on Sun Microsystems. Only has double precision implementation, doesn't use table of values?

Tang

Used by:

FreeBSD, OpenLibm (single precision)

Gal & Bachelis

Used by:

FreeBSD, OpenLibm (double/quadruple precision), Musl (quadruple precision)
- 10-degree polynomial for quad

A detailed description of the approach is given below for the range of 'normal' values, following the code snippet given.

fn exp2_64(x: f64) f64 {
    const tblsiz = @intCast(u32, exp2dt.len / 2);
    const redux: f64 = 0x1.8p52 / @intToFloat(f64, tblsiz);
    const P1: f64 = 0x1.62e42fefa39efp-1;
    const P2: f64 = 0x1.ebfbdff82c575p-3;
    const P3: f64 = 0x1.c6b08d704a0a6p-5;
    const P4: f64 = 0x1.3b2ab88f70400p-7;
    const P5: f64 = 0x1.5d88003875c74p-10;

    const ux = @bitCast(u64, x);
    const ix = @intCast(u32, ux >> 32) & 0x7FFFFFFF;

    // reduce x
    var uf: f64 = x + redux;
    // NOTE: musl performs an implicit 64-bit to 32-bit u32 truncation here
    var i_0: u32 = @truncate(u32, @bitCast(u64, uf));
    _ = @addWithOverflow(u32, i_0, tblsiz / 2, &i_0);

    const k: u32 = i_0 / tblsiz * tblsiz;
    const ik: i32 = @bitCast(i32, k / tblsiz);
    i_0 %= tblsiz;
    uf -= redux;

    // r = exp2(y) = exp2t[i_0] * p(z - eps[i])
    var z: f64 = x - uf;
    const t: f64 = exp2dt[@intCast(usize, 2 * i_0)];
    z -= exp2dt[@intCast(usize, 2 * i_0 + 1)];
    const r: f64 = t + t * z * (P1 + z * (P2 + z * (P3 + z * (P4 + z * P5))));

    return math.scalbn(r, ik);
}

Where does the redux value 0x1.8p52 come from?
- This is setting the top two bits (1.1000...), with the exponent set to the size of the mantissa (52 bits) and only the top bit in the mantissa set.
- When you add x this will discard any part of x that is too small to fit in the mantissa with the high-bit set.
Examples of the first step, where redux=@bitCast(f64,0x42b8000000000000) is added (with table size 256):
- x=1: uf=0x42b8000000000100 - no reduction
- x=0.511: uf=0x42b8000000000083 - the 0.01 is removed
... TODO ...

Other

Used by:

Musl (single/double precision)
- Based on https://github.com/ARM-software/optimized-routines
GCC (quadruple precision)
- Uses exp(), which uses Abraham Ziv's method

Function: `exp(x: float) -> float`

Return the result of e raised to the given argument.

We approach this as follows:

Handle special cases:
- exp(nan) -> nan
- exp(+inf) -> +inf
- exp(-inf) -> 0
- Input is sufficiently large/small such that we get an overflow/underflow (to inf or 0)
Reduce to r such that |r| <= 0.5*ln2, x = k*ln2 + r
Approximate solution using Chebyshev polynomial on a narrowed range
Scale up with scalbn(), which implements x * 2^k for integer exponent

This appears to be approached slightly differently to exp2() (and does not call out to exp2() in general, using scalbn() instead) - there are less tables of values and narrowing is done on a smaller range? Although it seems a table of values is needed for 128-bit floats to achieve the required precision...

This article may be of interest: https://www.pseudorandom.com/implementing-exp.

Quadruple precision will be based on FreeBSD (Musl appears to be missing a working implementation).

GCC approach

https://github.com/gcc-mirror/gcc/blob/master/libquadmath/math/expq.c

The design is taken from Abraham Ziv, "Fast Evaluation of Elementary Mathematical Functions with Correctly Rounded Last Bit", ACM Trans. Math. Soft., 17 (3), September 1991, pp. 410-423.

The input is split into a high part and a low part. Calculations with the high part must be exact.

The input value is written as

x = n * ln(2)_0 + arg1[t1]_0 + arg2[t2]_0 + x - n * ln(2)_1 + arg1[t1]_1 + arg2[t2]_1 + xl

where:

n is an integer, 16384 >= n >= -16495
ln(2)_0 is the first 93 bits of ln(2), and |ln(2)_0-ln(2)-ln(2)_1| < 2^-205
t1 is an integer, 89 >= t1 >= -89
t2 is an integer, 65 >= t2 >= -65
|arg1[t1]-t1/256.0| < 2^-53
|arg2[t2]-t2/32768.0| < 2^-53
x + xl is whatever is left, |x + xl| < 2^-16 + 2^-53

Then e^x is approximated as

e^x = 2^n_1 ( 2^n_0 e^(arg1[t1]_0 + arg1[t1]_1) e^(arg2[t2]_0 + arg2[t2]_1) + 2^n_0 e^(arg1[t1]_0 + arg1[t1]_1) e^(arg2[t2]_0 + arg2[t2]_1) * p (x + xl + n * ln(2)_1))

where:

p(x) is a polynomial approximating e(x)-1
e^(arg1[t1]_0 + arg1[t1]_1) is obtained from a table
e^(arg2[t2]_0 + arg2[t2]_1) likewise
n_1 + n_0 = n, so that |n_0| < -FLT128_MIN_EXP-1

In most cases n_1 == 0, meaning there is a multiplication that can be omitted.

Musl approach

Also based on the above (since commit e16f7b3c0, 2018). Claims to achieve similar (slightly better) improvements to single precision.

The implementation of quadruple precision comes from OpenBSD - implementation depends on choice of format for long double via a #ifdef, but seems to be missing a proper f128 implementation...).

FreeBSD approach