A collection of functions for fast number crunching using Fortran.
To get the maximum performance from this library, compile with "-O3 -march=native" (or equivalent).
Available functions
| function | name(s) | shapes | types |
|---|---|---|---|
| sum | fsum, fsum_kahan (1) | 1d | real32, real64 |
| dot | fprod, fprod_kahan (2) | 1d | real32, real64 |
| cos | fcos | elemental | real32, real64 |
| sin | fsin | elemental | real32, real64 |
| tan | ftan | elemental | real32, real64 |
| tanh | ftanh | elemental | real32, real64 |
| acos | facos | elemental | real32, real64 |
| atan | fatan | elemental | real32, real64 |
| erf | ferf | elemental | real32, real64 |
| log | flog_p3, flog_p5 | elemental | real64 |
| rsqrt (3) | frsqrt | elemental | real32, real64 |
(1) Fast (and precise) sum for 1D arrays, with the possibility of including a mask.
fsum: Fastest method and, at worst, as precise as or one order of magnitude more precise than the intrinsic sum. It accumulates chunks of values in a temporary working batch, which is summed up once at the end.
fsum_kahan: Highest precision. Its precision is close to that of a sum carried out in the next higher precision (real64 for real32 data, and real128 for real64 data). It also uses the chunk principle, with an elemental Kahan operator applied on top.
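The chunk principle and the Kahan operator described in note (1) can be illustrated with a minimal sketch. This is not the library's source: the module name `sum_sketch`, the routine names `chunk_sum`, `chunk_kahan_sum` and `kahan_add`, and the batch width of 64 are all assumptions made for illustration, and mask support is left out.

```fortran
module sum_sketch
    ! Illustrative only: names and batch width are assumptions, not the library API.
    use iso_fortran_env, only: dp => real64
    implicit none
    private
    public :: chunk_sum, chunk_kahan_sum

    integer, parameter :: chunk = 64   ! assumed width of the working batch

contains

    !> Chunked sum: accumulate into `chunk` independent partial sums,
    !> then reduce the batch once at the end.
    pure function chunk_sum(a) result(s)
        real(dp), intent(in) :: a(:)
        real(dp) :: s
        real(dp) :: batch(chunk)
        integer :: i, n, r

        n = size(a)
        r = mod(n, chunk)
        batch = 0.0_dp
        batch(1:r) = a(1:r)                  ! leftover elements are loaded first
        do i = r + 1, n, chunk
            batch = batch + a(i:i+chunk-1)   ! whole-array add over one chunk
        end do
        s = sum(batch)
    end function chunk_sum

    !> One compensated (Kahan) accumulation step: s holds the running sum,
    !> c the running compensation for lost low-order bits.
    elemental subroutine kahan_add(s, c, x)
        real(dp), intent(inout) :: s, c
        real(dp), intent(in) :: x
        real(dp) :: y, t
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    end subroutine kahan_add

    !> Chunked sum with the Kahan step applied element-wise to the batch,
    !> followed by a compensated reduction of the batch.
    pure function chunk_kahan_sum(a) result(s)
        real(dp), intent(in) :: a(:)
        real(dp) :: s
        real(dp) :: batch(chunk), comp(chunk), c
        integer :: i, n, r

        n = size(a)
        r = mod(n, chunk)
        batch = 0.0_dp
        comp = 0.0_dp
        batch(1:r) = a(1:r)
        do i = r + 1, n, chunk
            call kahan_add(batch, comp, a(i:i+chunk-1))
        end do
        s = 0.0_dp
        c = 0.0_dp
        do i = 1, chunk
            call kahan_add(s, c, batch(i))
            call kahan_add(s, c, -comp(i))   ! fold in each lane's compensation
        end do
    end function chunk_kahan_sum

end module sum_sketch
```

With `use sum_sketch`, a call such as `chunk_sum(x)` or `chunk_kahan_sum(x)` on a `real64` rank-1 array returns the total. Note that aggressive floating-point flags (for example `-ffast-math`) may let a compiler simplify the compensation away, so the Kahan variant is best checked with the flags actually used.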
(2) Fast (and precise) dot product for 1D arrays, with the possibility of including a third weighting array.
fprod: Fastest method and, at worst, one order of magnitude more precise than the intrinsic dot_product. Runtime can vary between 3X and 8X the intrinsic. It accumulates chunks of products in a temporary working batch, which is summed up once at the end (based on fsum).
fprod_kahan: Same idea as fsum_kahan, but on top of chunked products.
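The chunked dot product from note (2), including the optional third weighting array, might look roughly like the sketch below. The module name `dot_sketch`, the function name `chunk_dot`, the argument name `w` and the batch width of 64 are hypothetical and not taken from the library.

```fortran
module dot_sketch
    ! Illustrative only: names and batch width are assumptions, not the library API.
    use iso_fortran_env, only: dp => real64
    implicit none
    private
    public :: chunk_dot

    integer, parameter :: chunk = 64   ! assumed width of the working batch

contains

    !> Chunked dot product sum(a*b), or sum(a*b*w) when the optional
    !> weights are given. Assumes w has at least min(size(a), size(b)) elements.
    pure function chunk_dot(a, b, w) result(p)
        real(dp), intent(in) :: a(:), b(:)
        real(dp), intent(in), optional :: w(:)
        real(dp) :: p
        real(dp) :: batch(chunk)
        integer :: i, n, r

        n = min(size(a), size(b))
        r = mod(n, chunk)
        batch = 0.0_dp
        if (present(w)) then
            batch(1:r) = a(1:r)*b(1:r)*w(1:r)   ! leftover products first
            do i = r + 1, n, chunk
                batch = batch + a(i:i+chunk-1)*b(i:i+chunk-1)*w(i:i+chunk-1)
            end do
        else
            batch(1:r) = a(1:r)*b(1:r)
            do i = r + 1, n, chunk
                batch = batch + a(i:i+chunk-1)*b(i:i+chunk-1)
            end do
        end if
        p = sum(batch)                          ! single reduction at the end
    end function chunk_dot

end module dot_sketch
```

Keeping the weighted and unweighted loops separate avoids testing `present(w)` inside the hot loop. A Kahan variant would replace the `batch = batch + ...` update with a compensated step, in the same way as described for fsum_kahan.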
To generate the API documentation for fast_math using ford, run the following command:
ford ford.yml
TODO
Contribution guidelines
Polish autodoc
Elapsed time examples and precision
Warning: The following values are only references, meant to show how much results can differ between compilers. Actual speed-ups (or slow-downs) should be measured under the true use conditions to account for inlining (or the lack of it), etc. Results were obtained on an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz 2.89 GHz.
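For measuring under your own conditions, the three reported quantities can be computed with a small helper like the sketch below. It is not the library's test harness: the module name `bench_sketch` and all routine names are hypothetical, and it assumes that ns/eval means elapsed nanoseconds divided by the number of elements processed, which is an interpretation rather than something stated here.

```fortran
module bench_sketch
    ! Illustrative only: not the library's benchmark code.
    use iso_fortran_env, only: int64, dp => real64
    implicit none
    private
    public :: ns_per_eval, relative_error, speed_up, time_intrinsic_sum

contains

    !> Convert system_clock ticks to nanoseconds per evaluation.
    pure function ns_per_eval(c0, c1, rate, nevals) result(ns)
        integer(int64), intent(in) :: c0, c1, rate, nevals
        real(dp) :: ns
        ns = real(c1 - c0, dp) / real(rate, dp) * 1.0e9_dp / real(nevals, dp)
    end function ns_per_eval

    !> Relative error of val against a nonzero reference value.
    pure function relative_error(val, ref) result(err)
        real(dp), intent(in) :: val, ref
        real(dp) :: err
        err = abs(val - ref) / abs(ref)
    end function relative_error

    !> Speed-up of a new timing over a reference timing (>1 means faster).
    pure function speed_up(t_ref, t_new) result(s)
        real(dp), intent(in) :: t_ref, t_new
        real(dp) :: s
        s = t_ref / t_new
    end function speed_up

    !> Time the intrinsic sum over nrep repetitions. The result is accumulated
    !> and returned so the loop is less likely to be optimized away, although
    !> a compiler may still hoist the call; inspect the timings critically.
    subroutine time_intrinsic_sum(a, nrep, ns_eval, res)
        real(dp), intent(in) :: a(:)
        integer, intent(in) :: nrep
        real(dp), intent(out) :: ns_eval, res
        integer(int64) :: c0, c1, rate
        integer :: k

        res = 0.0_dp
        call system_clock(c0, rate)
        do k = 1, nrep
            res = res + sum(a)
        end do
        call system_clock(c1)
        res = res / real(nrep, dp)           ! mean result over repetitions
        ns_eval = ns_per_eval(c0, c1, rate, int(size(a), int64) * nrep)
    end subroutine time_intrinsic_sum

end module bench_sketch
```

Timing a candidate routine the same way and passing both timings to `speed_up` gives a figure comparable to the Speed-Up column, provided both are measured with the same flags, array sizes and call sites as the real application.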
WSL2 gfortran 13.2 > fpm test --flag "-cpp -O3 -march=native -flto"

| sum r32 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 1.0300 | 1.00 | 3.1511E-06 |
| kahan | 0.1200 | 8.58 | 9.5367E-08 |
| chunk | 0.0900 | 11.44 | 1.0824E-07 |

| sum r64 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 1.2100 | 1.00 | 5.6974E-15 |
| kahan | 0.4300 | 2.81 | 1.3278E-16 |
| chunk | 0.1100 | 11.00 | 2.3359E-16 |

| sum r32 mask | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 1.2300 | 1.00 | 1.6180E-06 |
| kahan | 4.3400 | 0.28 | 8.3327E-08 |
| chunk | 0.3800 | 3.24 | 8.8394E-08 |

| sum r64 mask | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 3.8600 | 1.00 | 2.9463E-15 |
| kahan | 4.1950 | 0.92 | 6.8723E-17 |
| chunk | 0.4200 | 9.19 | 1.1879E-16 |

| dot r32 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 1.2100 | 1.00 | 3.2994E-06 |
| kahan | 0.2300 | 5.26 | 9.8348E-08 |
| chunk | 0.1200 | 10.08 | 1.1307E-07 |

| dot r64 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 1.1900 | 1.00 | 5.9648E-15 |
| kahan | 0.4400 | 2.70 | 1.2812E-16 |
| chunk | 0.0900 | 13.22 | 2.2760E-16 |

| trigo | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| fsin r32 | 0.5280 | 7.54 | 3.0972E-07 |
| fsin r64 | 0.9320 | 8.76 | 3.9779E-16 |
| facos r32 | 0.3080 | 20.87 | 2.9135E-05 |
| facos r64 | 0.5960 | 15.90 | 2.1557E-14 |

| hyperb | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| ftanh r32 | 1.7280 | 10.07 | 7.4200E-08 |
| ftanh r64 | 1.9360 | 9.32 | 1.3282E-09 |
| ferf r32 | 0.4760 | 31.71 | 9.6432E-08 |
| ferf r64 | 0.7640 | 18.42 | 9.6298E-08 |

| rsqrt | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| frsqrt r32 | 0.3480 | 1.06 | 9.4399E-04 |
| frsqrt r64 | 0.6320 | 2.23 | 8.6268E-04 |
WSL2 nvfortran 23.9 > fpm test --flag "-Mpreprocess -fast -Minline"

| sum r32 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.1000 | 1.00 | 1.1295E-07 |
| kahan | 1.2500 | 0.08 | 9.8169E-08 |
| chunk | 0.0700 | 1.43 | 7.0930E-08 |

| sum r64 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.1400 | 1.00 | 3.8969E-16 |
| kahan | 1.6300 | 0.09 | 1.2623E-16 |
| chunk | 0.2500 | 0.56 | 1.8996E-16 |

| sum r32 mask | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.1700 | 1.00 | 2.0742E-07 |
| kahan | 5.5650 | 0.03 | 8.1956E-08 |
| chunk | 0.2550 | 0.67 | 5.8889E-08 |

| sum r64 mask | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.3600 | 1.00 | 3.8136E-16 |
| kahan | 5.7750 | 0.06 | 6.2839E-17 |
| chunk | 0.4400 | 0.82 | 8.5598E-17 |

| dot r32 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.1400 | 1.00 | 1.1426E-07 |
| kahan | 1.9700 | 0.07 | 9.7811E-08 |
| chunk | 0.1700 | 0.82 | 7.1764E-08 |

| dot r64 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.2400 | 1.00 | 3.9246E-16 |
| kahan | 1.8700 | 0.13 | 1.3178E-16 |
| chunk | 0.4100 | 0.59 | 1.9129E-16 |

| trigo | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| fsin r32 | 0.0160 | 726.00 | 1.0325E-07 |
| fsin r64 | 0.0280 | 388.86 | 5.0118E-17 |
| facos r32 | 0.0120 | 466.67 | 1.0563E-06 |
| facos r64 | 0.0200 | 390.60 | 3.7996E-15 |

| hyperb | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| ftanh r32 | 0.0240 | 676.67 | 5.3264E-08 |
| ftanh r64 | 0.0080 | 1595.00 | 1.3282E-09 |
| ferf r32 | 0.0040 | 4851.00 | 9.1205E-08 |
| ferf r64 | 0.0320 | 549.62 | 9.6298E-08 |

| rsqrt | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| frsqrt r32 | 16.5480 | 0.02 | 9.4387E-04 |
| frsqrt r64 | 15.8280 | 0.09 | 8.6745E-04 |
WSL2 ifort 2021.10.0 > fpm test --flag "-fpp -O3 -xHost -ipo"

| sum r32 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.0700 | 1.00 | 6.2262E-08 |
| kahan | 0.2400 | 0.29 | 9.4564E-08 |
| chunk | 0.1000 | 0.70 | 7.0930E-08 |

| sum r64 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.0800 | 1.00 | 1.9862E-16 |
| kahan | 0.5200 | 0.15 | 1.2867E-16 |
| chunk | 0.1400 | 0.57 | 2.0384E-16 |

| sum r32 mask | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.2000 | 1.00 | 2.0568E-07 |
| kahan | 0.2150 | 0.93 | 7.7122E-08 |
| chunk | 0.1450 | 1.38 | 6.7770E-08 |

| sum r64 mask | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.2150 | 1.00 | 1.9040E-16 |
| kahan | 0.4400 | 0.49 | 7.0610E-17 |
| chunk | 0.3700 | 0.58 | 8.5154E-17 |

| dot r32 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.0700 | 1.00 | 6.2031E-08 |
| kahan | 0.2100 | 0.33 | 1.0544E-07 |
| chunk | 0.0500 | 1.40 | 7.1526E-08 |

| dot r64 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.2200 | 1.00 | 6.3782E-16 |
| kahan | 0.4600 | 0.48 | 2.4047E-16 |
| chunk | 0.1200 | 1.83 | 1.8829E-16 |

| trigo | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| fsin r32 | 0.3560 | 1.26 | 1.9746E-07 |
| fsin r64 | 0.9280 | 1.38 | 7.5661E-17 |
| facos r32 | 0.3200 | 2.01 | 3.0743E-06 |
| facos r64 | 0.6520 | 3.36 | 6.3642E-15 |

| hyperb | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| ftanh r32 | 0.3960 | 3.70 | 1.1537E-08 |
| ftanh r64 | 0.6760 | 5.17 | 1.3282E-09 |
| ferf r32 | 0.3360 | 2.50 | 1.0924E-07 |
| ferf r64 | 0.8440 | 2.18 | 9.6298E-08 |

| rsqrt | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| frsqrt r32 | 0.2600 | 1.31 | 9.4032E-04 |
| frsqrt r64 | 0.6360 | 2.27 | 8.7360E-04 |
Windows ifx 2023.2.0 > fpm test --flag "-fpp -O3 -xHost -ipo"

| sum r32 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.3200 | 1.00 | 8.4376E-07 |
| kahan | 1.0300 | 0.31 | 8.7321E-08 |
| chunk | 0.4800 | 0.67 | 8.7082E-08 |

| sum r64 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 1.1200 | 1.00 | 5.7371E-15 |
| kahan | 0.9400 | 1.19 | 1.9507E-16 |
| chunk | 0.5600 | 2.00 | 1.9418E-16 |

| sum r32 mask | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 2.2700 | 1.00 | 1.5584E-06 |
| kahan | 4.4750 | 0.51 | 9.1434E-08 |
| chunk | 4.7200 | 0.48 | 8.7559E-08 |

| sum r64 mask | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 2.1750 | 1.00 | 2.9075E-15 |
| kahan | 4.7550 | 0.46 | 1.0636E-16 |
| chunk | 4.0250 | 0.54 | 1.0525E-16 |

| dot r32 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.2600 | 1.00 | 7.9530E-07 |
| kahan | 1.3800 | 0.19 | 6.8307E-08 |
| chunk | 0.4900 | 0.53 | 6.9737E-08 |

| dot r64 | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| intrinsic | 0.6200 | 1.00 | 2.9848E-15 |
| kahan | 1.4200 | 0.44 | 1.8197E-16 |
| chunk | 0.5800 | 1.07 | 1.8330E-16 |

| trigo | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| fsin r32 | 3.4640 | 0.47 | 1.3924E-07 |
| fsin r64 | 3.2320 | 1.31 | 1.0296E-15 |
| facos r32 | 1.3960 | 5.22 | 3.1710E-05 |
| facos r64 | 1.4080 | 6.28 | 5.2928E-13 |

| hyperb | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| ftanh r32 | 2.8280 | 1.22 | 2.3012E-08 |
| ftanh r64 | 2.6280 | 2.97 | 1.3282E-09 |
| ferf r32 | 3.8600 | 1.57 | 3.0995E-07 |
| ferf r64 | 3.9600 | 5.67 | 9.6298E-08 |

| rsqrt | [ns/eval] | Speed-Up | relative error |
|---|---|---|---|
| frsqrt r32 | 1.6640 | 0.19 | 9.4038E-04 |
| frsqrt r64 | 1.4320 | 0.96 | 8.7360E-04 |
Acknowledgement
Compilation of this library was possible thanks to Transvalor S.A. research activities.
Part of this library is based on the work of Perini and Reitz, which was funded through the Sandia National Laboratories by the U.S. Department of Energy, Office of Vehicle Technologies, program managers Leo Breton and Gurpreet Singh.