Potential slow use of `std::pow` in libm

Question

Potential slow use of `std::pow` in libm

aornugent opened this issue 2 years ago · 1 comments

As part of investigating #346 - @devmitch demonstrated how to profile an R session using AMDuProf on machines running AMD CPUs.

Notably, 40% of the run_plant_benchmarks takes place in the pow operator of libm.

Summaries

Modules	CPU_TIME(s)
libm.so.6	12.13
plant.so	7.88
libc.so.6	2.11
libR.so	1.96
libstdc++.so.6.0.30	0.14
[kernel.kallsyms]_text	0.03
libgcc_s.so.1	0.00
rlang.so	0.00

Calls

Functions	Modules	CPU_TIME(s)
__ieee754_pow_fma	libm.so.6	10.25
compute_competition(double) const	plant.so	1.35
__pow	libm.so.6	1.19
tk::spline::operator()(double) const	plant.so	0.67
"_M_invoke(std::_Any_data const&, double&&)"	plant.so	0.67
__memcmp_avx2_movbe	libc.so.6	0.60
"plant::K93_Strategy::compute_competition(double, double) const"	plant.so	0.51
malloc	libc.so.6	0.46
plant::util::is_finite(double)	plant.so	0.36
... truncated

A little bit of digging into stuff that I don't totally understand suggests that there's an edge case where pow becomes a very expensive operations for certain exponents:
http://entropymine.com/imageworsener/slowpow/

This appears to be required for high precision usecases:
https://stackoverflow.com/questions/9272155/replacing-extrordinarily-slow-pow-function
https://stackoverflow.com/questions/14687665/very-slow-stdpow-for-bases-very-close-to-1

Something that takes 40% of the runtime is a tempting optimisation target. It's heartening to see that the plant routines, including spline driven operations (e.g. light competition), are so fast.

Answer 1 · 2022-11-22T04:55:09.000Z

Good find @aornugent @devmitch

I'm not surprised by this for two reasons

The most expensive part of the model is running compute_competiton, which eventually comes back to calling this function on individuals

https://github.com/traitecoevo/plant/blob/572d2a6783558405dd162e9ab6602a97aa86c54e/src/ff16_strategy.cpp#L401

which itself calls two functions with calls to pow.

There's other calls on power functions at different points

However, the results above suggest compute competition is 10% of cost. It's unclear whether that's the total of all calls happening in the compute competition stack or not.

It may be that there's a lot of potential speed gain to be had by economising on number of calls to pow.

for example, are there instances where we can use a more efficient call, like x*X instead of pow(x, 2)