traitecoevo/plant

Potential slow use of `std::pow` in libm

aornugent opened this issue · 1 comments

As part of investigating #346, @devmitch demonstrated how to profile an R session using AMDuProf on machines running AMD CPUs.

Notably, about 40% of the run_plant_benchmarks run time is spent in the pow function of libm.

Summaries

| Modules | CPU_TIME(s) |
| --- | --- |
| `libm.so.6` | 12.13 |
| `plant.so` | 7.88 |
| `libc.so.6` | 2.11 |
| `libR.so` | 1.96 |
| `libstdc++.so.6.0.30` | 0.14 |
| `[kernel.kallsyms]_text` | 0.03 |
| `libgcc_s.so.1` | 0.00 |
| `rlang.so` | 0.00 |

Calls

| Functions | Modules | CPU_TIME(s) |
| --- | --- | --- |
| `__ieee754_pow_fma` | `libm.so.6` | 10.25 |
| `compute_competition(double) const` | `plant.so` | 1.35 |
| `__pow` | `libm.so.6` | 1.19 |
| `tk::spline::operator()(double) const` | `plant.so` | 0.67 |
| `_M_invoke(std::_Any_data const&, double&&)` | `plant.so` | 0.67 |
| `__memcmp_avx2_movbe` | `libc.so.6` | 0.60 |
| `plant::K93_Strategy::compute_competition(double, double) const` | `plant.so` | 0.51 |
| `malloc` | `libc.so.6` | 0.46 |
| `plant::util::is_finite(double)` | `plant.so` | 0.36 |

... truncated

A little bit of digging into material that I don't totally understand suggests that there's an edge case where pow becomes a very expensive operation for certain exponents:
http://entropymine.com/imageworsener/slowpow/

This slow path appears to be required for high-precision use cases:
https://stackoverflow.com/questions/9272155/replacing-extrordinarily-slow-pow-function
https://stackoverflow.com/questions/14687665/very-slow-stdpow-for-bases-very-close-to-1
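
One way to check whether this slow path is actually hit on a given machine is to time `std::pow` directly for different inputs. The following is only a rough sketch: `seconds_per_pow` is a hypothetical helper, not part of plant, and whether any input is slow depends on the libm version installed.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of plant): average wall-clock seconds
// per std::pow call, measured over n calls.
double seconds_per_pow(double base, double exponent, std::size_t n) {
  volatile double sink = 0.0;  // stops the compiler from discarding the results
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) {
    // Perturb the base slightly so the call cannot be hoisted out of the loop.
    sink = sink + std::pow(base + 1e-12 * static_cast<double>(i), exponent);
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count() /
         static_cast<double>(n);
}
```

Comparing, say, `seconds_per_pow(1.5, 2.5, 1000000)` against the same call with a base very close to 1 (the case described in the second Stack Overflow link) would show whether the installed libm has a markedly slower path for those inputs. The timings include loop overhead, so only large differences are meaningful.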

Something that takes 40% of the runtime is a tempting optimisation target. It's heartening to see that the plant routines, including spline-driven operations (e.g. light competition), are so fast.

Good find @aornugent @devmitch

I'm not surprised by this, for two reasons:

  1. The most expensive part of the model is running compute_competition, which eventually comes back to calling this function on individuals
https://github.com/traitecoevo/plant/blob/572d2a6783558405dd162e9ab6602a97aa86c54e/src/ff16_strategy.cpp#L401

which itself calls two functions with calls to pow.

  2. There are other calls to power functions at different points

However, the results above suggest compute_competition is about 10% of the cost. It's unclear whether that figure is the total of all calls happening in the compute_competition stack or not.

It may be that there's a lot of potential speed gain to be had by economising on the number of calls to pow.

For example, are there instances where we can use a more efficient call, like x*x instead of pow(x, 2)?