Potential slow use of `std::pow` in libm
aornugent opened this issue · 1 comments
As part of investigating #346 - @devmitch demonstrated how to profile an R session using AMDuProf on machines running AMD CPUs.
Notably, 40% of the run_plant_benchmarks
takes place in the pow
operator of libm.
Summaries
Modules | CPU_TIME(s) |
---|---|
libm.so.6 | 12.13 |
plant.so | 7.88 |
libc.so.6 | 2.11 |
libR.so | 1.96 |
libstdc++.so.6.0.30 | 0.14 |
[kernel.kallsyms]_text | 0.03 |
libgcc_s.so.1 | 0.00 |
rlang.so | 0.00 |
Calls
Functions | Modules | CPU_TIME(s) |
---|---|---|
__ieee754_pow_fma | libm.so.6 | 10.25 |
compute_competition(double) const | plant.so | 1.35 |
__pow | libm.so.6 | 1.19 |
tk::spline::operator()(double) const | plant.so | 0.67 |
"_M_invoke(std::_Any_data const&, double&&)" | plant.so | 0.67 |
__memcmp_avx2_movbe | libc.so.6 | 0.60 |
"plant::K93_Strategy::compute_competition(double, double) const" | plant.so | 0.51 |
malloc | libc.so.6 | 0.46 |
plant::util::is_finite(double) | plant.so | 0.36 |
... truncated |
A little bit of digging into stuff that I don't totally understand suggests that there's an edge case where pow
becomes a very expensive operations for certain exponents:
http://entropymine.com/imageworsener/slowpow/
This appears to be required for high precision usecases:
https://stackoverflow.com/questions/9272155/replacing-extrordinarily-slow-pow-function
https://stackoverflow.com/questions/14687665/very-slow-stdpow-for-bases-very-close-to-1
Something that takes 40% of the runtime is a tempting optimisation target. It's heartening to see that the plant routines, including spline driven operations (e.g. light competition), are so fast.
Good find @aornugent @devmitch
I'm not surprised by this for two reasons
- The most expensive part of the model is running
compute_competiton
, which eventually comes back to calling this function on individuals
https://github.com/traitecoevo/plant/blob/572d2a6783558405dd162e9ab6602a97aa86c54e/src/ff16_strategy.cpp#L401
which itself calls two functions with calls to pow
.
- There's other calls on power functions at different points
However, the results above suggest compute competition is 10% of cost. It's unclear whether that's the total of all calls happening in the compute competition stack or not.
It may be that there's a lot of potential speed gain to be had by economising on number of calls to pow.
for example, are there instances where we can use a more efficient call, like x*X instead of pow(x, 2)