embench/embench-iot

In nbody, some toolchains hoist bodies_energy out of the main loop

MarkHillHuawei opened this issue · 5 comments

This issue is related to "Compiler optimization deletes the bodies_energy function call in the nbody benchmark."

#16

Issue 16 was resolved, and the fix prevents the compiler from eliminating the bodies_energy call from the main workload loop. However, the fix does not prevent toolchains from hoisting the call out of the loop and executing it only once instead of 100 times. When the compiler makes this optimisation, the workload collapses to 100 floating point adds, and the resulting score is 1000 times that of an M4.

An example toolchain is armclang 6.14.1 (example options: --target=arm-arm-none-eabi -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -ffp-mode=fast -Os)

A possible fix is to create a pointer to the bodies_energy function, initialise it in initialise_benchmark(), and call through it in the main loop.
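The suggested fix could look roughly like the sketch below. It is not the Embench source: the `struct body` layout, `NUM_BODIES`, the kinetic-energy-only `bodies_energy` stand-in, and the `run_energy_loop` name are all assumptions made for illustration. The `volatile` qualifier on the pointer is also an addition beyond the proposal above. It forces the compiler to reload the pointer on every iteration, so it cannot prove which function is being called and hoist it.

```c
#include <stddef.h>

#define NUM_BODIES 3 /* hypothetical count, for illustration only */

/* Assumed layout; the real benchmark's struct may differ. */
struct body {
    double x[3];
    double v[3];
    double mass;
};

static struct body bodies[NUM_BODIES];

/* Simplified stand-in for bodies_energy(): kinetic energy only. */
static double bodies_energy(const struct body *b, size_t n)
{
    double e = 0.0;
    for (size_t i = 0; i < n; ++i) {
        e += 0.5 * b[i].mass *
             (b[i].v[0] * b[i].v[0] +
              b[i].v[1] * b[i].v[1] +
              b[i].v[2] * b[i].v[2]);
    }
    return e;
}

/* Volatile function pointer: every call must re-read the pointer,
 * so the callee is unknown at compile time and cannot be hoisted. */
static double (*volatile energy_fn)(const struct body *, size_t);

void initialise_benchmark(void)
{
    for (size_t i = 0; i < NUM_BODIES; ++i) {
        bodies[i].x[0] = bodies[i].x[1] = bodies[i].x[2] = 0.0;
        bodies[i].v[0] = 1.0;
        bodies[i].v[1] = bodies[i].v[2] = 0.0;
        bodies[i].mass = (double)(i + 1);
    }
    energy_fn = bodies_energy; /* set up outside the timed loop */
}

/* Hypothetical main loop: calls through the pointer each iteration. */
double run_energy_loop(int iterations)
{
    double total = 0.0;
    for (int i = 0; i < iterations; ++i) {
        total += energy_fn(bodies, NUM_BODIES);
    }
    return total;
}
```

A plain (non-volatile) static pointer may still be resolved by whole-program or link-time optimisation, which is why this sketch uses `volatile`. Whether that is acceptable for the benchmark's rules is a separate question.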

@MarkHillHuawei Thanks for this. We are revising the list of benchmarks for Embench 2.0. Options are to fix nbody as you suggest, or to drop it altogether. Do you think this is a good benchmark to keep (when fixed) in the suite?

I think my preference would be to drop this one; I'm sure we can find FP benchmarks that feel a bit more relevant. More generally, I think it would be better to separate the floating point benchmarks from the integer ones and perhaps report two or three scores: integer only, combined, and maybe FP only. This is because many embedded cores have no FPU, so performance is strongly influenced by the quality of the software FP library. That is a useful thing to be able to benchmark, but it is better done separately from an assessment of integer pipeline performance.

Is there any reason that there is no function to update the xyz coordinates? As of now, after the first call to offset_momentum, the velocities don't change for any of the particles in any direction, meaning a compiler could hoist this out of the loop as well. A more complete version of the N-body problem could be one way to stop the compiler from optimizing away the inner main workload loop.
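As a rough illustration of the missing update step, the sketch below moves each body along its velocity once per iteration. The `struct body` layout, the `advance_positions` name, and the `dt` parameter are assumptions, not part of the benchmark. A fuller version would also update the velocities from the pairwise gravitational forces. Provided the energy calculation depends on positions, calling something like this inside the loop makes each iteration depend on the previous one, so the energy call is no longer loop-invariant.

```c
#include <stddef.h>

/* Assumed layout; the real benchmark's struct may differ. */
struct body {
    double x[3];
    double v[3];
    double mass;
};

/* Hypothetical helper: advance every position by one time step dt.
 * Velocities are left unchanged here; a complete N-body step would
 * first update them from the forces between each pair of bodies. */
void advance_positions(struct body *b, size_t n, double dt)
{
    for (size_t i = 0; i < n; ++i) {
        for (int k = 0; k < 3; ++k) {
            b[i].x[k] += dt * b[i].v[k];
        }
    }
}
```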

Instead of running the energy loop 100 times, it may be more beneficial to have more bodies, which increases the opportunity for vectorization.

Slightly unrelated: There is a fill variable defined in the body struct, but its value doesn't seem to be used.