ANL-CESAR/XSBench

Result of computation is never checked -> optimising compilers skew results

Closed this issue · 2 comments

I manually added the LTO options to CFLAGS / LDFLAGS with GCC, and the compiler is smart enough to throw away the computation entirely.

The issue is that the result of calculate_macro_xs (written into macro_xs_vector) is never checked in the main function in Main.c. If I force the results to be generated via asm volatile ("" :: "m"(macro_xs_vector[0]), "m"(macro_xs_vector[1]), ...), the reported performance changes significantly.
Comparing lookups/s on a three-core machine:

$ while true; do res1=$(./XSBench -s small | awk '/Lookups.s:/ {print $2}'); res2=$(./XSBench.force_use -s small | awk '/Lookups.s:/ {print $2}'); echo $res1 $res2; done
6,142,927 920,383
5,513,363 983,074
5,243,478 991,507

This shows an over-6x apparent speed increase caused by the optimised-out computation in the existing code. The numbers on the right are very similar to what I get if I disable LTO.
Please fix the usage of the results, either through this mechanism (an empty asm volatile consuming the data, as in the sketch below) or by keeping a running sum of the results, etc.
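For concreteness, a minimal sketch of such a "consume" macro, assuming GCC/Clang extended inline asm (the array length and the placeholder fill below are illustrative, not XSBench's actual lookup code):

```c
/* Minimal sketch of the proposed "consume" macro. The empty extended
 * asm takes each value as a memory input, so the optimizer has to
 * materialize the results, but no instructions are emitted and nothing
 * is copied. Array length and fill loop are illustrative only. */
#define CONSUME(x) __asm__ __volatile__("" : : "m"(x))

int main(void)
{
    double macro_xs_vector[5];

    /* Placeholder for the real lookup, e.g. calculate_macro_xs(...). */
    for (int i = 0; i < 5; i++)
        macro_xs_vector[i] = 1.0 * i;

    /* Force the compiler to treat the results as used, even under -flto. */
    for (int i = 0; i < 5; i++)
        CONSUME(macro_xs_vector[i]);

    return 0;
}
```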

Thanks for notifying us of this!

I'm able to reproduce the issue when the -flto flag is used for link-time optimization. I've added a copy to the heap in order to ensure that the code behaves as expected when the -flto flag is used.

Note that this issue does not affect the behavior of the code under normal compilation; it only occurs with -flto.
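For readers following along, a rough sketch of this kind of heap-copy workaround (simplified, not the exact change in the repository):

```c
/* Rough sketch of a heap-copy workaround, simplified and not the exact
 * change in the repository: copying the per-lookup results into heap
 * storage gives the optimizer a visible use of macro_xs_vector. The
 * placeholder fill stands in for the real lookup. */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    double macro_xs_vector[5];

    /* Placeholder for the real lookup, e.g. calculate_macro_xs(...). */
    for (int i = 0; i < 5; i++)
        macro_xs_vector[i] = 1.0 * i;

    /* Copy the results to the heap so they count as "used". */
    double *sink = malloc(sizeof macro_xs_vector);
    if (sink != NULL) {
        memcpy(sink, macro_xs_vector, sizeof macro_xs_vector);
        free(sink);
    }

    return 0;
}
```

Whether the optimizer is guaranteed not to see through such a copy is exactly the question raised in the next comment.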

Thanks for fixing this. I do not find the memcpy solution particularly elegant. My proposal with the asm volatile (which can be wrapped in a "consume" macro) would avoid the copy, but it is also a bit of a hack. A very clean running sum of the results, on the other hand, might overserialise the parallel execution.
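For reference, a hedged sketch of what a running sum could look like with an OpenMP reduction, which keeps the accumulation in a per-thread partial sum (the loop body here is a placeholder, not XSBench's actual lookup code):

```c
/* Hedged sketch of a running-sum approach using an OpenMP reduction;
 * the loop body is a placeholder for the real lookup. Each thread
 * accumulates a private partial sum that OpenMP combines once at the
 * end, so the results are observably used without serialising the
 * lookups. Compile with -fopenmp. */
#include <stdio.h>

int main(void)
{
    const long n_lookups = 1000000;
    double verification = 0.0;

    #pragma omp parallel for reduction(+:verification)
    for (long i = 0; i < n_lookups; i++) {
        double macro_xs_vector[5];

        /* Placeholder for the real lookup, e.g. calculate_macro_xs(...). */
        for (int k = 0; k < 5; k++)
            macro_xs_vector[k] = 1.0 * ((i + k) % 7);

        /* Fold the results into the running sum. */
        for (int k = 0; k < 5; k++)
            verification += macro_xs_vector[k];
    }

    printf("verification sum: %f\n", verification);
    return 0;
}
```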

Is there some guarantee that the compiler will not trace through the memcpy and optimise that away too, given that the copied data is never used either?

In either case, this seems to work for now, so thanks for fixing.