eth-cscs/COSMA

timings in cosma_miniapp

airmler opened this issue · 2 comments

I am running the COSMA miniapp on a 72-core Xeon machine with the following parameters:
$parallel_cosma -m 8688 -n 8688 -k 8688 -r 3
The last line of the stdout reads:
COSMA TIMES [ms] = 458 460 771

I am curious about the large spread between the fastest and the slowest multiplication. The fastest time corresponds to about 40 GFLOP/s per core, which is a good number for this machine. The slowest time would imply only about 23 GFLOP/s per core.
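For reference, the per-core rates above can be reproduced with a few lines of Python. This is only a sketch: it assumes the usual 2·m·n·k flop count for a dense matrix multiplication and that all 72 cores take part; the function name `gflops_per_core` is made up for illustration and is not part of COSMA.

```python
def gflops_per_core(m, n, k, time_ms, cores):
    """Per-core rate in GFLOP/s, assuming 2*m*n*k floating-point operations."""
    flops = 2.0 * m * n * k          # one multiply and one add per inner-product term
    seconds = time_ms / 1000.0       # the miniapp reports milliseconds
    return flops / seconds / cores / 1e9

# Timings taken from the "COSMA TIMES [ms]" line above.
times_ms = [458, 460, 771]
rates = [gflops_per_core(8688, 8688, 8688, t, cores=72) for t in times_ms]

for t, r in zip(times_ms, rates):
    print(f"{t} ms -> {r:.1f} GFLOP/s per core")   # about 39.8, 39.6 and 23.6
```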

Am I right that there is a 300 ms overhead for finding the optimal "parallelization strategy"? Which of the two numbers would be fair to compare with other libraries such as ScaLAPACK?

I am aware that this is a very extreme example. But a spread of 10-20% between fastest and slowest number is very typical.

> Am I right that there is a 300 ms overhead finding the optimal "parallelization strategy"?

No, the overhead is very likely due to library initializations during the first run in the miniapp.
Multithreaded MKL is usually the library that introduces the most overhead, as it has to initialize the OpenMP environment and allocate some memory during the first library calls. On certain systems, MPI also introduces some overhead during the first communications.
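A common way to keep such one-time initialization costs out of a measurement is to run one untimed warm-up call before the timed repetitions. The sketch below illustrates the idea in plain Python; `benchmark` and its parameters are hypothetical names chosen for this example, and this is not how the COSMA miniapp itself is implemented.

```python
import time

def benchmark(fn, repetitions=3, warmups=1):
    """Time fn() `repetitions` times after `warmups` untimed calls; returns sorted ms."""
    for _ in range(warmups):
        fn()                                   # absorbs one-time library initialization
    times_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return sorted(times_ms)

# Usage with a stand-in workload; a real test would call the multiplication here.
calls = []
result = benchmark(lambda: calls.append(sum(range(10000))))
print("TIMES [ms] =", " ".join(f"{t:.3f}" for t in result))
```

With the warm-up in place, the remaining spread reflects run-to-run variation rather than startup cost, so the minimum or the median of the timed runs is a reasonable figure to compare across libraries.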

Thanks for the fast clarification.
I conclude that the correct approach is to neglect the slowest number.