carpentries-incubator/hpc-intro

Measuring parallel execution time

Opened this issue · 5 comments

MPI_WTIME provides a very accurate wall-clock timer and may be a better choice than the language intrinsics.

Do we need that accuracy to get our point across, though? It doesn't come without baggage, and I think rule 6 from The Rules applies.

Most MPI codes use MPI_WTIME. It is not essential, but for moderate problem sizes regular timing functions will return 0 where MPI_WTIME will return a measurement. If people do scaling tests, this can be helpful in covering a wider range of problem sizes. Most people have been able to use it without needing extra explanation. Assuming about 10^8 calculations per second on a single core, 1000 samples will take 10^-5 seconds to compute.
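To illustrate the point about timer resolution, here is a minimal sketch in Python (the assumed throughput of 10^8 operations per second and the 0.01 s tick of a coarse clock are illustrative numbers, not measurements). `time.perf_counter()` plays the role of a high-resolution timer like MPI_WTIME:

```python
import time

# Back-of-envelope from the comment above: at an assumed ~1e8
# calculations per second on one core, 1000 samples take ~1e-5 s.
ops_per_second = 1e8   # assumed single-core throughput
samples = 1_000
expected = samples / ops_per_second
print(f"estimated runtime:  {expected:.0e} s")

# A timer that only ticks every 0.01 s (a common granularity for
# coarse clocks) cannot resolve such a short run and reports 0.
tick = 0.01
coarse_reading = (expected // tick) * tick
print(f"coarse clock reads: {coarse_reading} s")

# A high-resolution, monotonic timer (the role MPI_WTIME plays in
# MPI codes; time.perf_counter() is Python's rough analogue)
# returns a nonzero measurement even for this tiny workload.
start = time.perf_counter()
total = sum(i * i for i in range(samples))
elapsed = time.perf_counter() - start
print(f"perf_counter reads: {elapsed:.2e} s")
```

The coarse clock reads 0 while the high-resolution one returns a usable number, which is exactly the failure mode described above for moderate problem sizes.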

The HPC intro is not about doing scaling tests, though, and I imagine that many (even most) participants in a lesson like this are not familiar with MPI. In my opinion, if you were going to use it you would have to explain why, and at that point you have strayed beyond the lesson objectives. MPI is not for the 1000-sample case (why would you need MPI for that?); it's for the 100,000,000-sample case, where the problem either doesn't fit or takes too long.

When correctly written, the solution to this problem using this algorithm should not have any memory issues. Many people reach for an HPC resource prematurely when a more efficient implementation would serve them better. Teaching with inappropriately written code reinforces this tendency.

The lesson discusses Amdahl's law on up to 16 and 100 cores. See the graphs at:
https://carpentries-incubator.github.io/hpc-intro/16-parallel/index.html
As the lesson is hands-on, attendees are also expected to make measurements similar to these.
Measurements similar to those attendees would produce (without a graph) can be found at:
https://github.com/bkmgit/hpc-intro/blob/bkmgit-parallel-fortran/_episodes/16-parallel-fortran.md
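For context, Amdahl's law bounds the speedup on n cores when only a fraction p of the runtime can be parallelized. A minimal sketch (the p values here are illustrative, not taken from the lesson's measurements):

```python
# Amdahl's law: speedup on n cores when a fraction p of the serial
# runtime is parallelizable and the rest must run serially.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Illustrative serial fractions, evaluated at the core counts the
# lesson's graphs use (16 and 100 cores).
for p in (0.90, 0.99):
    for n in (16, 100):
        print(f"p={p:.2f}, n={n:3d}: speedup = {amdahl_speedup(p, n):.1f}x")
```

Even with 99% of the work parallelized, 100 cores give only about a 50x speedup, which is the kind of diminishing return the graphs in the lesson show.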

This is a good example of the tension between pedagogical goals and technical goals. @bkmgit is right that there are better technical solutions to this problem, but our goal here is to convey the mechanics of parallel programming: to establish the connection between how parallel resources are allocated by the user through resource-manager commands and how they appear to the application program, which then uses them to gain a performance boost.

Hopefully using a "calculating pi" example makes this clear, since after all our learners are likely aware that the value of pi is already known to many digits. This helps reinforce that the result of computing pi is not interesting, it's the process that is valuable.
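For readers who haven't seen the lesson, the pi example is, in outline, a Monte Carlo estimate: sample random points in the unit square and count how many land inside the quarter circle. A serial sketch (the lesson's actual code may differ):

```python
import random

# Serial outline of a Monte Carlo pi estimate: the fraction of
# random points in the unit square that fall inside the quarter
# circle approaches pi/4 as the sample count grows. Illustrative
# only; the lesson's implementation may differ in detail.
def estimate_pi(samples, seed=42):
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```

Because the samples are independent, the loop splits naturally across MPI ranks, which is what makes it a convenient vehicle for teaching the mechanics rather than the result.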

There's lots of room for discussion of Amdahl's law and algorithm selection in a follow-on lesson.