This project gathers all efforts regarding platform calibration for HPC experiments with SimGrid’s SMPI.
The calibrate
binary is an MPI application that introduces a bunch of
MPI experiments. The goal is to profile the amount of time taken to
execute some MPI operations in a real environment. The timings for such
operations are obtained using the most precise clock available in the
local node where the binary is executed. Measurements, along with their
corresponding parameters such as message size, are dumped to multiple
CSV files, one per operation. In a second stage, after the experiment
has finished, an R script takes as input those measurements and a break
points textual file informing the message sizes where the MPI protocol
changes. As output, the R script generates the values that need to be
included to the corresponding platform file, in order to obtain a
faithful simulation.
- Blocking Receive (R): time spent inside the MPI_Recv function. During the benchmark initialization, through a ping-pong operation, the program measures how much time the sender must sleep between two benchmark procedures. Such sleep time is to be sure the message that has been sent has arrived in the receiver side. After that, for a number of replications, the program measures how long it took to execute the blocking receive (assuming the message has already available). To guarantee all replications are finished correctly, at the end, a message is sent to the receiver side.
- Asynchronous Send (AS): time spent inside the MPI_Isend asynchronous operation. This measurement captures the time spent in the buffering data and sending it to the network card. For a number of replications, the program measures how long it took to execute this operation when changing the message size.
- Ping-Pong (PP): time spent to execute a sequence of a MPI_Send followed by a MPI_Recv, representing together a ping pong operation. The time spent in each of these two operations (send and receive) is registered separately. The time spent for the whole synchronous ping-pong can be obtained by summing these two durations.
- Other operations (OT): The network-unrelated functions MPI_Test, MPI_Iprobe and MPI_Wtime are also profiled during calibration since their CPU timings are neces
The R and AS measurements (see above) represents the software overhead of the receive and send operations. Since the R is measured when the message is already available to be actually received, it is assumed that its duration consists exclusively in the message handling. The same is verified for the asynchronous send. Since the operation returns to the caller as soon as everything has been setup for the actual message transmission, we consider that it contains only the software overhead related to sending a message. Therefore, these two operations account for the software overhead, unrelated to network modeling itself. The network latency and bandwidth are calculated using the results of the PP measurements. Since the PP operation uses MPI_Send, we can check coherence of its behavior against the AS measurements (MPI_Isend).
Addressing variability requires that multiple runs for each operation. The
number of replications can be set through the parameter --nb_runs
. Message
sizes are taken in the order established in an input file whose name is passed
as parameter, using the --sizeFile
parameter. The zoo_sizes
file that is already
in this repository provides a comprehensive set of sizes, with a good
repartition for all magnitudes. Regardless of the message sizes available in
the input file, we can filter them by passing --min_size
followed by the minimum
size and --max-size
for the maximum. Since sizes can go up to 1GB in the
furnished size file, the analyst can create a cut to reduce experimental time.
Call the calibrate
binary passing the --help
option. You’ll get something like
this:
$ ./calibrate --help Usage: calibrate [OPTION...] Runs MPI point to point benchmarks to compute values used for SMPI calibration -d, --dir_name=dir_name Name/path of the directory to save files into. No trailing slashes. -f, --filename=FILENAME XML filename -m, --min_size=MIN SIZE Minimum size to process -M, --max_size=MAX SIZE Maximum size to process -n, --nb_runs=NB RUNS number of times you want to execute the run -p, --prefix=PREFIX prefix of the csv files -s, --sizeFile=SIZEFILE filename of the size list -?, --help Give this help list --usage Give a short usage message Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.
This example relies on Python (version >= 3.6), but could be easily adapter to R. If the experiment is done on Grid’5000, we encourage the use of the peanut experiment engine.
Quoting a thesis:
SMPI is in charge of modeling the MPI behavior, both in terms of semantic and performance. The complex network optimizations done in real MPI implementations need to be considered when predicting the performance of MPI point-to-point communications. For instance, the behavior of
MPI_Send
depends on the message size. The smallest messages are sent asynchronously, meaning thatMPI_Send
returns immediately and the data is transferred without waiting. The largest messages are sent synchronously, i.e. the call toMPI_Send
will be blocking, waiting for the correspondingMPI_Recv
call to be made. The medium messages are sent in a detached way, the call toMPI_Send
returns immediately but the data is only transferred when the correspondingMPI_Recv
call is made. These different protocols strongly impact the communication performance, it is therefore important to model them accurately.
In SMPI, this is controlled by these two variables (that should be defined in the XML platform file):
<prop id="smpi/async-small-thresh" value="64000"/>
<prop id="smpi/send-is-detached-thresh" value="64000"/>
To find these breakpoints, one can use the find_breakpoints.py script like this:
make
python find_breakpoints.py -np 2
It calls mpirun
multiple times, doing a binary search for finding the two
breakpoints. Any argument given after find_breakpoints.py
will passed to mpirun
as is (in this example, -np 2
).
Note: it is possible for the two semantic breakpoints to be equal, or for one of them to be missing, this simply means that there are only two different semantic modes instead of three.
The following code generates a file, /tmp/exp_mpi.csv
, that contains the list of
MPI calls to perform.
import itertools
import random
random.seed(42)
filename = '/tmp/exp_mpi.csv'
op = ['Recv', 'Isend', 'PingPong', 'Wtime', 'Wtime', 'Iprobe', 'Test']
sizes = [int(10**random.uniform(0, 9)) for _ in range(100)]
exp = list(itertools.product(op, sizes))
exp *= 20
random.shuffle(exp)
with open('/tmp/exp_mpi.csv', 'w') as f:
for op, size in exp:
f.write(f'{op},{size}\n')
First, clone this repository and compile the program on both nodes:
git clone https://framagit.org/simgrid/platform-calibration.git
cd platform-calibration/src/calibration
make calibrate
Then, run the program (you might need to add other arguments to mpirun, like a hostfile and maybe a rankfile):
mkdir /tmp/exp
mpirun -np 2 ./calibrate -d /tmp/exp -m 0 -M 1000000000 -p exp -s /tmp/exp_mpi.csv
This will create several result files in the directory /tmp/exp
.
An alternative way to run the experiment on Grid’5000 is to use peanut. First, connect to one of Grid’5000 frontends and install peanut:
pip3 install --user git@github.com:Ezibenroc/peanut.git
Then, in addition of the file /tmp/exp_mpi.csv
(see above), create the following
file, named /tmp/install_mpi.yaml
:
monitoring: 0 # positive number N to enable monitoring with a period N
background_stress: False # True to add calls to dgemm in the background
openmpi: distribution_package # distribution_package for a 'apt install openmpi', a version string (like "4.1.0") for an installation from source (warning: experimental)
Then, run the following command. It submits a job that will perform the experiment. Change the cluster name as needed.
peanut MPICalibration run $(whoami) --deploy debian10-x64-base --cluster dahu --nbnodes 2 --walltime 01:00:00 --expfile /tmp/exp_mpi.csv --installfile /tmp/install_mpi.yaml --batch
You can query the scheduler to see if/when your job is scheduled: oarstat -u -f
Once the job is running, the file OAR.MPICalibration.*.stderr
will contain the
log output of the job, you can inspect this file to see what happens.
In the end, a zip
archive will be created in your home directory, containing the
experiment data as well as metadata.
The analysis of the experiment data and the generation of the platform file is demonstrated in notebook mpi_calibration.ipynb.
File pyproject.toml list the high-level dependencies of this notebook while file poetry.lock is a lock-file of the exact versions used for all the dependencies (based on Poetry).
A simple way to install all the dependencies and run the notebook using poetry
is to simply run the following command in the notebooks
directory:
poetry install
poetry run jupyter lab
Forthcoming.