Platform Calibration (for SimGrid)

This project gathers all efforts regarding platform calibration for HPC experiments with SimGrid’s SMPI.

Table of content

Network calibration
- Explanations
- Complete example
Dgemm calibration

Network calibration

The calibrate binary is an MPI application that introduces a bunch of MPI experiments. The goal is to profile the amount of time taken to execute some MPI operations in a real environment. The timings for such operations are obtained using the most precise clock available in the local node where the binary is executed. Measurements, along with their corresponding parameters such as message size, are dumped to multiple CSV files, one per operation. In a second stage, after the experiment has finished, an R script takes as input those measurements and a break points textual file informing the message sizes where the MPI protocol changes. As output, the R script generates the values that need to be included to the corresponding platform file, in order to obtain a faithful simulation.

Explanations

The Profiled Operations

Blocking Receive (R): time spent inside the MPI_Recv function. During the benchmark initialization, through a ping-pong operation, the program measures how much time the sender must sleep between two benchmark procedures. Such sleep time is to be sure the message that has been sent has arrived in the receiver side. After that, for a number of replications, the program measures how long it took to execute the blocking receive (assuming the message has already available). To guarantee all replications are finished correctly, at the end, a message is sent to the receiver side.
Asynchronous Send (AS): time spent inside the MPI_Isend asynchronous operation. This measurement captures the time spent in the buffering data and sending it to the network card. For a number of replications, the program measures how long it took to execute this operation when changing the message size.
Ping-Pong (PP): time spent to execute a sequence of a MPI_Send followed by a MPI_Recv, representing together a ping pong operation. The time spent in each of these two operations (send and receive) is registered separately. The time spent for the whole synchronous ping-pong can be obtained by summing these two durations.
Other operations (OT): The network-unrelated functions MPI_Test, MPI_Iprobe and MPI_Wtime are also profiled during calibration since their CPU timings are neces

Network Modeling

The R and AS measurements (see above) represents the software overhead of the receive and send operations. Since the R is measured when the message is already available to be actually received, it is assumed that its duration consists exclusively in the message handling. The same is verified for the asynchronous send. Since the operation returns to the caller as soon as everything has been setup for the actual message transmission, we consider that it contains only the software overhead related to sending a message. Therefore, these two operations account for the software overhead, unrelated to network modeling itself. The network latency and bandwidth are calculated using the results of the PP measurements. Since the PP operation uses MPI_Send, we can check coherence of its behavior against the AS measurements (MPI_Isend).

The Message Sizes and Variability Check

Addressing variability requires that multiple runs for each operation. The number of replications can be set through the parameter --nb_runs. Message sizes are taken in the order established in an input file whose name is passed as parameter, using the --sizeFile parameter. The zoo_sizes file that is already in this repository provides a comprehensive set of sizes, with a good repartition for all magnitudes. Regardless of the message sizes available in the input file, we can filter them by passing --min_size followed by the minimum size and --max-size for the maximum. Since sizes can go up to 1GB in the furnished size file, the analyst can create a cut to reduce experimental time.

CLI Help and Parameter Description

Call the calibrate binary passing the --help option. You’ll get something like this:

$ ./calibrate --help
Usage: calibrate [OPTION...]
Runs MPI point to point benchmarks to compute values used for SMPI calibration

  -d, --dir_name=dir_name    Name/path of the directory to save files into. No
                             trailing slashes.
  -f, --filename=FILENAME    XML filename
  -m, --min_size=MIN SIZE    Minimum size to process
  -M, --max_size=MAX SIZE    Maximum size to process
  -n, --nb_runs=NB RUNS      number of times you want to execute the run
  -p, --prefix=PREFIX        prefix of the csv files
  -s, --sizeFile=SIZEFILE    filename of the size list
  -?, --help                 Give this help list
      --usage                Give a short usage message

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Complete example

This example relies on Python (version >= 3.6), but could be easily adapter to R. If the experiment is done on Grid’5000, we encourage the use of the peanut experiment engine.

Finding the semantic breakpoints

Quoting a thesis:

SMPI is in charge of modeling the MPI behavior, both in terms of semantic and performance. The complex network optimizations done in real MPI implementations need to be considered when predicting the performance of MPI point-to-point communications. For instance, the behavior of MPI_Send depends on the message size. The smallest messages are sent asynchronously, meaning that MPI_Send returns immediately and the data is transferred without waiting. The largest messages are sent synchronously, i.e. the call to MPI_Send will be blocking, waiting for the corresponding MPI_Recv call to be made. The medium messages are sent in a detached way, the call to MPI_Send returns immediately but the data is only transferred when the corresponding MPI_Recv call is made. These different protocols strongly impact the communication performance, it is therefore important to model them accurately.

In SMPI, this is controlled by these two variables (that should be defined in the XML platform file):

<prop id="smpi/async-small-thresh" value="64000"/>
<prop id="smpi/send-is-detached-thresh" value="64000"/>

To find these breakpoints, one can use the find_breakpoints.py script like this:

make
python find_breakpoints.py -np 2

It calls mpirun multiple times, doing a binary search for finding the two breakpoints. Any argument given after find_breakpoints.py will passed to mpirun as is (in this example, -np 2).

Note: it is possible for the two semantic breakpoints to be equal, or for one of them to be missing, this simply means that there are only two different semantic modes instead of three.

Generating the experiment file

The following code generates a file, /tmp/exp_mpi.csv, that contains the list of MPI calls to perform.

import itertools
import random

random.seed(42)
filename = '/tmp/exp_mpi.csv'

op = ['Recv', 'Isend', 'PingPong', 'Wtime', 'Wtime', 'Iprobe', 'Test']
sizes = [int(10**random.uniform(0, 9)) for _ in range(100)]
exp = list(itertools.product(op, sizes))
exp *= 20
random.shuffle(exp)

with open('/tmp/exp_mpi.csv', 'w') as f:
    for op, size in exp:
        f.write(f'{op},{size}\n')

Running the experiment (classical)

First, clone this repository and compile the program on both nodes:

git clone https://framagit.org/simgrid/platform-calibration.git
cd platform-calibration/src/calibration
make calibrate

Then, run the program (you might need to add other arguments to mpirun, like a hostfile and maybe a rankfile):

mkdir /tmp/exp
mpirun -np 2 ./calibrate -d /tmp/exp -m 0 -M 1000000000 -p exp -s /tmp/exp_mpi.csv

This will create several result files in the directory /tmp/exp.

Running the experiment (with peanut)

An alternative way to run the experiment on Grid’5000 is to use peanut. First, connect to one of Grid’5000 frontends and install peanut:

pip3 install --user git@github.com:Ezibenroc/peanut.git

Then, in addition of the file /tmp/exp_mpi.csv (see above), create the following file, named /tmp/install_mpi.yaml:

monitoring: 0                  # positive number N to enable monitoring with a period N
background_stress: False       # True to add calls to dgemm in the background
openmpi: distribution_package  # distribution_package for a 'apt install openmpi', a version string (like "4.1.0") for an installation from source (warning: experimental)

Then, run the following command. It submits a job that will perform the experiment. Change the cluster name as needed.

peanut MPICalibration run $(whoami) --deploy debian10-x64-base --cluster dahu --nbnodes 2 --walltime 01:00:00 --expfile /tmp/exp_mpi.csv --installfile /tmp/install_mpi.yaml --batch

You can query the scheduler to see if/when your job is scheduled: oarstat -u -f

Once the job is running, the file OAR.MPICalibration.*.stderr will contain the log output of the job, you can inspect this file to see what happens.

In the end, a zip archive will be created in your home directory, containing the experiment data as well as metadata.

Analyzing the experiment

The analysis of the experiment data and the generation of the platform file is demonstrated in notebook mpi_calibration.ipynb.

File pyproject.toml list the high-level dependencies of this notebook while file poetry.lock is a lock-file of the exact versions used for all the dependencies (based on Poetry).

A simple way to install all the dependencies and run the notebook using poetry is to simply run the following command in the notebooks directory:

poetry install
poetry run jupyter lab

Dgemm calibration

Forthcoming.

Ezibenroc/platform-calibration