
Repository for the IEEE HPEC 2021 article entitled "Exploring the Tradeoff Between Reliability and Performance in HPC Systems" by C. Walker et al.

Primary language: Python · License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause)

simulator

Our work makes use of Inria's Batsim (https://batsim.readthedocs.io/) simulator. We have added a node fault model and simulated job checkpoint / restart in order to more easily explore the trade-offs between performance and reliability. This repo is associated with our IEEE HPEC 2021 submitted article entitled "Exploring the Tradeoff Between Reliability and Performance in HPC Systems."

Scripts are provided to apply patches to the original Batsim source and to run experiments matching those presented in our article. They are packaged to be applied, built, and executed in a dockerized format for ease of use and for replication of the experimental data presented in the HPEC article. Those who wish to learn more about native Batsim are encouraged to visit the Batsim homepage directly.


How to build the docker

  1. Clone this repo: git clone https://github.com/hpec-2021-ccu-lanl/simulator.git
  2. Enter the directory: cd simulator
  3. Build the docker and name the image "simulator": docker build . -t simulator

How to run the docker

  1. Create the docker container based on the "simulator" image and name it "batsim_docker": docker create --name batsim_docker -t simulator
  2. Start the docker container: docker start batsim_docker
  3. Start an interactive shell: docker exec -it batsim_docker /bin/bash

Test the batsim_docker

Before we test our docker, please bear with us on some confusing terminology:

An "experiment", as far as the config file is concerned, is a JSON element that has an input and an output. You can define multiple experiments in one config file. Below, the "experiment" is "test", and all data for that experiment will be placed in the folder "test".

Each "job", as it relates to the config file, is one set of parameters used for a simulation. For example, a simulation having a cluster with 1500 nodes vs a simulation having a cluster with 1600 nodes are two different "jobs". Similarly two simulations both having 1500 nodes but differing on SMTBF are two different "jobs". The confusion here is that "jobs" in this sense are titled "experiment_#" and so are their folder that comes under the "experiment" folder.

"Runs" are simulations in the same "experiment" that have the exact same parameters and so come under the same "job" and are used for averaging purposes. To further complicate things, there are the "simulated jobs". These are part of the workloads that the simulator is running.

Ok, that is out of the way

Test the batsim_docker to see if it gives you the correct results. This makes sure the docker is running properly, and it also gives you a chance to see how the process of running simulations goes. We will use the config file "test_docker.config":

{  "test":{
            "input": {
                        "node-sweep":{
                                "range":[1490]
                        },
                        "SMTBF-sweep":{
                                "compute-SMTBF-from-NMTBF":true,
                                "formula":"128736000 * (1/i)",
                                "range":[1,8]
                        },
                        "checkpoint-sweep":{
                                "range":["optimal"]
                        },
                        "performance-sweep":{
                                "range":[1.0]
                        },
                        "checkpointing-on":true,
                        "synthetic-workload":{
                                "number-of-jobs":30000,
                                "number-of-resources":"/home/sim/basefiles/workload_types/wl2.csv:0:csv",
                                "duration-time":"/home/sim/basefiles/workload_types/wl2.csv:1:h:csv",
                                "submission-time":"0:fixed",
                                "wallclock-limit":-1,
                                "dump-time":"3%",
                                "read-time":"2%"
                        }
          },
          "output": {
                        "AAE":true,
                        "avg-makespan":1
          }
   }
}

Just a real quick intro to this config file...

  • We are sweeping over nodes, but really there is no sweep, since we use a fixed "range" whose list contains only one value, [1490]. There are tools available to do a real sweep, but we won't get into that just yet.

  • We use a formula for the (S)ystem (M)ean (T)ime (B)etween (F)ailure (see the short sketch after this list).
    • 128736000 * (1/i), where "i" is replaced with each value in the "range": [1,8]
    • So we will have 128736000 / 1 and 128736000 / 8
    • For clarification: 128,736,000 node-seconds = 24 system hours * 3600 seconds/hour * 1490 nodes/system
      • So this is a system failure rate of 1 failure every 24 hours for the baseline ( "range" : [1] ) and for 8x worse ( "range" : [8] )

  • We let the simulator compute the optimal checkpointing interval for each job
  • We set the speed of the system to 1.0 (normal speed), where higher is slower (a 30% faster and a 30% slower system would be 0.70 and 1.30, respectively)
  • We turn on the checkpointing-on option for all jobs.
  • We set up a workload
  • We tell it what kind of output we need
    • Average Application Efficiency
    • makespan with only 1 run (replace "1" with "200" for 200 runs)
      • We did not pass the seed-failures option, so our results will be deterministic and we only need 1 run. Normally we want to average at least 200 runs when results are not deterministic. However, we need to confirm exact numbers for this test, so we keep it deterministic by leaving the seed-failures option off.
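
To make the SMTBF arithmetic concrete, here is a minimal Python sketch (not part of the repo; it simply re-does the math above, assuming the usual relation SMTBF = NMTBF / number of nodes):

    num_nodes = 1490                    # from the "node-sweep" range
    for i in [1, 8]:                    # from the "SMTBF-sweep" range
        nmtbf = 128736000 * (1 / i)     # the "formula" value, in node-seconds (per-node MTBF)
        smtbf = nmtbf / num_nodes       # system MTBF in seconds
        print(f"i={i}: NMTBF={nmtbf:.0f} s/node, SMTBF={smtbf:.0f} s (~{smtbf/3600:.0f} system hours)")

    # i=1 -> SMTBF = 86400 s = one failure per 24 system hours (the baseline)
    # i=8 -> SMTBF = 10800 s = one failure every 3 system hours (8x worse)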

Ok, let's run this test

  1. You should already be in the /home/sim/basefiles directory. If not, head there: cd /home/sim/basefiles
  2. Set two variables to make things easier:
    file1=./configs/test_docker.config
    folder1=test_docker
  3. Now the workloads are made, the input/output folders are made for each run, the underlying config files are made, and then the simulations begin: python3 run_simulation.py --config $file1 --output ~/experiments/$folder1
  • since the docker is not a cluster and simulations run sequentially, there is a handy counter that is flushed to output before every simulation. For example:
    Experiment 1/1
    Job 1/2
    Run 1/1

    ...

    Experiment 1/1
    Job 2/2
    Run 1/1
  4. We have results, but they need to be aggregated: python3 aggregate_makespan.py -i ~/experiments/$folder1
  • After running this you should have a file called "total_makespan.csv" under the ~/experiments/$folder1 folder
  5. Now, to make it easier to determine whether we have the correct results, run the following:
cat ~/experiments/$folder1/total_makespan.csv | \
awk -F, 'BEGIN{printf "\n"} (NR>1){printf "%f\t%s,%s\n",$5,$7,$8} END{printf "\n"}'

You should get:


4896980.836070  "56 days, 16:16:20"            <----baseline makespan (displayed as seconds, then days,hh:mm:ss)
7861976.738434  "90 days, 23:52:56"            <----8x worse failures makespan (displayed as seconds, then days,hh:mm:ss)

If you want to know how much worse the makespan is for the worse failure rate, just divide the 8x-worse makespan (use the seconds) by the baseline makespan (use the seconds).
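
For the two results above, that works out to roughly 1.6; e.g. in Python:

    baseline = 4896980.836070    # baseline makespan in seconds (1x SMTBF)
    worse_8x = 7861976.738434    # makespan in seconds with 8x worse SMTBF
    print(worse_8x / baseline)   # ~1.605, i.e. the makespan is about 60% longer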

So that's basically it

That is all that is needed to run simulations with the docker. You basically edit the config file, set the file1/folder1, run "run_simulation.py", aggregate the results, and do something with the aggregation.

total_makespan.csv clarification

Let's make total_makespan.csv clear. Here are definitions for each field (a short pandas sketch for reading the file follows this list):

The first line is the header:
,nodes,SMTBF,NMTBF,makespan_sec,avg_tat,makespan_dhms,avg_tat_dhms,AAE,checkpointed_num,percent_checkpointed,checkpointing_on_num,checkpointing_on_percent,job,exp

starting with "nodes"

  • nodes
    • That's easy, just the number of nodes the system had for that job
  • SMTBF
    • System Mean Time Between Failure for that job
  • NMTBF
    • Node Mean Time Between Failure for that job. This is easier to look at and use for grouping as it doesn't matter how many nodes the system had for that job
  • makespan_sec
    • makespan in seconds
  • avg_tat
    • average Turn Around Time in seconds
  • makespan_dhms
    • makespan in days, hours:minutes:seconds format. Keep in mind that there is a comma inside this value, which may or may not mess up naive comma-separated-values parsing
  • avg_tat_dhms
    • similar to makespan_dhms, it is the Turn Around Time in days, hours:minutes:seconds format
  • AAE
    • The Average-Average Application Efficiency of the job
  • checkpointed_num
    • The average number of jobs that had to be restarted (possibly multiple times) and had checkpointed before failing (so they were able to be restarted by reading the checkpoint data)
  • percent_checkpointed
    • Same as checkpointed_num except given as a percent
  • checkpointing_on_num
    • If the global checkpointing interval is 4 hours but an individual job in the simulation only runs for 1 hour, then no checkpointing takes place. The same situation can happen with "optimal", since it depends on dump time and not on run time. In this situation we call checkpointing "off" for that individual job. This field is the average number of jobs that were long enough to incorporate checkpointing and were therefore "on".
  • checkpointing_on_percent
    • same as checkpointing_on_num except given as a percent
  • number_of_jobs
    • The amount of jobs in the workload for that experiment
  • utilization
    • The average utilization of the simulated system over the simulation's makespan.
  • job
    • The "job" as it relates to the config file (see the terminology explained in the "Test the batsim_docker" section above): one set of parameters used for a simulation. This field will be named "experiment_#", matching the job's folder under the "experiment" folder.
  • exp
    • This is the "experiment" that the job belongs to. Read "job" above for clarification.

How to run some simulations

  1. Start the docker container: docker start batsim_docker
  2. Start an interactive shell: docker exec -it batsim_docker /bin/bash
  3. Change directory to "basefiles": cd /home/sim/basefiles
  4. Choose a config file and optionally edit it (*below): nano ./configs/figure4_left_wl4.config
  5. Set the config file you wish to run: file1=./configs/figure4_left_wl4.config
  6. Set the output folder you wish the output to go to (a new folder): folder1=NAME
  7. Run the simulation: python3 run_simulation.py --config $file1 --output ~/experiments/$folder1
  8. Aggregate results if need be: python3 aggregate_makespan.py -i ~/experiments/$folder1
  9. Process the results in ~/experiments/$folder1/total_makespan.csv

How To Edit Config File

* Instructions for editing config files can be seen by running the following commands:

  • view general info on config files: python3 generate_config.py --config-info general
  • view general info on sweeps: python3 generate_config.py --config-info sweeps
  • All --config-info options can be seen by running: python3 generate_config.py --help
    • Under Required Options 1 -> --config-info <type> you can see the various types of info that are offered
    • generate_config.py will not generate your config file for you. It is called that because it takes a config file that you will need to write and generates the underlying config files the simulator needs.

For example:

  • Run a modified Figure 4, left-hand subfigure for workload 4:
    • file1=./configs/figure4_left_wl4.config
    • folder1=fig4_left_wl4
    • python3 run_simulation.py --config $file1 --output ~/experiments/$folder1
      • This command took about 1.5 hours on my computer
    • python3 aggregate_makespan.py -i ~/experiments/$folder1
    • Now divide each makespan_sec in total_makespan.csv by the makespan_sec for the baseline (the first job, "experiment_1"). You should get roughly what is in Figure 4, left-hand subfigure, for workload 4. Keep in mind, this is only 10 runs; the paper used 1500 runs.
      • The following will do these calculations for you:
        baseline=`cat ~/experiments/$folder1/total_makespan.csv | awk -F, '(NR>1){print $5}' | awk '(NR==1){print}'` && \
        cat ~/experiments/$folder1/total_makespan.csv | \
        awk -F, '(NR>2){print $5}' | \
        awk -v baseline=$baseline 'BEGIN{i=1} {printf "%f\tworse failures:%dx\n", $1/baseline, 2^i; i=i+1}'
        
        This is what I get running these commands:
        1.015441        worse failures:2x
        1.049923        worse failures:4x
        1.100013        worse failures:8x
        1.188432        worse failures:16x
        1.328728        worse failures:32x
        

Monte Carlo

We used a cluster of 13 nodes with 30 processing cores on each node to get our results. This is mainly because we wanted a statistical average for our makespans, since they are based on random events. In most simulations we did, we found the makespan converged at around 200 runs. For added assurance we did 1500 runs for each data point in the paper. With each run usually taking a couple of minutes, we produced data points at a rate of about 1 every 10 minutes. Sweeping the 1, 2, 4, 8, 16, 32x SMTBF values takes around an hour or two per workload. We were looking at around 8 or 10 hours for all 6 workloads for the 13-day baseline, and somewhat longer for the 24-hour baseline (more failures equates to longer simulation time).

For this reason, you may want to consider a cluster for your tests. Our cluster uses SLURM, and we leverage it for parallelization. These steps and instructions need to be understood so that you can adapt them to your Linux distro and cluster situation.


How To Run Monte Carlo Simulations

Prepare

If you haven't yet, read over simulator/monte_carlo/README_FILES.txt to see what each file does.
In particular, look over deploy.sh and generate_config.py
You will undoubtedly need to edit deploy.sh for your needs.

Install

Change directory to your user directory:
cd /home/$USER
Copy deploy.sh into your user directory:
cp ./monte_carlo/deploy.sh /home/$USER/deploy.sh
Copy the batsim and batsched patches to Downloads:
cp ./monte_carlo/patch_batsim.patch ~/Downloads/patch_batsim.patch
cp ./monte_carlo/patch_batsched.patch ~/Downloads/patch_batsched.patch
Run deploy.sh (again, this will need to be edited first):
./deploy.sh

The environment we used:

  • gcc: 10.2.0
  • Kernel: 3.10.0-1160.6.1.el7.x86_64
  • python: 3.6.8

How to Run A Simulation On Our AC-Cluster

Change directory to simulator/monte_carlo
cd monte_carlo
If you don't have an experiments folder already:
mkdir ~/experiments

Generic Example:

file1=./configs/configFileName.config
Make sure the folder name is different each time
folder1=~/experiments/configFileName
base=`pwd`

sbatch -p usrc-nd02 -N1 -n1 -c1 --output=./myBatch.log \
--export=folder1=$folder1,file1=$file1,basefiles=$base \
 ./myBatch

Actual Test Simulation (3 Runs of each simulation)

file1=./configs/MC_test.config
folder1=~/experiments/MC_test
base=`pwd`
Run this once. It changes a path that is needed for the simulation workload:
sed -i "s:TODO:$base/:g" ./configs/MC_test.config

sbatch -p usrc-nd02 -N1 -n1 -c1 --output=./myBatch.log \
--export=folder1=$folder1,file1=$file1,basefiles=$base \
 ./myBatch

Actual Simulation (1500 Runs of each simulation)

file1=./configs/MC.config
folder1=~/experiments/MC
base=`pwd`
Run this once. It changes a path that is needed for the simulation workload:
sed -i "s:TODO:$base/:g" ./configs/MC.config

sbatch -p usrc-nd02 -N1 -n1 -c1 --output=./myBatch.log \
--export=folder1=$folder1,file1=$file1,basefiles=$base \
 ./myBatch

Make sure myBatch is running (i.e., there were no JSON problems in the config file)
squeue --format="%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R %.120k"

Continue running squeue to check whether jobs are completing and to tell when they have all completed
squeue --format="%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R %.120k" | tail -n 10

Aggregate Results
python3 aggregate_makespan.py -i $folder1

Analyze $folder1/total_makespan.csv

  • group by "exp" : so you will have wl[1-6]_24hr and wl[1-6]_13d
    • group by "job" in each "exp" grouping
      • The job "experiment_1" will be the baseline
      • The job "experiment_2" will be the 2x
      • The job "experiment_3" will be the 5x
    • divide every "experiment_2"'s "makespan_sec" by "experiment_1"'s "makespan_sec"
    • divide every "experiment_3"'s "makespan_sec" by "experiment_1"'s "makespan_sec"
    • dividing can be done with an awk command (a pandas sketch that does the same thing is shown after this list)
      • First get the baseline by replacing NR==1 with NR==# where # equals the line that is the baseline
        baseline=`cat $folder1/total_makespan.csv | awk -F, '(NR==1){print $5}'`
      • Next do the division. Replace NR==2 with NR==# where # equals the line that you are dividing by the baseline:
        cat $folder1/total_makespan.csv | awk -F, -v baseline=$baseline '(NR==2){printf "%f",$5/baseline}'
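
If pandas is available, the same grouping and normalization can be sketched in a few lines (an illustration only, not a repo script; it assumes the column names from the header described in the docker section):

    import pandas as pd

    df = pd.read_csv("total_makespan.csv")

    # within each "exp" (e.g. wl1_24hr), divide every job's makespan by the
    # baseline job's ("experiment_1") makespan
    for exp_name, group in df.groupby("exp"):
        baseline = group.loc[group["job"] == "experiment_1", "makespan_sec"].iloc[0]
        for _, row in group.iterrows():
            print(exp_name, row["job"], row["makespan_sec"] / baseline)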
        

INFO ON HOW TO CHANGE CONFIG FILES

This is covered in the docker sections, but here it is again:

python3 generate_config.py --config-info [ general | sweeps |
                                            node-sweep | SMTBF-sweep | checkpoint-sweep | checkpointError-sweep | performance-sweep |
                                            grizzly-workload | synthetic-workload |
                                            input-options | output ]