This repository contains the artifact for our ASPLOS '23 paper "Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs". It includes the following parts:
- simulation: Code and data for reproducing the key results in our paper.
- workloads: The PyTorch implementation of the 14 workloads used in the experiments.
- profile: Code for collecting traces of each training job type.
simulation (adapted from Helios) contains instructions for reproducing the Venus cluster experiments shown in Section 4. These scripts have been tested on Ubuntu 20.04 with Python 3.9.
The contents of the simulation folder are summarized as follows:
- data/ contains the Venus cluster job trace and cluster configuration used for evaluation.
- analyzer/ contains the Packing Analyze Model and profiled workloads information used in our experiment.
- estimator/ contains the Workload Estimate Model and job duration estimation for both Lucid and QSSF.
- plot/ contains the notebook for visualizing experiment results.
- policy/ contains implementations of the Lucid scheduling policy, and baseline policies including FIFO, SJF, QSSF, Tiresias.
- predictor/ contains the Throughput Predict Model and cluster throughput estimation in Venus September.
- profiler/ contains the Least-GPU-First and Auto-Scaling Profiler implementation for Lucid.
- cluster.py, job.py and updater.py contain implementations of the GPU cluster and workload logic.
- simulator.py is the main entry of the simulator.
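To give a feel for how two of the baseline policies listed above differ, here is a minimal, non-preemptive single-GPU sketch of FIFO and SJF ordering. It is only an illustration: the `Job` tuple, the `average_jct` function and the toy workload are hypothetical and are not taken from the policy/ folder, which handles a multi-GPU cluster and more policies.

```python
from typing import List, Tuple

# (name, submit_time, duration): a hypothetical toy job description.
Job = Tuple[str, int, int]


def average_jct(jobs: List[Job], policy: str) -> float:
    """Run jobs one at a time on a single GPU and return the average JCT."""
    if not jobs:
        raise ValueError("jobs must not be empty")
    pending = sorted(jobs, key=lambda j: j[1])  # stable sort by submit time
    clock, total = 0, 0
    while pending:
        # Only jobs that have already been submitted can be scheduled.
        ready = [j for j in pending if j[1] <= clock]
        if not ready:
            clock = pending[0][1]  # idle until the next submission
            continue
        if policy == "fifo":
            job = min(ready, key=lambda j: j[1])  # earliest submission first
        elif policy == "sjf":
            job = min(ready, key=lambda j: j[2])  # shortest duration first
        else:
            raise ValueError(f"unknown policy: {policy}")
        pending.remove(job)
        clock += job[2]
        total += clock - job[1]  # JCT = finish time - submit time
    return total / len(jobs)


toy = [("a", 0, 10), ("b", 0, 2), ("c", 0, 4)]
print("FIFO:", average_jct(toy, "fifo"), "SJF:", average_jct(toy, "sjf"))
```

On this toy workload SJF yields a lower average JCT than FIFO because the short jobs no longer wait behind the long one.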
We suggest using a conda environment to install the dependencies:
conda create -n lucid python=3.9
conda activate lucid
cd simulation
pip install -r requirements.txt
In addition, we recommend running Jupyter notebook (.ipynb) files with VSCode or JupyterLab (conda install jupyterlab).
We train the Throughput Predict Model as a reproduction example. Please follow the steps below:
- Enter the predictor folder and open the predictor.ipynb file.
- Run all cells inside the notebook. It contains the interpretable model (Primo EBM) used in Lucid and other ML baselines (LightGBM, XGBoost, Random Forest, DNN).
- Table 7: Interpretable Model Performance: Check the Result Comparison cell, where the MAE scores of all baselines are listed.
- Figure 13 (a): Throughput Predict Performance: Check the Prediction Visualization cell (or the Venus_throughput.pdf output file), where both the real and predicted throughput are plotted. The generated figures should show patterns similar to those in the paper. They are not identical because we release the Venus Job throughput prediction code, whereas the paper plots the Saturn Job throughput prediction.
- Figure 7 (a)(b): Global Model Interpretation and Learned Shape Function: Check the Model Interpretation cell (or the interpret_Venus_throughput.pdf and interpret_Venus_shapefunc.pdf output files). The generated figures should show patterns similar to those in the paper. They are not identical because we release the Venus Job throughput prediction code, whereas the paper plots the Saturn GPU throughput prediction.
More model training code is also provided (estimator/estimator_lucid.ipynb and analyzer/analyzer.py).
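Since the model comparison is reported as MAE scores, the following minimal sketch shows what that metric measures. The function name and the sample numbers are illustrative assumptions; the notebook may well compute MAE with a library routine instead.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up throughput values, for illustration only.
print(mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
```

A lower MAE means the predicted throughput stays closer to the real one on average.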
Use the following command to run all baselines simultaneously:
cd simulation
python simulator.py --sweep
The output of this script looks like this:
2022 Oct 08 14:32:57 | MainProcess | Total Job Number in Cluster Training: 23859
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13220000 | Total Job: 7603 | End job: 13 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13220000 | Total Job: 2826 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13230000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13230000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13240000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13220000 | Total Job: 2654 | End job: 1 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13240000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13250000 | Total Job: 7603 | End job: 121 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13220000 | Total Job: 1452 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13250000 | Total Job: 2826 | End job: 0 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13230000 | Total Job: 2654 | End job: 2 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13260000 | Total Job: 7603 | End job: 162 | Running job: 9 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13220000 | Total Job: 710 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13230000 | Total Job: 1452 | End job: 1 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13230000 | Total Job: 710 | End job: 0 | Running job: 1 | Pending job: 0
Similarly, use the following command to run the Lucid policy:
python simulator.py -s lucid
The output of this script looks like this:
2022 Oct 08 14:45:07 | MainProcess | Total Job Number in Cluster Training: 23859
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13220000 | Total Job: 23859 | End job: 17 | Running job: 1 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13230000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13240000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13250000 | Total Job: 23859 | End job: 136 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13260000 | Total Job: 23859 | End job: 249 | Running job: 3 | Pending job: 4 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13270000 | Total Job: 23859 | End job: 385 | Running job: 3 | Pending job: 2 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13280000 | Total Job: 23859 | End job: 589 | Running job: 2 | Pending job: 0 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13290000 | Total Job: 23859 | End job: 780 | Running job: 2 | Pending job: 0 | Avail Nodes: 2
After the program finishes, you can check the results in the log folder. The job log and time sequence of each VC are provided separately.
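If you want to track progress programmatically, the console lines shown above can be turned into records. The sketch below is an assumption based only on the sample output in this README (pipe-separated fields, then "key: value" pairs); `parse_progress_line` is a hypothetical helper and not part of the simulator.

```python
import re
from typing import Dict, Optional

# Matches fields such as "Total Job: 7603" or "Avail Nodes: 2".
_FIELD = re.compile(r"^(?P<key>[A-Za-z ]+):\s*(?P<value>-?\d+)$")


def parse_progress_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one progress line; return None for lines in another format."""
    parts = [p.strip() for p in line.split("|")]
    # Expected shape: timestamp | process | VC name | key: value | ...
    if len(parts) < 4:
        return None
    record: Dict[str, object] = {
        "timestamp": parts[0],
        "process": parts[1],
        "vc": parts[2],
    }
    for part in parts[3:]:
        match = _FIELD.match(part)
        if match is None:
            return None
        record[match.group("key")] = int(match.group("value"))
    return record


sample = (
    "2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13260000 | "
    "Total Job: 23859 | End job: 249 | Running job: 3 | Pending job: 4 | Avail Nodes: 1"
)
print(parse_progress_line(sample))
```

The keys of the returned record are the labels exactly as printed (for example "Pending job"), so the helper keeps working if extra numeric fields such as "Avail Nodes" appear.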
We provide simulation analysis and plot scripts to generate the figures shown in our paper. Please follow the steps below:
- Enter the plot folder and open the result_plot.ipynb file.
- Run all cells inside the notebook.
- Table 4: Scheduling Performance: Check the Table 4: Result Summary cell (or the result_summary.csv output file), where the Average JCT, Average Queuing Delay and Queuing Delay 99.9 Quantile of all policies are listed.
- Table 5: Scheduling Performance (workload analysis): Check the Table 5: Result Summary of Different Scales of Workloads cell, where the Average JCT and Average Queuing Delay of large and small jobs are listed.
- Figure 8: CDF of JCT: Check the Plot Result 8: JCT cell (or the result_cdf_jct.pdf output file), where the JCT CDF of all policies is plotted.
- Figure 9: Queue Time in each VC: Check the Plot Result 9: Queue Time in each VC cell (or the result_bar_queue.pdf output file), where the queuing delay of all policies is plotted.
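For readers who want to sanity-check the summary numbers, the sketch below shows one way the three Table 4 metrics could be derived from per-job timestamps. It is an assumption-laden illustration: the field names `submit`, `start` and `end` are hypothetical, and the nearest-rank quantile used here may differ slightly from whatever interpolation the notebook applies.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence


def nearest_rank_quantile(values: Sequence[float], q: float) -> float:
    """Return the q-quantile (0 < q <= 1) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def summarize(jobs: List[Dict[str, float]]) -> Dict[str, float]:
    """Summarize jobs described by hypothetical submit/start/end times (seconds)."""
    jct = [job["end"] - job["submit"] for job in jobs]      # job completion time
    queue = [job["start"] - job["submit"] for job in jobs]  # queuing delay
    return {
        "avg_jct": mean(jct),
        "avg_queue": mean(queue),
        "queue_p999": nearest_rank_quantile(queue, 0.999),
    }


# Made-up jobs, for illustration only.
example = [
    {"submit": 0, "start": 0, "end": 10},
    {"submit": 0, "start": 10, "end": 30},
    {"submit": 5, "start": 30, "end": 35},
]
print(summarize(example))
```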
The profile part contains code for profiling metrics of multiple workloads.
Note that ./result/ will be created when main_co.py or main_single.py is launched.
Running main_co.py generates the colocated jobs' metrics under ./result/colocate. Running main_single.py generates single jobs' metrics under ./result/. Some specific settings can be set in each workload's profiling file, e.g. profile_cifar.py. The output looks like this:
imagenet + imagenet
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
imagenet + cifar10
co-locate:
Files already downloaded and verified
==> Training ResNet18 model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
...
The data path storing all datasets is specified in ./workloads/settings.py as data_dir. You can also specify the total runtime of some workloads by changing total_runtime.
- CIFAR-10: The cifar10 dataset will be downloaded automatically (if it does not exist) when ./workloads/cifar/profile_cifar.py is run.
- ImageNet: The dataset is generated automatically in ./workloads/imagenet/profile_imagenet.py.
- LSUN: The dataset is generated automatically in ./workloads/dcgan/profile_dcgan.py. You can change the image size of the generated data via --imageSize. The default value is 64.
- ShapeNet: Use the following commands to download the dataset under the directory data_dir/shapenetcore/:
    wget https://shapenet.cs.stanford.edu/ericyi/shapenetcore_partanno_segmentation_benchmark_v0.zip --no-check-certificate
    unzip shapenetcore_partanno_segmentation_benchmark_v0.zip
- SQuAD: The data can be downloaded with the following link and should be saved under the data_dir/SQUAD_DIR/ directory.
- Wikitext2: The dataset can be downloaded from
  The files test.txt, train.txt and valid.txt should be saved in the data_dir/wikitext-2/ directory.
- Multi30k: First download the Moses tokenizer (http://www.statmt.org/moses/) for data preparation:
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.de
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
    sed -i "s/$RealBin\/..\/share\/nonbreaking_prefixes//" tokenizer.perl
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
  These files should be downloaded to ./workloads/translation/. Then download the data to data_dir/multi30k/:
    mkdir -p data/multi30k
    wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz && tar -xf training.tar.gz -C data/multi30k && rm training.tar.gz
    wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz && tar -xf validation.tar.gz -C data/multi30k && rm validation.tar.gz
    wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz && tar -xf mmt16_task1_test.tar.gz -C data/multi30k && rm mmt16_task1_test.tar.gz
  Preprocess the data:
    for l in en de; do for f in ~/data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done
    for l in en de; do for f in ~/data/multi30k/*.$l; do perl tokenizer.perl -a -no-escape -l $l -q < $f > $f.atok; done; done
    python preprocess.py -train_src ~/data/multi30k/train.en.atok -train_tgt ~/data/multi30k/train.de.atok -valid_src ~/data/multi30k/val.en.atok -valid_tgt ~/data/multi30k/val.de.atok -save_data ~/data/multi30k.atok.low.pt
  Referenced from: https://github.com/Eathoublu/attention-is-all-you-need-pytorch.
- MovieLens: Use the following commands to download the dataset to data_dir/ml-1m/:
    wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.test.negative
    wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.test.rating
    wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.train.rating
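Before launching the profiling scripts, it can help to confirm that the manually downloaded datasets are where the instructions above say they should be. The following is a minimal sketch, not part of the repository: `find_missing` and `EXPECTED_LAYOUT` are hypothetical names, and the layout only lists the folders and files explicitly named in this README.

```python
from pathlib import Path
from typing import Dict, List

# Folder and file names taken from the dataset instructions in this README.
# An empty list means only the folder itself is checked.
EXPECTED_LAYOUT: Dict[str, List[str]] = {
    "shapenetcore": [],
    "SQUAD_DIR": [],
    "wikitext-2": ["test.txt", "train.txt", "valid.txt"],
    "multi30k": [],
    "ml-1m": ["ml-1m.test.negative", "ml-1m.test.rating", "ml-1m.train.rating"],
}


def find_missing(data_dir: str) -> List[str]:
    """Return the expected paths (relative to data_dir) that do not exist."""
    root = Path(data_dir)
    missing: List[str] = []
    for folder, files in EXPECTED_LAYOUT.items():
        if not (root / folder).is_dir():
            missing.append(folder + "/")
            continue  # no point checking files inside a missing folder
        for name in files:
            if not (root / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing


if __name__ == "__main__":
    # Replace "." with the data_dir value from ./workloads/settings.py.
    for path in find_missing("."):
        print("missing:", path)
```

An empty result only means the named folders and files exist; it does not verify that the downloads are complete or correctly preprocessed.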