HeliosArtifact

Artifact for SC '21

This repository contains the artifact for the SC '21 paper "Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters". It includes the following four parts:

  • enviornment: The experimental environment described in Appendix: Artifact Description/Artifact Evaluation.

  • data: Helios traces downloaded from HeliosData.

  • analysis: It contains scripts for analyzing traces.

  • framework: It contains the QSSF Service and CES Service scripts.

We have updated the notebooks and scripts for all four traces!

Detailed Introduction

enviornment

Provides details on the experimental environment, as shown in Appendix: Artifact Description/Artifact Evaluation.

  • collect_environment.sh: Gathers execution environment information for the GPU compute node and the analysis platform (a sketch of the kind of information it collects follows this list).

  • env_analysis_platform: Execution environment information for the trace analysis platform.

  • env_datacenter_node: Execution environment information for a GPU compute node in our datacenter (from the Volta Cluster).

  • Summary

              Analysis Platform      Datacenter Node
    System    Ubuntu 20.04 LTS       CentOS 7.4
    CPU       Intel Core i9-10900    2 x Intel Xeon Gold 6146
    Memory    32GB DDR4              376GB DDR4
    GPU       GeForce RTX 2080 Ti    8 x Tesla V100-SXM2
    Network   Ethernet               InfiniBand EDR
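
For reference, here is a minimal, hypothetical Python sketch of the kind of information such a collection step gathers (OS, CPU, and GPU names). It is not the actual collect_environment.sh from this repository and assumes nvidia-smi is on the PATH.

    # env_summary.py -- hypothetical sketch, not the artifact's collect_environment.sh
    import platform
    import subprocess

    def gpu_names():
        """Return GPU model names via nvidia-smi, or an empty list if unavailable."""
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
                capture_output=True, text=True, check=True,
            )
            return out.stdout.strip().splitlines()
        except (FileNotFoundError, subprocess.CalledProcessError):
            return []

    if __name__ == "__main__":
        print("System:", platform.platform())    # e.g. Ubuntu 20.04 or CentOS 7.4
        print("CPU   :", platform.processor())
        print("GPUs  :", gpu_names() or "none detected")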

data

Initially, this folder does NOT exist. You need to download and unzip the dataset from HeliosData. After that, the folder structure should be:

📦data
 ┣ 📂Earth
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Saturn
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Uranus
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┗ 📂Venus
   ┣ 📜cluster_gpu_number.csv
   ┗ 📜cluster_log.csv
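
As an optional sanity check (not part of the artifact), the following sketch verifies this layout and previews one trace with pandas; it assumes the repository root as the working directory.

    # peek_data.py -- optional sanity check, not part of the artifact
    from pathlib import Path
    import pandas as pd

    DATA_DIR = Path("data")
    CLUSTERS = ["Venus", "Earth", "Saturn", "Uranus"]

    # Verify that both CSV files exist for every cluster.
    for cluster in CLUSTERS:
        for name in ("cluster_gpu_number.csv", "cluster_log.csv"):
            path = DATA_DIR / cluster / name
            assert path.exists(), f"missing {path}"

    # Preview one trace; the columns are whatever the released CSV provides.
    df = pd.read_csv(DATA_DIR / "Venus" / "cluster_log.csv")
    print(df.columns.tolist())
    print(df.head())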

analysis

Contains parsing and plotting code to analyze the traces (a small CDF sketch follows the figure list below).

  • compare with Philly trace: Figure 1: Comparisons of job characteristics between Helios and Philly.

  • cluster characterization: Figure 2: Daily pattern of the cluster usage in Helios.

    Figure 3: Monthly trends of cluster activities in Helios.

    Figure 4: The boxplot of utilization distributions for the top 10 largest VCs of Earth in May (sorted by size).

  • job characterization: Figure 5: CDF of GPU (a) and CPU (b) job duration.

    Figure 6: The CDFs of job sizes (in GPU number) with the number of jobs (a) and GPU time (b).

    Figure 7: Distribution of jobs by their final statuses.

  • user characterization: Figure 8: The CDFs of users that consume the cluster resources in terms of (a) GPU Time (b) CPU Time.

    Figure 9: (a) CDFs of users w.r.t. GPU job queuing delay. (b) Distributions of user GPU job completion ratios.
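
As a rough illustration (this is not the notebook code itself), a duration CDF like the one in Figure 5 can be computed along the following lines; the gpu_num and duration column names are assumptions about the parsed trace schema.

    # duration_cdf.py -- illustrative sketch only; column names are assumed
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("../data/Venus/cluster_log.csv")
    gpu_jobs = df[df["gpu_num"] > 0]                    # assumed column: requested GPUs
    durations = np.sort(gpu_jobs["duration"].values)    # assumed column: duration in seconds

    cdf = np.arange(1, len(durations) + 1) / len(durations)
    plt.plot(durations, cdf)
    plt.xscale("log")
    plt.xlabel("GPU job duration (s)")
    plt.ylabel("CDF")
    plt.show()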

framework

A prediction-based GPU resource management framework.

This folder contains QSSF Service and CES Service scripts and related data.

Quick Start

These scripts have been tested on Ubuntu 20.04 with Python 3.8 (on the analysis platform).

Here are the step-by-step instructions for the artifact evaluation.

Preparing

  1. Download the Helios artifact and data repositories.

    git clone git@github.com:S-Lab-System-Group/HeliosArtifact.git
    cd HeliosArtifact
    
    git clone git@github.com:S-Lab-System-Group/HeliosData.git
    unzip ./HeliosData/data.zip -d ./
  2. Check software dependencies:

    For the analysis part, JupyterLab / Jupyter Notebook is needed.

    For the other Python libraries used in this project, you can check requirements.txt.
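
    A minimal import check you can run first (the package list below is an assumption; requirements.txt is the authoritative source):

    # check_deps.py -- quick sanity check; the package list is an assumption
    import importlib.util

    packages = ["pandas", "numpy", "matplotlib", "seaborn"]
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    print("Missing packages:", ", ".join(missing) if missing else "none")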

Reproducing analysis

  1. Prepare and parse the trace files for analysis.

    cd analysis
    python ./trace_parser.py
  2. After generating all required data, you can analyze the traces through the .ipynb files within the 4 sub-folders of analysis: 1_compare with Philly trace, 2_cluster characterization, 3_job characterization, 4_user characterization.

    These Jupyter notebooks generate the figures for the trace analysis part of the paper.
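
    If you prefer to run them non-interactively, an optional sketch (not part of the artifact) is to execute every notebook in place with jupyter nbconvert:

    # run_notebooks.py -- optional convenience sketch, not part of the artifact
    import subprocess
    from pathlib import Path

    FOLDERS = [
        "1_compare with Philly trace",
        "2_cluster characterization",
        "3_job characterization",
        "4_user characterization",
    ]

    for folder in FOLDERS:
        for nb in sorted(Path(folder).glob("*.ipynb")):
            print(f"Executing {nb}")
            subprocess.run(
                ["jupyter", "nbconvert", "--to", "notebook",
                 "--execute", "--inplace", str(nb)],
                check=True,
            )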

Reproducing framework

QSSF Service

  1. Before running the simulation of the QSSF service, data preparation is needed.

    It generates the VC configuration and the job trace for each cluster.

    cd framework/QSSF\ Service/data
    bash prepare_data.sh 
  2. Then, you can run all scheduling policies on the Philly trace in sweep mode, as below:

    cd ..
    python simulator.py -e='Philly' -t='./data/Philly' --sweep 

    See run.sh for more usage examples on Helios. Note that since we do not release job name information, the estimator and qssf policy are not available for Helios.

  3. After the program finishes, you can check the results in the log folder. The job log and the time sequence of each VC are provided separately.

  4. In addition, we provide a simulation analysis and plotting script in plot.

    You can generate Figure 13 of the paper with this script.

CES Service

  1. Run the CES simulation on Helios:

    cd framework/CES\ Service
    python CES_Helios.py

    You can specify a different cluster in the script and adjust the CES service configurations via the hyperparameter function (a purely hypothetical configuration sketch appears at the end of this section).

  2. Similarly, run the CES simulation on Philly:

    python CES_Philly.py
  3. From the code output and the generated figures helios_ces (Figure 14) and philly_ces (Figure 15), you can analyze the CES service performance in detail.
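
    For orientation only, here is a purely hypothetical sketch of the kind of knobs such a hyperparameter function might expose; the names below are invented for illustration and do not match the actual CES_Helios.py.

    # Hypothetical illustration only -- these names do not match the real script.
    def hyperparameter(cluster="Venus"):
        """Return an example CES configuration for one Helios cluster."""
        return {
            "cluster": cluster,      # which Helios cluster to simulate
            "history_hours": 24,     # history window used for prediction (illustrative)
            "reserve_nodes": 2,      # spare nodes kept powered on (illustrative)
        }

    print(hyperparameter(cluster="Saturn"))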