This repository contains the code used in the ICML 2021 paper: "Boosting the Throughput and Accelerator Utilization of Specialized CNN Inference Beyond Increasing Batch Size". This source code is available under the MIT License.
- config: Configuration files used to describe each dataset/model
- data: Directory to which the CIFAR and NoScope datasets are to be saved
- datasets: Implementations of the curriculum learning and distillation datasets used in the paper
- dockerfiles: Dockerfiles used for building Docker containers to be used when running training and inference experiments
- model_files: Paths to pretrained "teacher" model for distillation in CIFAR-10.
- models: PyTorch implementation of models
- util: Utility scripts
This section describes how to obtain the datasets used in this repository.
The dataset from the game-scraping workload described in Section 2 and used in the evaluation of the paper may be downloaded from this URL: https://figshare.com/s/71fd0b25dbed73183079.
You can un-tar the dataset by running:
tar -xf game_data.tar
The resultant directory will be ~1.6 GB in size.
The NoScope videos can be downloaded from the project repository.
The code used by our project to split the dataset into training, validation, and test sets is located in util. You will need cv2 version 3.4.1 installed. For example, to generate the noscope-coral dataset, download the coral-reef-long.mp4 video from the link above and run:
cd util
python3 noscope_video_save.py /path/to/coral-reef-long.mp4 noscope-coral-frames-all
python3 noscope_split.py coral noscope-coral-subset --infile /path/to/coral-reef-long.csv --indir noscope-coral-frames-all
mv noscope-coral-subset ../data
This requires at least 100 GB of storage. You can remove the noscope-coral-frames-all directory after successfully splitting the dataset.
The CIFAR-10 and CIFAR-100 datasets will be downloaded using the torchvision CIFAR dataloader.
- NVIDIA V100 GPU (we tested on an AWS p3.2xlarge instance) or NVIDIA T4 GPU (we tested on an AWS g4dn.xlarge instance)
- NVIDIA Docker
- CUDA 10.2
- NVIDIA Driver 418.87.01
- Other requirements are satisfied by the provided Dockerfile
We recommend setting environment variables for the absolute path to this repository and for the path to the un-tarred game_data directory from above. Run the following from the root of this repository:
export FOLD_HOME=$(pwd)
export DATA_ROOT=/path/to/game_data
To build the Docker image used in evaluation, perform the following:
cd $FOLD_HOME/dockerfiles
docker build -t fold -f FoldDockerfile .
You can then launch a Docker container with this image via:
docker run -it --rm --gpus all --shm-size=1g --ulimit memlock=-1 \
--ulimit stack=67108864 --privileged=true \
-v ${FOLD_HOME}:/workspace/folding \
-v ${DATA_ROOT}:/workspace/folding/data_root \
fold:latest
You should find yourself in the /workspace directory with this repository under /workspace/folding. Navigate to /workspace/folding for the remainder of the steps.
The logic for training a FoldedCNN is in train.py and fold_trainer.py. train.py orchestrates the training of many models, and fold_trainer.py trains a single model.
train.py is currently configured to run all training experiments described in the paper. To run fewer experiments, edit the datasets_to_run and folds variables in __main__.
To train, run:
python3 train.py savedir
This will print training status like the following for each dataset and fold combination:
Epoch 0. train. Top-1=0.2070, Top-5=0.6214, Loss=2.1715: 20%|##### | 687/3438 [00:06<00:18, 149.32it/s]
The accuracies and model checkpoints from a particular run will be saved under the savedir directory passed to train.py above.
Inference experiments can be performed using the run_inference.sh script. By default, this will run 10 trials of all models, batch sizes, and fold values considered in evaluation. To change which configurations are run, edit run_inference.sh.
You can run the script as follows:
# Pipe stderr to a file to suppress an unrelated PyTorch warning about ffmpeg
./run_inference.sh results.csv 2> stderr.txt
This will write to stdout lines of the form:
Model,Trial,Fold,Batch Size,Mode,Throughput,FLOPs/sec
lol/goldnumber-fraction,1,1,1024,Original,YYY,ZZZ
where YYY and ZZZ are the throughput (in images/sec) and FLOPs/sec achieved in this particular configuration.
These results will also be saved to results.csv, the file indicated in the invocation of run_inference.sh above.
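As a rough sketch of how these results might be summarized, the following Python snippet averages throughput across trials for each configuration. It assumes the CSV contains the header line shown above; the function name mean_throughput and the sample numbers are made up for illustration and are not part of this repository.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_throughput(csv_text):
    """Average throughput over trials for each (model, fold, batch size, mode)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    groups = defaultdict(list)
    for row in reader:
        key = (row["Model"], int(row["Fold"]), int(row["Batch Size"]), row["Mode"])
        groups[key].append(float(row["Throughput"]))
    return {key: mean(values) for key, values in groups.items()}


# Made-up numbers in the documented column layout, for illustration only.
sample = """Model,Trial,Fold,Batch Size,Mode,Throughput,FLOPs/sec
lol/goldnumber-fraction,1,1,1024,Original,100.0,1e9
lol/goldnumber-fraction,2,1,1024,Original,300.0,3e9
"""

for config, throughput in mean_throughput(sample).items():
    print(config, throughput)
```

To apply this to real results, read the contents of results.csv into a string and pass it to mean_throughput in place of sample.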
The T4 GPU is known to throttle performance when it overheats. To avoid this when running experiments on the T4, we lock the GPU clock frequency (as suggested in a related NVIDIA repository). These commands are not needed for the V100 experiments:
nvidia-smi -i 0 -pm 1
sudo nvidia-smi -lgc 900 -i 0
The FLOP count of each model used in inference evaluation is calculated using thop, which is installed by the provided Dockerfile. The flop_count.sh script can be used to retrieve the FLOP count for models and fold values used in evaluation. The results of running this script are also saved in config/model_map.json. Note that the script reports the number of FLOPs performed for one input. For FoldedCNNs, one input contains f images, so the reported count is the number of FLOPs performed for f images rather than for one.
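To compare a FoldedCNN against an original CNN on a per-image basis, the reported count can be divided by f. A minimal sketch, in which the helper name flops_per_image and the example numbers are hypothetical rather than taken from this repository:

```python
def flops_per_image(flops_per_input, fold):
    """Convert a per-input FLOP count into a per-image FLOP count.

    flops_per_input: FLOPs reported for one input to the model.
    fold: the fold value f (1 for an original, unfolded CNN).
    """
    if fold < 1:
        raise ValueError("fold must be at least 1")
    return flops_per_input / fold


# Made-up FLOP count, for illustration only.
print(flops_per_image(4.0e9, 4))
```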
The script ai.py can be used to calculate the arithmetic intensities of original CNNs, FoldedCNNs, and CNNs transformed by EfficientNet-style compound scaling.
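For readers unfamiliar with the term, arithmetic intensity is the ratio of FLOPs performed to bytes moved to and from memory. The sketch below estimates it for a single convolutional layer under simplifying assumptions: stride 1, "same" padding, each tensor read or written exactly once, and 2 bytes per element. It is a generic illustration, not a reproduction of what ai.py computes, and the function name and layer sizes are hypothetical.

```python
def conv_arithmetic_intensity(batch, in_channels, out_channels, height, width,
                              kernel_size, bytes_per_element=2):
    """Estimate FLOPs per byte for one stride-1, same-padded 2D convolution."""
    outputs = batch * out_channels * height * width
    # Each output element needs in_channels * k * k multiply-adds (2 FLOPs each).
    flops = 2 * outputs * in_channels * kernel_size * kernel_size
    input_elems = batch * in_channels * height * width
    weight_elems = out_channels * in_channels * kernel_size * kernel_size
    # Assumes inputs and weights are read once and outputs are written once.
    bytes_moved = bytes_per_element * (input_elems + weight_elems + outputs)
    return flops / bytes_moved


# Hypothetical layer sizes: a larger batch amortizes the cost of loading weights.
for batch in (1, 64):
    print(batch, conv_arithmetic_intensity(batch, 64, 64, 32, 32, 3))
```

Under this model, the weight traffic is fixed while FLOPs and activation traffic grow with batch size, so the estimated intensity rises as the batch grows.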
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license.