This repository contains the code used in the ICML 2021 paper: "Boosting the Throughput and Accelerator Utilization of Specialized CNN Inference Beyond Increasing Batch Size". This source code is available under the MIT License.

Repository structure

  • config: Configuration files used to describe each dataset/model
  • data: Directory to which the CIFAR and NoScope datasets are to be saved
  • datasets: Implementations of the curriculum learning and distillation datasets used in the paper
  • dockerfiles: Dockerfiles used for building Docker containers to be used when running training and inference experiments
  • model_files: Paths to the pretrained "teacher" model used for distillation on CIFAR-10
  • models: PyTorch implementation of models
  • util: Utility scripts

Downloading datasets

This section describes how to obtain the datasets used in this repository.

Game-scraping workload

The dataset from the game-scraping workload described in Section 2 and used in the evaluation of the paper may be downloaded from this URL: https://figshare.com/s/71fd0b25dbed73183079.

You can un-tar the dataset by running:

tar -xf game_data.tar

The resultant directory will be ~1.6 GB in size.
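As a quick sanity check after extraction, the size of the un-tarred directory can be compared against the ~1.6 GB figure above. The following is a minimal sketch using only the Python standard library; the helper names `dir_size_bytes` and `dir_size_gb` are illustrative and are not part of this repository.

```python
from pathlib import Path


def dir_size_bytes(root: str) -> int:
    """Return the total size in bytes of all regular files under `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def dir_size_gb(root: str) -> float:
    """Return the total size of `root` in gigabytes (1 GB = 1e9 bytes)."""
    return dir_size_bytes(root) / 1e9
```

For example, `dir_size_gb("/path/to/game_data")` should report a value close to 1.6 if the archive was extracted completely.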


NoScope

The NoScope videos can be downloaded from the NoScope project repository. The code we use to split each video into training, validation, and test sets is located in util and requires cv2 (OpenCV) version 3.4.1. For example, to generate the noscope-coral dataset, download the coral-reef-long.mp4 video from that repository and run:

cd util
python3 noscope_video_save.py /path/to/coral-reef-long.mp4 noscope-coral-frames-all
python3 noscope_split.py coral noscope-coral-subset --infile /path/to/coral-reef-long.csv --indir noscope-coral-frames-all
mv noscope-coral-subset ../data

This process requires at least 100 GB of free storage. You can remove the noscope-coral-frames-all directory after successfully splitting the dataset.
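Because frame extraction needs a large amount of disk space, it can help to check the free space on the target filesystem before starting. The sketch below uses `shutil.disk_usage` from the Python standard library; the function name `has_free_space` and the constant `REQUIRED_GB` are illustrative and are not part of this repository.

```python
import shutil

REQUIRED_GB = 100  # storage requirement quoted in the text above


def has_free_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem containing `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9
```

Calling `has_free_space("util")` from the repository root before running the scripts gives an early warning if the disk is too small.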

CIFAR-10 and CIFAR-100

The CIFAR-10 and CIFAR-100 datasets will be downloaded using the torchvision CIFAR dataloader.
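For reference, the snippet below sketches how the torchvision CIFAR dataset classes fetch the data when `download=True` is passed. It is an illustration only: the helper name `download_cifar` is hypothetical, the `data` root is assumed from the repository structure listed above, and the training code in this repository may construct the datasets differently (for example, with transforms).

```python
def download_cifar(root: str = "data"):
    """Download the CIFAR-10 and CIFAR-100 train/test splits into `root`."""
    # Imported lazily so that defining this helper does not require torchvision.
    from torchvision import datasets

    splits = {}
    for name, cls in (("cifar10", datasets.CIFAR10), ("cifar100", datasets.CIFAR100)):
        for train in (True, False):
            key = f"{name}-{'train' if train else 'test'}"
            # download=True fetches the archive only if it is not already in root.
            splits[key] = cls(root=root, train=train, download=True)
    return splits
```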

Software and Requirements

  • NVIDIA V100 GPU (we tested on an AWS p3.2xlarge instance) or NVIDIA T4 GPU (we tested on an AWS g4dn.xlarge instance)
  • NVIDIA Docker
  • CUDA 10.2
  • NVIDIA Driver 418.87.01
  • Other requirements are satisfied by the provided Dockerfile

We recommend exporting the absolute paths to this repository and to the un-tarred game_data directory as environment variables (run the first command from the repository root):

export FOLD_HOME=$(pwd)
export DATA_ROOT=/path/to/game_data

To build the Docker image used in evaluation, perform the following:

cd $FOLD_HOME/dockerfiles
docker build -t fold -f FoldDockerfile .

You can then launch a Docker container with this image via:

docker run -it --rm --gpus all --shm-size=1g --ulimit memlock=-1 \
       --ulimit stack=67108864 --privileged=true \
       -v ${FOLD_HOME}:/workspace/folding \
       -v ${DATA_ROOT}:/workspace/folding/data_root \
       fold

You should find yourself in the /workspace directory with this repository under /workspace/folding. Navigate to /workspace/folding for the remainder of the steps.


Training

The logic for training a FoldedCNN lives in train.py and fold_trainer.py: train.py orchestrates the training of many models, while fold_trainer.py trains a single model.

train.py is currently configured to begin running all training experiments described in the paper. If you would like to run fewer training runs, you can edit the datasets_to_run and folds variables in __main__.

To train, run:

python3 train.py savedir

This will print training status like the following for each dataset and fold combination:

Epoch 0. train. Top-1=0.2070, Top-5=0.6214, Loss=2.1715:  20%|#####          | 687/3438 [00:06<00:18, 149.32it/s]

The accuracies and model checkpoints from each run are saved under the savedir directory passed to train.py above.
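If you capture the training output in a log file, the status lines can be parsed to track accuracy and loss over time. The sketch below is based solely on the example status line shown above; it assumes other phases and epochs follow the same format, and the names `STATUS_RE` and `parse_status` are illustrative rather than part of this repository.

```python
import re

# Pattern derived from the example status line; the progress bar that
# follows the metrics is ignored.
STATUS_RE = re.compile(
    r"Epoch (?P<epoch>\d+)\. (?P<phase>\w+)\. "
    r"Top-1=(?P<top1>[\d.]+), Top-5=(?P<top5>[\d.]+), Loss=(?P<loss>[\d.]+)"
)


def parse_status(line: str):
    """Return the metrics in a status line as a dict, or None if it does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match.group("epoch")),
        "phase": match.group("phase"),
        "top1": float(match.group("top1")),
        "top5": float(match.group("top5")),
        "loss": float(match.group("loss")),
    }
```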


Inference

Inference experiments are run with the run_inference.sh script. By default, it runs 10 trials of every model, batch size, and fold value considered in the evaluation. To change which configurations are run, edit run_inference.sh.

You can run the script as follows:

# Pipe stderr to a file to suppress an unrelated PyTorch warning about ffmpeg
./run_inference.sh results.csv 2> stderr.txt

This will write to stdout lines of the form:

Model,Trial,Fold,Batch Size,Mode,Throughput,FLOPs/sec

where the Throughput and FLOPs/sec fields hold the throughput (in images/sec) and FLOPs/sec achieved in that particular configuration.

These results will also be saved to results.csv, the file indicated in the invocation of run_inference.sh above.
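Since each configuration is run for several trials, a natural next step is to average the throughput across trials. The sketch below does this with the Python standard library, using the column names listed above. Whether results.csv begins with a header row is an assumption that the code tolerates either way; the function name `mean_throughput` is illustrative and not part of this repository.

```python
import csv
from collections import defaultdict
from statistics import mean

# Column order taken from the output format described above.
COLUMNS = ["Model", "Trial", "Fold", "Batch Size", "Mode", "Throughput", "FLOPs/sec"]


def mean_throughput(path: str) -> dict:
    """Average throughput over trials for each (model, fold, batch size, mode)."""
    groups = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f, fieldnames=COLUMNS):
            if row["Model"] == "Model":
                continue  # skip a header row if one is present
            key = (row["Model"], row["Fold"], row["Batch Size"], row["Mode"])
            groups[key].append(float(row["Throughput"]))
    return {key: mean(values) for key, values in groups.items()}
```

Keys are kept as strings exactly as they appear in the file, so no assumption is made about how fold values or modes are encoded.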

Notes when running inference on a T4 GPU

The T4 GPU is known to throttle performance when it overheats. To avoid this, we lock the GPU clock frequency when running experiments on the T4 (as suggested in a related NVIDIA repository). These commands are not needed for the V100 experiments:

nvidia-smi -i 0 -pm 1
sudo nvidia-smi -lgc 900 -i 0

Other contents of this repository

The FLOP count of each model used in the inference evaluation is calculated using thop, which is installed by the provided Dockerfile. The flop_count.sh script retrieves the FLOP count for the models and fold values used in evaluation; its results are also saved in config/model_map.json. Note that the script reports the number of FLOPs performed for one input. For a FoldedCNN, one input contains f images, so the reported count covers f images rather than one.
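To compare FLOP counts on a per-image basis, the reported count can be divided by the fold value f. The helper below is a minimal sketch of that arithmetic; the name `flops_per_image` is illustrative and not part of this repository.

```python
def flops_per_image(flops_per_input: float, fold: int) -> float:
    """Convert a per-input FLOP count into a per-image count.

    One input to a FoldedCNN contains `fold` images, so the per-image cost is
    the reported count divided by `fold`. For an unfolded model, fold is 1.
    """
    if fold < 1:
        raise ValueError("fold must be a positive integer")
    return flops_per_input / fold
```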

The script ai.py can be used to calculate the arithmetic intensities of original CNNs, FoldedCNNs, and CNNs transformed by EfficientNet-style compound scaling.
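For readers unfamiliar with the term, arithmetic intensity is the ratio of FLOPs performed to bytes moved to and from memory. The sketch below computes it for a single convolutional layer under simplifying assumptions: stride 1, "same" padding, each tensor read or written exactly once, and 2 bytes per element. It is a generic roofline-style estimate and is not necessarily how ai.py performs its calculation; the function name is illustrative.

```python
def conv_arithmetic_intensity(batch: int, c_in: int, c_out: int,
                              height: int, width: int, kernel: int,
                              bytes_per_elem: int = 2) -> float:
    """Estimate FLOPs per byte for one convolutional layer.

    Assumes stride 1 and "same" padding, so the output spatial size equals
    the input spatial size (height x width).
    """
    # Each output element needs c_in * kernel^2 multiply-adds (2 FLOPs each).
    flops = 2 * batch * c_out * height * width * c_in * kernel * kernel
    input_elems = batch * c_in * height * width
    weight_elems = c_out * c_in * kernel * kernel
    output_elems = batch * c_out * height * width
    bytes_moved = bytes_per_elem * (input_elems + weight_elems + output_elems)
    return flops / bytes_moved
```

Under this model, increasing the batch size raises arithmetic intensity because the cost of reading the weights is amortized over more inputs.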

