Performance Prediction of Deep Learning Models with Hardware-aware Optimizations

Environment

Initializing Environment Variables

source init.sh

Time Prediction

We train an NN-based runtime predictor for complex (non-elementwise) operators such as conv2d, linear, batchnorm, maxpool, and bmm.
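To illustrate what the input to such a predictor could look like, here is a minimal sketch that turns a conv2d configuration into numeric features. The function name `conv2d_features` and the choice of features (output size, FLOPs, log scaling) are assumptions made for illustration; they are not taken from `perfpred/predictor.py`, which may encode operators differently.

```python
import math


def conv2d_features(batch, in_ch, out_ch, height, width, kernel, stride=1, padding=0):
    """Hypothetical feature encoding for a square-kernel conv2d call.

    Uses the standard output-size formula and counts one multiply-add as
    two floating-point operations. Bias and dilation are ignored.
    """
    h_out = (height + 2 * padding - kernel) // stride + 1
    w_out = (width + 2 * padding - kernel) // stride + 1
    if h_out <= 0 or w_out <= 0:
        raise ValueError("kernel does not fit into the padded input")

    # Multiply-adds per output element: in_ch * kernel * kernel
    flops = 2 * batch * out_ch * h_out * w_out * in_ch * kernel * kernel
    in_elems = batch * in_ch * height * width
    out_elems = batch * out_ch * h_out * w_out
    weight_elems = out_ch * in_ch * kernel * kernel

    # Log scaling keeps features with very different magnitudes comparable.
    vector = [math.log2(v) for v in (flops, in_elems, out_elems, weight_elems)]
    return {"h_out": h_out, "w_out": w_out, "flops": flops, "vector": vector}


if __name__ == "__main__":
    # First layer of a ResNet-style network on a single 224x224 RGB image.
    feats = conv2d_features(1, 3, 64, 224, 224, kernel=7, stride=2, padding=3)
    print(feats["h_out"], feats["w_out"], feats["flops"])
```

A regression model would then be trained to map vectors like `feats["vector"]` to measured runtimes. Whether the actual predictor uses derived quantities such as FLOPs or only the raw operator arguments is not stated here.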

Performance Data Collection

python perfpred/measure.py [--num_gpus 1] [--use_amp] [--data-dir ./data] [--cooldown 0.01]
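As a rough picture of what a measurement loop with a cooldown might look like, consider the sketch below. It assumes that `--cooldown` is a pause in seconds between consecutive measurements, which is not stated above. The helper `measure` and its parameters `warmup` and `repeats` are hypothetical and are not the interface of `perfpred/measure.py`.

```python
import statistics
import time


def measure(fn, warmup=3, repeats=10, cooldown=0.01):
    """Return the median wall-clock time of fn() in seconds.

    Hypothetical helper: warm-up calls are discarded, then each timed call
    is followed by a short idle period before the next one starts.
    """
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        # Note: asynchronous GPU work must be synchronized before the clock
        # is read (e.g. torch.cuda.synchronize() in PyTorch); this sketch
        # only times ordinary synchronous Python callables.
        samples.append(time.perf_counter() - start)
        time.sleep(cooldown)
    return statistics.median(samples)


if __name__ == "__main__":
    elapsed = measure(lambda: sum(range(100_000)))
    print(f"median: {elapsed * 1e3:.3f} ms")
```

Taking the median rather than the mean makes the result less sensitive to occasional outliers such as scheduler hiccups.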

Model Training

python perfpred/predictor.py {conv2d,mm,batchnorm,maxpool2d,bmm}

We need a modified version of PyTorch, which makes two changes:

Alternatively, the memory predictor can run in a non-intrusive mode, although it may be less accurate than the intrusive one.

Building Fake CUDA Runtime

bash scripts/build.sh

We reimplemented DNNPerf as an important baseline for comparison.

python dnnperf/fake_runner.py 1003 115 224 set cuda api_failures stop