Environment
- OS: Ubuntu 20.04
- CUDA 11.6
- PyTorch 1.12
- Tested Hardware: RTX 2080Ti, RTX 3090, T4, V100
Initializing Environment Variables
source init.sh
We train a NN-based runtime predictor for complicated (non-elementwise) operators like conv2d
, linear
, batchnorm
, maxpool
and bmm
.
Performance Data Collection
python perfpred/measure.py [--num_gpus 1] [--use_amp] [--data-dir ./data] [--cooldown 0.01]
Model Training
python perfpred/predictor.py {conv2d,mm,batchnorm,maxpool2d,bmm}
We need a modified version of PyTorch, which makes two changes to PyTorch:
- Replace the memory allocator with a dummy allocator;
- Stop launching most kernels.
After all, we still have an option of non-intrusive mode of memory predictor. However, it might not be as accurate as the intrusive one.
Building fake cuda runtime
bash scripts/build.sh
As an important baseline we compared to, we reimplemented DNNPerf.
python dnnperf/fake_runner.py 1003 115 224 set cuda api_failures stop