This requires a nightly PyTorch build and PiPPy to be installed:
```bash
conda install pytorch pytorch-cuda=12.1 -c pytorch-nightly -c nvidia
pip install git+https://github.com/pytorch/PiPPy accelerate transformers
```
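
As a quick sanity check, you can verify that everything imports cleanly; this one-liner assumes PiPPy's import name is `pippy`, which is not stated above:

```bash
python -c "import torch, pippy, accelerate, transformers; print(torch.__version__)"
```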
Tests were run with a batch size equal to the number of GPUs (`n_gpus`). In this case they were run on two RTX 4090s, so the batch size was 2 (or 1 per GPU when the batch is split), as sketched below.
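
A minimal sketch of that convention; the variable names here are illustrative and not taken from the example scripts:

```python
import torch

# One example per GPU when the batch is split across devices,
# so the batch size matches the number of visible GPUs.
n_gpus = torch.cuda.device_count()  # 2 on a pair of 4090s
batch_size = n_gpus
```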
When Accelerate was used, a device map was generated that split the model roughly evenly across the GPUs.
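
A minimal sketch of what that looks like through Transformers, where `device_map="auto"` asks Accelerate to balance the layers across the available GPUs (the checkpoint name is illustrative, not the one used in the benchmarks):

```python
from transformers import AutoModelForCausalLM

# device_map="auto" lets Accelerate build a device map that splits
# the model roughly evenly across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",  # illustrative checkpoint
    device_map="auto",
)
print(model.hf_device_map)  # shows which modules landed on which GPU
```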
| Time Elapsed (s)  | Accelerate/Sequential | PiPPy Example | PiPPy + Accelerate (automated) |
|-------------------|-----------------------|---------------|--------------------------------|
| First batch       | 0.2986                | 0.2313        | 0.202                          |
| Avg for the rest  | 0.03167               | 0.02434       | 0.0132                         |