Efficient Loss Calculation with `all_gather` to Achieve Even More Batch Size
Closed this issue · 1 comment
trapoom555 commented
With this approach, the effective batch size scales linearly with the number of GPUs. A sketch of the idea is below.
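A minimal sketch of what this could look like, assuming PyTorch distributed training with an InfoNCE-style contrastive loss; the helper names `gather_with_grad` and `contrastive_loss` are hypothetical, not from this repo:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_with_grad(t: torch.Tensor) -> torch.Tensor:
    """All-gather a tensor from every rank, keeping gradients for the local shard."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(t) for _ in range(world_size)]
    dist.all_gather(gathered, t)       # all_gather outputs carry no autograd history
    gathered[dist.get_rank()] = t      # reinsert the local tensor so its grads flow
    return torch.cat(gathered, dim=0)

def contrastive_loss(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.07):
    """Compute the loss against the globally gathered batch instead of the local one."""
    all_k = gather_with_grad(k)                    # (world_size * B, D)
    logits = q @ all_k.t() / temperature           # local queries vs. global keys
    B = q.size(0)
    # each rank's positives sit at offset rank * B in the gathered key matrix
    labels = torch.arange(B, device=q.device) + dist.get_rank() * B
    return F.cross_entropy(logits, labels)
```

The key detail is reinserting the local tensor after the collective: `dist.all_gather` does not propagate gradients through its outputs, so each rank swaps its own shard back in to keep the local embeddings differentiable while the loss still sees the full global batch of negatives.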
trapoom555 commented
- Improved training time for 1k steps from 1:30 hrs to 15 mins
- 4× larger batch sizes (according to the number of GPUs in the system)