Distributed training no faster... just gets more conservative
After training this overnight on 10 GPUs, it's making no faster progress than when training on two.
Looking into why this is occurring, it appears that each distributed worker simply runs the same workload on different inputs, and the gradients are then averaged before the update step. In other words, as you add more GPUs, training just gets more conservative.
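For reference, this is roughly what the averaging described above amounts to (a minimal sketch, assuming PyTorch with a `torch.distributed` process group already initialised; `averaged_gradient_step` is an illustrative name, not this project's actual code):

```python
import torch
import torch.distributed as dist

def averaged_gradient_step(model, optimizer):
    """Data-parallel step: sum gradients across ranks, then divide by the
    world size, so the update magnitude does not grow with the GPU count."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size  # averaging: 10 GPUs take the same-sized step as 2
    optimizer.step()
```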
Now this can be corrected for by tweaking the learning rate, but it would be nice if that weren't required: trying to ramp up to a larger scale has effectively just wasted money (at least at the start of training, where overshoot is unlikely to be an issue).
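One common compensation (a sketch, assuming plain SGD with a fixed per-GPU batch size; `base_lr` is just an example value) is to scale the base learning rate linearly with the number of workers, which keeps the effective per-sample step roughly constant:

```python
import torch.distributed as dist

base_lr = 0.01                      # learning rate tuned for single-GPU training (example value)
world_size = dist.get_world_size()  # e.g. 10
lr = base_lr * world_size           # linear scaling with the number of workers
```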
What would be the issues with totalling gradients rather than averaging?
Summing the gradients makes the magnitude of the update depend on the batch size, so your learning rate becomes coupled to the batch size, which is not good.
https://stats.stackexchange.com/questions/358786/mean-or-sum-of-gradients-for-weight-updates-in-sgd
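To make the coupling concrete, here's a toy check (plain tensors standing in for per-worker gradients; the numbers are made up): the summed update is exactly `world_size` times the averaged one, so a learning rate tuned for 2 workers would take 5× larger steps on 10 unless it were re-tuned every time the global batch size changed.

```python
import torch

world_size = 10
grads = [torch.randn(5) for _ in range(world_size)]  # fake per-worker gradients

mean_update = torch.stack(grads).mean(dim=0)
sum_update = torch.stack(grads).sum(dim=0)

# Summing just rescales the averaged update by the number of workers,
# which is why the learning rate would have to track the batch size.
assert torch.allclose(sum_update, mean_update * world_size)
```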
Hmm, actually I've assumed a zero-mean gradient there, which won't be the case initially (it only becomes true as we converge), so my analysis might be a bit off. The point stands, though: you're already messing with another parameter (the batch size) - so it's a matter of deciding which parameter should be messed with, and perhaps documenting that better.