NVIDIA/waveglow

Distributed training no faster... just gets more conservative


After training this overnight on 10 GPUs, it's making no faster progress than when training on two.

Looking into why this is occurring, it looks like each distributed process simply runs the same workload on different inputs, and the gradients are then averaged before the optimizer step. In other words, as you add more GPUs the update just gets more conservative.
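For reference, this is roughly what that averaging step looks like in PyTorch (a minimal sketch, not the repo's exact distributed code): the gradients are all-reduced across workers and divided by the world size, so every GPU applies the same averaged update.

```python
import torch
import torch.distributed as dist

def average_gradients(model, world_size):
    """All-reduce each parameter's gradient, then divide by the number of
    workers so every GPU steps with the same averaged gradient."""
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            # Averaging: more GPUs lowers gradient variance but keeps the
            # expected magnitude the same, hence the "more conservative" feel.
            param.grad.data /= world_size
```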

Now this can be corrected for by tweaking the learning rate, but it would be nice if that weren't required, as trying to ramp this up to a larger scale has effectively just wasted money (at least at the start of training, where overshoot issues are unlikely).
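For what it's worth, the usual learning-rate tweak is the linear scaling rule: grow the LR in proportion to the effective batch size so the averaged gradient takes steps of the same size a summed one would. A rough sketch, with placeholder model and values rather than WaveGlow's actual config:

```python
import torch

model = torch.nn.Linear(8, 8)       # stand-in for the real model

base_lr = 1e-4                      # assumed LR tuned for the 2-GPU run
baseline_gpus, scaled_gpus = 2, 10

# Linear scaling rule: scale the LR with the effective batch size so the
# averaged gradient produces the same update magnitude as before.
scaled_lr = base_lr * (scaled_gpus / baseline_gpus)
optimizer = torch.optim.Adam(model.parameters(), lr=scaled_lr)
```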

What would be the issues with totalling gradients rather than averaging?

Summing the gradients makes the gradient magnitude depend on the global batch size, so your learning rate becomes coupled to the batch size, which is not good.
https://stats.stackexchange.com/questions/358786/mean-or-sum-of-gradients-for-weight-updates-in-sgd
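To make the coupling concrete, here is a toy check that, under plain SGD, summing the per-worker gradients gives exactly the same step as averaging them and multiplying the learning rate by the number of workers (random tensors stand in for real gradients):

```python
import torch

torch.manual_seed(0)
N, lr = 4, 0.1
grads = [torch.randn(3) for _ in range(N)]   # made-up per-worker gradients

w_sum = torch.zeros(3)
w_sum -= lr * sum(grads)                     # update from summed gradients

w_avg = torch.zeros(3)
w_avg -= (lr * N) * (sum(grads) / N)         # averaged gradients, LR scaled by N

print(torch.allclose(w_sum, w_avg))          # True: identical steps
```

So the choice between summing and averaging is really a choice about which knob (learning rate or batch size) absorbs the dependence on the number of workers.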

Hmm, actually I've assumed a mean of zero there, which will not initially be the case (it will become the case as we converge), so my analysis might be a bit off. The point stands, though, that you're already messing with another parameter (the batch size), so it's a matter of deciding which parameter should be adjusted and perhaps documenting that better.