Why MPI.sum in sync_grad (utils.py)

Question

sritee opened this issue 5 years ago · 1 comment

Why do you sum rather than average the gradients in sync_grads? Won't this result in a different effective learning rate depending on how many processes you run?

Answer 1 · 2019-08-26T07:20:33.000Z

@sritee Yes, it only results in a different effective learning rate. I tried both sum and average, and found that sum achieves better results. In my own opinion (which may not be correct), when we sum the gradients from all MPI workers, we get a "stronger" update direction (you can also think of it as a denoising process). In that case, we can use a "larger" learning rate to accelerate training.
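To make the scaling concrete, here is a minimal sketch that simulates the per-worker gradients with a NumPy array instead of a real MPI all-reduce. The names (`sgd_step`, `worker_grads`, `n_workers`) are hypothetical and not taken from utils.py, and it assumes a plain SGD update. Under that assumption, summing over N workers with learning rate lr gives the same step as averaging with learning rate N * lr.

```python
import numpy as np


def sgd_step(params, grad, lr):
    """Plain SGD update (an assumption; the repo's optimizer may differ)."""
    return params - lr * grad


rng = np.random.default_rng(0)
n_workers = 4
params = np.zeros(3)

# One gradient vector per simulated worker (stand-in for each MPI rank).
worker_grads = rng.normal(size=(n_workers, 3))

# What a sum-reduce and a mean-reduce across workers would produce.
summed = worker_grads.sum(axis=0)
averaged = worker_grads.mean(axis=0)

lr = 0.01
step_sum = sgd_step(params, summed, lr)
# Averaging needs the learning rate scaled by n_workers to match the sum.
step_avg_scaled = sgd_step(params, averaged, lr * n_workers)

print(np.allclose(step_sum, step_avg_scaled))
```

So with sum, adding more processes behaves like raising the learning rate in proportion to the number of workers, which matches the point made in the question.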