uoguelph-mlrg/Theano-MPI

Experience from "Accurate, Large Minibatch SGD"

hma02 opened this issue · 0 comments

hma02 commented

According to the Facebook paper, there are several implementation details to be adjusted:

  • 1. Momentum correction. In our implementation, we used equation (10) without momentum correction. We should either add the momentum correction to (10) or switch to equation (9) (see sketch 1 after this list).

  • 2. Gradient aggregation. In our implementation, we used either weight averaging (avg) or gradient summing (cdd), both of which normalize the per-worker loss by the per-worker batch size n rather than by the total minibatch size kn. We should consider averaging the gradients across workers instead and scaling up the lr (see sketch 2 after this list).

  • 3. Learning rate gradual warmup and linear scaling. The reason we didn't scale the lr up is that when I tried it, gradients exploded at the beginning of training for VGG16, even with a small number of workers. Note that gradual warmup increases the lr on every iteration, not every epoch (see sketch 3 after this list).

  • 4. Batch Normalization parameters. According to the paper: "the BN statistics should not be computed across all workers". We should explicitly exclude those BN statistics from parameter exchanging (see sketch 4 after this list).

  • 5. Use HeNormal initialization for ConvLayers and Normal for the last FCLayer, and set gamma to 0 for the last BN of each residual block (see sketch 5 after this list).

  • 6. Do multiple trials to report random variation: for each run, take the median error of the final 5 epochs, then report the mean and standard deviation of that error over 5 independent runs. Each run is 90 epochs, with the lr divided by 10 at epochs 30, 60, and 80 (see sketch 6 after this list).

  • 7. Use scale and aspect ratio data augmentation, and normalize images by the per-color (per-channel) mean and std (see sketch 7 after this list).
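
Sketch for item 1: a minimal plain-Python illustration (not Theano-MPI code, names are mine) of the two update rules from the paper and of the momentum correction factor that equation (10) needs whenever the lr changes.

```python
# g is the minibatch-averaged gradient; w, u, v can be floats or numpy arrays.

def step_eq9(w, u, g, lr, m=0.9):
    # eq. (9): u <- m * u + g;  w <- w - lr * u
    # lr multiplies the whole update, so changing lr needs no correction.
    u = m * u + g
    return w - lr * u, u

def step_eq10(w, v, g, lr, prev_lr, m=0.9):
    # eq. (10): v <- m * v + lr * g;  w <- w - v
    # The history v carries the old lr, so when the lr changes it must be
    # rescaled by the momentum correction factor lr / prev_lr.
    v = m * (lr / prev_lr) * v + lr * g
    return w - v, v
```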
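
Sketch for item 2: what the aggregation should do, assuming each worker's loss is averaged over its own batch size n. Summing the k per-worker gradients (what cdd does) leaves the update k times too large for a loss normalized by kn; dividing the sum by k restores the kn normalization, and the linearly scaled lr then applies on top. The function name is hypothetical.

```python
import numpy as np

def aggregate(per_worker_grads, mode='average'):
    # per_worker_grads: list of k gradient arrays, one per worker, each computed
    # from a loss normalized by the per-worker batch size n.
    k = len(per_worker_grads)
    total = np.sum(per_worker_grads, axis=0)
    if mode == 'average':
        return total / k          # equivalent to normalizing the loss by k*n
    return total                  # plain sum, as cdd currently does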
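
Sketch for item 3: a per-iteration schedule implementing linear scaling (target lr = base lr * k) with gradual warmup over the first 5 epochs, as in the paper. The helper name and arguments are hypothetical.

```python
def warmup_lr(iteration, iters_per_epoch, k, base_lr=0.1, warmup_epochs=5):
    # Linear scaling rule: the target lr for k workers is base_lr * k.
    # Gradual warmup: ramp from base_lr to the target over the first
    # warmup_epochs, increasing the lr on every iteration, not every epoch.
    target_lr = base_lr * k
    warmup_iters = warmup_epochs * iters_per_epoch
    if iteration < warmup_iters:
        return base_lr + (target_lr - base_lr) * float(iteration) / warmup_iters
    return target_lr
```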
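
Sketch for item 4: filter the BN running statistics out of the list of shared variables handed to the exchanger, so each worker keeps its own statistics. The name matching below is a placeholder for however our BN layers actually tag their mean/std variables.

```python
def exchangeable_params(params):
    # Keep learnable parameters; drop per-worker BN running statistics.
    bn_stat_keywords = ('running_mean', 'running_var', 'bn_mean', 'bn_std')
    def is_bn_stat(p):
        name = getattr(p, 'name', None) or ''
        return any(kw in name for kw in bn_stat_keywords)
    return [p for p in params if not is_bn_stat(p)]
```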
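
Sketch for item 5, written with Lasagne initializers for concreteness (our own model definitions would need the equivalent changes): He initialization for conv layers, N(0, 0.01) for the last FC layer, and gamma initialized to 0 for the final BN of a residual block.

```python
import lasagne
from lasagne import init, layers

net = layers.InputLayer((None, 3, 224, 224))
# HeNormal for conv layers (gain='relu' gives the sqrt(2) factor)
net = layers.Conv2DLayer(net, num_filters=64, filter_size=3, pad='same',
                         W=init.HeNormal(gain='relu'), b=None, nonlinearity=None)
# gamma = 0 for the last BN of a residual block, so the block starts as identity
net = layers.BatchNormLayer(net, gamma=init.Constant(0.))
net = layers.GlobalPoolLayer(net)
# plain Normal(0.01) for the final FC / classification layer
net = layers.DenseLayer(net, num_units=1000, W=init.Normal(std=0.01),
                        nonlinearity=lasagne.nonlinearities.softmax)
```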
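
Sketch for item 6: the step schedule and the error reporting, in NumPy; both helper names are hypothetical.

```python
import numpy as np

def stepwise_lr(epoch, scaled_lr):
    # divide the lr by 10 at epochs 30, 60 and 80 of a 90-epoch run
    return scaled_lr / (10.0 ** sum(epoch >= e for e in (30, 60, 80)))

def summarize(errors_per_run):
    # errors_per_run: 5 arrays of 90 per-epoch validation errors, one per run.
    # Per run: median error of the final 5 epochs; report mean and std over runs.
    medians = [np.median(np.asarray(e)[-5:]) for e in errors_per_run]
    return np.mean(medians), np.std(medians)
```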
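
Sketch for item 7: scale and aspect ratio augmentation (sample a patch covering part of the image area with a bounded aspect ratio, then resize it to 224x224) and per-channel mean/std normalization. The 8%-100% area and [3/4, 4/3] ratio ranges below are the usual Inception-style recipe and should be checked against the paper; the function names are mine.

```python
import numpy as np

def sample_scale_aspect_crop(height, width, rng):
    # Sample a crop box; the returned patch is meant to be resized to 224x224.
    for _ in range(10):
        area = height * width * rng.uniform(0.08, 1.0)
        ratio = rng.uniform(3.0 / 4.0, 4.0 / 3.0)
        w = int(round(np.sqrt(area * ratio)))
        h = int(round(np.sqrt(area / ratio)))
        if w <= width and h <= height:
            y = rng.randint(0, height - h + 1)
            x = rng.randint(0, width - w + 1)
            return y, x, h, w
    return 0, 0, height, width            # fallback: the whole image

def normalize(img, mean, std):
    # img: float32 HxWx3; mean/std: per-channel statistics of the training set
    return (img - mean.reshape(1, 1, 3)) / std.reshape(1, 1, 3)
```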

On the HPC side, the three-phase allreduce "NCCL(reduction) -> MPI_Allreduce -> NCCL(broadcast)" mentioned in the paper could possibly be replaced by a single NCCL2 operation. Or do we need to make a Python binding of Gloo? A sketch of the intended behaviour follows.
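
As a reference point for what the single collective should do, here is a flat allreduce written with mpi4py as a stand-in (NCCL2 or a Gloo binding would do the same over GPU buffers, and a CUDA-aware MPI could take device arrays directly). The function name is mine.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def allreduce_flat(grad_arrays):
    # Flatten all gradients into one buffer so a single collective replaces the
    # NCCL(reduce) -> MPI_Allreduce -> NCCL(broadcast) pipeline, then average.
    flat = np.concatenate([g.ravel() for g in grad_arrays])
    out = np.empty_like(flat)
    comm.Allreduce(flat, out, op=MPI.SUM)
    out /= comm.Get_size()
    # unflatten back into the original shapes
    result, offset = [], 0
    for g in grad_arrays:
        result.append(out[offset:offset + g.size].reshape(g.shape))
        offset += g.size
    return result
```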

The parallel communication idea mentioned in section 4 of the paper,

> To allow for near perfect linear scaling, the aggregation must be performed in parallel with backprop

needs support from Theano. Currently, computation and communication run serially in Theano-MPI. A conceptual sketch of the overlap follows.
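
For reference, the overlap could look roughly like this with non-blocking MPI (mpi4py's Iallreduce, MPI-3): start each layer's allreduce as soon as its gradient is ready and keep backpropagating through earlier layers. The per-layer compute_grad() hook is purely hypothetical; Theano compiles the whole backward pass into one function, which is exactly the missing piece.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def backward_with_overlap(layers_in_reverse):
    # layers_in_reverse: layers ordered from output to input, each with a
    # (hypothetical) compute_grad() returning its gradient as a numpy array.
    requests, outputs = [], []
    for layer in layers_in_reverse:
        g = layer.compute_grad()                     # backprop step for this layer
        out = np.empty_like(g)
        requests.append(comm.Iallreduce(g, out, op=MPI.SUM))
        outputs.append(out)                          # communication overlaps compute
    MPI.Request.Waitall(requests)
    return [o / comm.Get_size() for o in outputs]
```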