Cerebras/online-normalization

For data parallel training, does each GPU hold different running statistics when using online norm?

zbh2047 opened this issue · 1 comment

For batch normalization, there are two common implementations for multi-GPU training. One approach computes the mean and variance within each GPU independently. The other uses the statistics of the whole global batch, which requires cross-GPU communication to compute and aggregate the per-GPU means and variances. PyTorch implements the latter approach as SyncBatchNorm.
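For reference, the difference between the two approaches in PyTorch looks roughly like this (a minimal sketch; the model architecture and the commented-out DDP setup are placeholders, not code from this repo):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical model containing BatchNorm layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),  # with plain DDP, each rank uses its own batch statistics
    nn.ReLU(),
)

# Approach 1: wrap the model as-is -- normalization statistics stay per-GPU.
# model = DDP(model.cuda(), device_ids=[local_rank])

# Approach 2: convert BatchNorm layers to SyncBatchNorm so mean/variance are
# computed over the whole global batch via cross-GPU communication.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
# model = DDP(model.cuda(), device_ids=[local_rank])
```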

However, online norm is different from batch norm: training with batch norm does not use running statistics, while online norm does. It seems that the current implementation of online normalization follows the former approach, i.e., each GPU maintains its own running statistics. I wonder whether this could lead to instability that harms training performance, since online normalization relies on running statistics and may be less stable than batch normalization in this setting (I guess?).
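To make the concern concrete, one naive workaround would be to periodically average the per-GPU running statistics across ranks. This is only a sketch under the assumption that the online-norm layers register their running statistics as PyTorch buffers and that `torch.distributed` has been initialized; it is not something provided by this repository:

```python
import torch
import torch.distributed as dist

def sync_running_stats(module: torch.nn.Module) -> None:
    """Average all floating-point buffers (e.g. running statistics) across ranks.

    Assumes a distributed process group is initialized and that the
    normalization layers store their running statistics as registered
    buffers -- both are assumptions made for illustration only.
    """
    world_size = dist.get_world_size()
    for buf in module.buffers():
        if buf.is_floating_point():
            dist.all_reduce(buf, op=dist.ReduceOp.SUM)  # sum across all ranks
            buf.div_(world_size)                        # then take the mean

# Hypothetical usage: call every N training steps, e.g.
# if step % 100 == 0:
#     sync_running_stats(model)
```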

We did not do any multi-GPU experiments and never developed a multi-GPU training strategy.