Cerebras/online-normalization

For data parallel training, does each GPU hold different running statistics when using online norm?

zbh2047 opened this issue · 1 comment

For batch normalization, there are two common implementations for multi-GPU training. One approach computes the mean and variance within each GPU independently. The other uses the statistics of the whole global batch, which requires cross-GPU communication to compute and aggregate the per-GPU means and variances. PyTorch implements the latter approach as SyncBatchNorm.
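For reference, the difference between the two approaches in PyTorch looks roughly like this (a minimal sketch; the model architecture and the commented-out DDP setup are placeholders, not code from this repo):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical model containing BatchNorm layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),  # with plain DDP, each rank uses its own batch statistics
    nn.ReLU(),
)

# Approach 1: wrap the model as-is -- normalization statistics stay per-GPU.
# model = DDP(model.cuda(), device_ids=[local_rank])

# Approach 2: convert BatchNorm layers to SyncBatchNorm so mean/variance are
# computed over the whole global batch via cross-GPU communication.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
# model = DDP(model.cuda(), device_ids=[local_rank])
```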

However, online norm is different from batch norm: training with batch norm does not use running statistics, while online norm does. It seems that the current implementation of online normalization follows the former approach, i.e., each GPU maintains its own running statistics. I wonder whether this could lead to instability that harms training performance, since online normalization relies on running statistics and may be less stable than batch normalization in this setting (I guess?).
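To make the concern concrete, one naive workaround would be to periodically average the per-GPU running statistics across ranks. This is only a sketch under the assumption that the online-norm layers register their running statistics as PyTorch buffers and that `torch.distributed` has been initialized; it is not something provided by this repository:

```python
import torch
import torch.distributed as dist

def sync_running_stats(module: torch.nn.Module) -> None:
    """Average all floating-point buffers (e.g. running statistics) across ranks.

    Assumes a distributed process group is initialized and that the
    normalization layers store their running statistics as registered
    buffers -- both are assumptions made for illustration only.
    """
    world_size = dist.get_world_size()
    for buf in module.buffers():
        if buf.is_floating_point():
            dist.all_reduce(buf, op=dist.ReduceOp.SUM)  # sum across all ranks
            buf.div_(world_size)                        # then take the mean

# Hypothetical usage: call every N training steps, e.g.
# if step % 100 == 0:
#     sync_running_stats(model)
```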

We did not do any multi-GPU experiments and never developed a multi-GPU training strategy.