Different understanding of `Shuffling BN`
wandering007 opened this issue · 2 comments
From the original paper,
For the key encoder f_k, we shuffle the sample order in the current mini-batch before distributing it among GPUs (and shuffle back after encoding); the sample order of the mini-batch for the query encoder f_q is not altered.
I understand that the BNs in the key encoder do not have to be modified if the inputs to the network are already shuffled.
However, I cannot understand why this trick works. It would be appreciated if you could share your understanding of it.
Hi, thank you for the question.
This repo is not aiming at reproducing the MoCo paper; it's just a tool that may be useful to other self-supervised learning researchers. If you want a MoCo reimplementation, I suggest https://github.com/HobbitLong/CMC/blob/master/train_moco_ins.py
This trick eliminates the information leak between samples in the same batch. The task in MoCo (and many other contrastive-loss works) is to discriminate one sample from a batch of samples. Ideally, when the model learns its features, it should not 'see' the other samples in the same batch (i.e., each sample's feature should be extracted independently); otherwise the model can learn a shortcut solution that fits the contrastive task without learning good features. However, BatchNorm breaks this independence, because each sample's output depends on the statistics of the whole batch.
This problem was mentioned in CPC (page 6, Sect. 3.2). In CPCv2 (page 6), they use LayerNorm instead; in DPC (page 4), they find BN doesn't affect performance much. But MoCo's solution is clean and simple.
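Here is a minimal numpy sketch of both points (this is my own toy illustration, not MoCo's actual distributed implementation; the `batchnorm` helper and the 2-way split standing in for multiple GPUs are assumptions for demonstration). It shows that a sample's BN output depends on the *other* samples in its batch (the leak), and how shuffling before splitting across "GPUs" and un-shuffling afterward decorrelates which samples share statistics:

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    # Train-time BN: normalize each feature over the batch dimension.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))          # batch of 8 samples, 4 features each

# --- The leak: sample 0's BN output changes when its batchmates change ---
out_a = batchnorm(x)[0]
x_b = x.copy()
x_b[1:] += 10.0                      # perturb every sample EXCEPT sample 0
out_b = batchnorm(x_b)[0]
assert not np.allclose(out_a, out_b)  # same input, different output: leak

# --- Shuffling BN (sketch): shuffle the batch, split it across "GPUs" so
# each device normalizes over a random subset, then restore the order ---
perm = rng.permutation(len(x))       # shuffle the sample order
inv = np.argsort(perm)               # inverse permutation to shuffle back
shuffled = x[perm]
chunks = np.split(shuffled, 2)       # pretend we have 2 GPUs
encoded = np.concatenate([batchnorm(c) for c in chunks])
keys = encoded[inv]                  # shuffle back after encoding
assert keys.shape == x.shape
```

The query encoder sees the un-shuffled order, so a query's batchmates in its BN statistics are not the same set of samples as its key's batchmates, and batch statistics stop being a usable shortcut signal.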
What do you mean by "it should not 'see' other samples"? With BN, the output is (X - X.mean()) / X.std(), and that output does not contain any information about the location of the positive sample. Why, then, does BN prevent a good result?