Question regarding BatchNorm1dNoBias
nimaous opened this issue · 4 comments
Hi
Thanks for your nice work. Could you explain the purpose of adding BatchNorm1dNoBias after the last linear layer of the projection head (https://github.com/AndrewAtanov/simclr-pytorch/blob/master/models/encoder.py#L40)?
Hi Nima,
Thanks for the interest. I cannot recall where the reference for that is, or maybe I'm confused and there is none. BN without the bias term is also used in the official implementation [1], and I guess that is where I found it.
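For reference, here is a minimal sketch of what such a layer can look like in PyTorch. This mirrors the idea behind the BatchNorm1dNoBias in encoder.py; the exact implementation in the repo may differ:

```python
import torch.nn as nn

class BatchNorm1dNoBias(nn.BatchNorm1d):
    """BatchNorm1d whose additive (beta) term is frozen at zero,
    so the layer only rescales features and keeps them zero-centered."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # beta stays at its zero init and receives no gradient updates
        self.bias.requires_grad = False
```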
My intuition is as follows: the contrastive NT-Xent loss uses cosine similarity, so image embeddings are treated as centered at 0. A bias term b would break this assumption and introduce additive terms into the embedding interaction: the dot product of the shifted embeddings expands as (x + b)^T (y + b) = x^T y + (x + y)^T b + b^T b. I'm not sure how important this detail is for pretraining; it doesn't seem like a hard constraint from a theoretical point of view. If I recall correctly, though, there were issues in practice when pretraining with the bias term.
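To make the additive terms concrete, a quick numeric check of the identity above (random vectors, purely illustrative):

```python
import torch

torch.manual_seed(0)
x, y, b = torch.randn(3), torch.randn(3), torch.randn(3)

# Dot product of the bias-shifted embeddings ...
lhs = torch.dot(x + b, y + b)
# ... equals the clean interaction plus the bias-dependent terms:
rhs = torch.dot(x, y) + torch.dot(x + y, b) + torch.dot(b, b)
print(torch.allclose(lhs, rhs))  # True
```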
Andrew, thanks a lot for your reply. I understand why bias=False is set, and your intuition makes complete sense to me. But my question is: why use a BN as the last layer instead of just a linear layer? Doesn't the cosine similarity already do the centering and map the latent space onto a unit hypersphere?
Oh, I see. I think this is in part to avoid feature collapse, where all the vectors end up pointing in almost the same direction. This effect is studied in [1], for example.
[1] https://generallyintelligent.ai/understanding-self-supervised-contrastive-learning.html
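As a toy illustration of the collapse effect discussed in [1], one can measure the average pairwise cosine similarity of a batch of embeddings; values near 1 mean all vectors point in nearly the same direction. This diagnostic is my own sketch, not part of the repo:

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(z: torch.Tensor) -> torch.Tensor:
    """Average off-diagonal cosine similarity of a batch of embeddings.
    Values close to 1 indicate the collapse discussed above."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t()  # (N, N) matrix of cosine similarities
    n = z.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]
    return off_diag.mean()

# Collapsed embeddings: one shared direction plus tiny noise
base = torch.randn(1, 128)
collapsed = base + 0.01 * torch.randn(256, 128)
print(mean_pairwise_cosine(collapsed))  # close to 1.0

# Healthy embeddings: independent random directions
healthy = torch.randn(256, 128)
print(mean_pairwise_cosine(healthy))    # close to 0.0
```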
Thanks, that was helpful.