Question regarding BatchNorm1dNoBias
nimaous opened this issue · 4 comments
Hi
Thanks for your nice work. Could you explain the purpose of adding BatchNorm1dNoBias after the last linear layer of the projection head (https://github.com/AndrewAtanov/simclr-pytorch/blob/master/models/encoder.py#L40)?
Hi Nima,
Thanks for the interest. I cannot recall where the reference for that is, or maybe I'm confused and there is none. BN without the bias term is also used in the official implementation [1], and I guess that is where I found it.
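For reference, here is a minimal sketch of what such a layer can look like in PyTorch. This mirrors the idea behind the BatchNorm1dNoBias in encoder.py; the exact implementation in the repo may differ:

```python
import torch.nn as nn

class BatchNorm1dNoBias(nn.BatchNorm1d):
    """BatchNorm1d whose additive (beta) term is frozen at zero,
    so the layer only rescales features and keeps them zero-centered."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # beta stays at its zero init and receives no gradient updates
        self.bias.requires_grad = False
```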
My intuition is as follows: the contrastive NT-Xent loss uses cosine similarity, so image embeddings are treated as centered at 0. A bias term b would break this assumption and introduce additive terms into the embedding interaction: the dot product of the shifted embeddings expands as (x + b)^T (y + b) = x^T y + (x + y)^T b + b^T b. I'm not sure how important this detail is for pretraining; it doesn't seem like a hard constraint from a theoretical point of view. If I recall correctly, though, there were issues in practice when pretraining with the bias term.
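To make the additive terms concrete, a quick numeric check of the identity above (random vectors, purely illustrative):

```python
import torch

torch.manual_seed(0)
x, y, b = torch.randn(3), torch.randn(3), torch.randn(3)

# Dot product of the bias-shifted embeddings ...
lhs = torch.dot(x + b, y + b)
# ... equals the clean interaction plus the bias-dependent terms:
rhs = torch.dot(x, y) + torch.dot(x + y, b) + torch.dot(b, b)
print(torch.allclose(lhs, rhs))  # True
```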
Andrew, thanks a lot for your reply. I understand why bias=False is set, and your intuition makes complete sense to me. But my question is: why use a BN as the last layer instead of just a linear layer? Doesn't the cosine similarity already do the centering and map the latent space onto a unit hypersphere?
Oh, I see. I think this is in part to avoid feature collapse, where all the vectors end up pointing in almost the same direction. This effect is studied in [1], for example.
[1] https://generallyintelligent.ai/understanding-self-supervised-contrastive-learning.html
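As a toy illustration of the collapse effect discussed in [1], one can measure the average pairwise cosine similarity of a batch of embeddings; values near 1 mean all vectors point in nearly the same direction. This diagnostic is my own sketch, not part of the repo:

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(z: torch.Tensor) -> torch.Tensor:
    """Average off-diagonal cosine similarity of a batch of embeddings.
    Values close to 1 indicate the collapse discussed above."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t()  # (N, N) matrix of cosine similarities
    n = z.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]
    return off_diag.mean()

# Collapsed embeddings: one shared direction plus tiny noise
base = torch.randn(1, 128)
collapsed = base + 0.01 * torch.randn(256, 128)
print(mean_pairwise_cosine(collapsed))  # close to 1.0

# Healthy embeddings: independent random directions
healthy = torch.randn(256, 128)
print(mean_pairwise_cosine(healthy))    # close to 0.0
```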
Thanks, that was helpful.