ae-foster/pytorch-simclr

Question about cutting gradient

Closed this issue · 3 comments

Hello, and thanks for your work. I would like to ask about the "on-the-fly" linear evaluation that you run after each pre-training epoch. In the original TensorFlow implementation, the gradient is cut during linear evaluation training so that the label signal is not backpropagated into the ResNet-50. In PyTorch this can be done with detach(). I see that you instead create a separate optimizer for the linear evaluation and pass only the parameters of the linear evaluation layer to it, which means only those parameters are updated. That makes sense, but have you verified that this is sufficient, i.e. that using a separate optimizer actually prevents gradients from flowing back into the ResNet-50?
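To illustrate the concern, here is a minimal, hypothetical sketch (the module and variable names are illustrative, not taken from this repo): a head-only optimizer controls which parameters get updated, but without detach() the backward pass still computes gradients for the encoder.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(32, 16)          # stand-in for the ResNet-50 backbone
linear_eval = nn.Linear(16, 10)      # stand-in for the linear evaluation head
opt = torch.optim.SGD(linear_eval.parameters(), lr=0.1)  # head-only optimizer

x, y = torch.randn(4, 32), torch.randint(0, 10, (4,))
features = encoder(x)                # no .detach(): graph reaches the encoder
loss = nn.functional.cross_entropy(linear_eval(features), y)
loss.backward()

print(encoder.weight.grad is not None)   # True: the encoder received gradients
# features = encoder(x).detach() would leave encoder.weight.grad as None
```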

Hi @fawazsammani, thanks for your question. Rest assured, no gradients are backpropagating to the ResNet. The training set that we use for L-BFGS is created inside a torch.no_grad() block. See here:

```python
with torch.no_grad():
```
If we were backpropagating gradients for all 50,000 training examples into the ResNet, we would definitely hit an out-of-memory error!
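In outline, the evaluation step looks something like the following minimal sketch (function and variable names here are illustrative; see the repository for the actual code): features are computed under torch.no_grad(), so no graph is built and L-BFGS can only update the linear classifier.

```python
import torch
import torch.nn as nn

def encode_train_set(loader, encoder, device):
    """Compute fixed features for the linear-evaluation training set."""
    encoder.eval()
    feats, labels = [], []
    with torch.no_grad():                      # no graph, no encoder gradients
        for x, y in loader:
            feats.append(encoder(x.to(device)))
            labels.append(y.to(device))
    return torch.cat(feats), torch.cat(labels)

def train_linear_clf(feats, labels, num_classes, device, steps=100):
    """Fit a linear head on the frozen features with L-BFGS."""
    clf = nn.Linear(feats.size(1), num_classes).to(device)
    opt = torch.optim.LBFGS(clf.parameters(), lr=1.0, max_iter=steps)

    def closure():
        opt.zero_grad()
        loss = nn.functional.cross_entropy(clf(feats), labels)
        loss.backward()                        # only clf parameters get gradients
        return loss

    opt.step(closure)
    return clf
```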

I see. Sorry, I didn't look at that portion of the code. Many thanks for your reply and your work.

No problem! Thanks for your interest.