valentinovito/Asymmetric-CL

Keep getting NaN loss

Closed this issue · 5 comments

I keep getting a NaN contrastive loss. Is there a specific format in which the labels and features have to be passed? I tried both:

features.shape = (B, num_features)  # num_features is 640
labels.shape = (B, 1)               # class number
features.shape = (B, num_features)  # num_features is 640
labels.shape = (B,)                 # class number

I tested the code using a main function:

import torch

# NOTE: asy_focal_SupConLoss is the loss function provided by this repository;
# import it from wherever it is defined in your project.

if __name__ == "__main__":
    # Example inputs
    batch_size = 64
    embedding_dim = 128
    num_classes = 10

    # Generate random features and labels for testing
    features = torch.rand((batch_size, embedding_dim))
    labels = torch.randint(0, num_classes, (batch_size,))

    # Call the loss function with different hyperparameters
    loss = asy_focal_SupConLoss(features, labels, temp=0.7)  # non-NaN loss, ~0.78
    print("Loss:", loss.item())
    loss = asy_focal_SupConLoss(features, labels, gamma=7, eta=250, temp=0.7)  # NaN loss
    print("Loss:", loss.item())
    loss = asy_focal_SupConLoss(features, labels, gamma=7, eta=250, temp=0.9)  # non-NaN loss, ~0.45
    print("Loss:", loss.item())
    loss = asy_focal_SupConLoss(torch.zeros((batch_size, embedding_dim)), labels, gamma=7, eta=250, temp=0.07)  # non-NaN loss, ~3.909
    print("Loss:", loss.item())

This seems like very erratic and odd behavior. The temperature is suggested to be between 0.07 and 0.2, but those values always lead to a NaN loss (although it does work when the features are all zero).
Is the implementation even correct? Should I try modifying some hyperparameters? Do the features need to be in a specific small range, like 0 to 0.2?

Furthermore, increasing the embedding dimension to 640 also causes NaN losses in all but the last statement. Output of main with embedding_dim = 640:

nan contrastive loss
Loss: 0.0
nan contrastive loss
Loss: 0.0
nan contrastive loss
Loss: 0.0
Loss: 3.9059367179870605

Checking my own features in our experiments with FMNIST, the features range from -0.2 to 0.2, with embedding_dim = 128 and temp = 0.07. This produces no NaN losses, so I recommend scaling your features to this range to avoid the NaN error. Alternatively, you can either decrease embedding_dim or increase temp.
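As a rough illustration (not code from the repository; rescale_features is just a hypothetical helper), such a rescaling could look like this:

import torch

def rescale_features(features: torch.Tensor, bound: float = 0.2) -> torch.Tensor:
    # Hypothetical helper: shrink features so the largest absolute value equals `bound`
    max_abs = features.abs().max().clamp_min(1e-12)  # avoid division by zero
    return features * (bound / max_abs)

# features = rescale_features(features)  # then pass the rescaled features to the loss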

You may find this page useful if you keep having issues: HobbitLong/SupContrast#104. That link leads to a discussion of the same NaN issue in the original supervised contrastive loss.

Thank you. Sorry I didn't update here earlier, but I read the paper, and they L2-normalize the embedding vectors. This worked and produced non-NaN losses.
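For anyone hitting the same problem, a minimal sketch of that normalization step (assuming the features come straight out of the backbone before being passed to the loss):

import torch
import torch.nn.functional as F

features = torch.rand((64, 640))              # raw backbone embeddings
features = F.normalize(features, p=2, dim=1)  # L2-normalize each embedding to unit length
# loss = asy_focal_SupConLoss(features, labels, gamma=7, eta=250, temp=0.07)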

Another issue I am facing now is that the loss converges to around 0.5 in the first epoch and then more or less stays there for the rest of the run. Perhaps this is because I am using a pretrained backbone, but logistic loss (cross-entropy) keeps decreasing throughout 20 or even 100 epochs, whereas this contrastive loss just fluctuates around a single value with little change after the first epoch. Is this supposed to happen?

In our binary classification trials, the loss steadily goes down after each epoch (in one experiment, the loss is around 4.15 after one epoch and around 3.09 after 20 epochs). I am not sure what exactly causes your problem, but I recommend tuning the learning rate to see if the issue persists.
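A minimal sketch of such a learning-rate sweep (the linear `model` here is only a stand-in for the real backbone):

import torch

model = torch.nn.Linear(640, 128)  # stand-in for the actual backbone

for lr in (1e-2, 1e-3, 1e-4):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    # ... train for a few epochs with each setting and compare how the contrastive loss evolves ...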

Since your loss is around 0.5, I assume you are handling a multi-class classification problem. I admit that we have not experimented much with multi-class classification, so the suggested hyperparameters may not be suitable for your case. I suggest trying the original SupCon loss first (by taking gamma = 0, eta = 0, and temp = 0.07) if you have not done so. It may also be the case that cross-entropy is more suitable than SupCon for your particular problem.
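Concretely, assuming the same call signature as in the main function above, that fallback is:

loss = asy_focal_SupConLoss(features, labels, gamma=0, eta=0, temp=0.07)  # reduces to the original SupCon loss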

Yeah, my task is recognising Chinese characters (single-character classification). I will try what you recommended, but with the original NT-Xent loss I had similar results. I think it may be because the pre-trained network is already good enough at separating the embeddings, whereas training from scratch would perhaps cause the loss to steadily go down.

I understand your loss converged to 3.09 after 20 epochs; did you train your model from scratch?

Thanks a lot for the help btw :)

I used a pre-trained ResNet-50 backbone for training, so not from scratch. The loss tends to be greater if you have fewer classes (try setting num_classes = 2 inside the main function, for example).

You are welcome, and I hope you are successful.