nitishgupta/nmn-drop

Loss tensor without grad_fn arises during training

dcasbol opened this issue · 6 comments

After following the instructions and pip-installing the modified version of allennlp, I was able to run the training script, but it led to CUDA out-of-memory errors on my GPU. I brought the batch size down from 4 to 3 and training runs, but I get the following error around 7% of the way into the first epoch:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Is this somehow related to the batch size, or is there a way to solve it?

Note: I'm running this with CUDA 10.1

@dcasbol Hi! I came across the same problem and was wondering whether you managed to find a solution. I'm struggling to work out what the possible causes are and would greatly appreciate any hint.

@Ramil0 I haven't found a solution so far. Fortunately, my research ended up not depending on it, but I would guess that some parts of the code are hardcoded to work with that batch size.

@dcasbol Thank you for your reply! This is strange: I didn't change the batch size, and I get this error even with the default numbers in the config.

That's new information then! It doesn't depend on the batch size. It must instead have to do with the fact that some batches don't carry any supervision. Maybe the authors ran this on an older/newer version of PyTorch that just skips such batches instead of raising the exception? If you're currently working on this, you could try to detect that scenario before the loss is returned and substitute it with something like:

dummy_loss = torch.tensor(0.0, dtype=torch.float, requires_grad=True)

and see whether it behaves in a reasonable way.
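For context, here is a minimal sketch of what that guard could look like. The helper name safe_loss and the idea of wrapping the loss just before it is returned from the model's forward pass are assumptions for illustration, not the repo's actual code:

import torch

def safe_loss(loss: torch.Tensor) -> torch.Tensor:
    """Return `loss` unchanged if it participates in autograd; otherwise
    substitute a zero-valued dummy so the trainer's backward() call does
    not raise "element 0 of tensors does not require grad"."""
    if loss.requires_grad:
        return loss
    # Hypothetical fallback: a leaf tensor with requires_grad=True makes
    # backward() succeed but produces no parameter gradients, effectively
    # skipping the unsupervised batch.
    return torch.tensor(0.0, dtype=torch.float,
                        device=loss.device, requires_grad=True)

Whether silently skipping such batches is the behaviour the authors intended is another question, but it would at least let training continue past the offending batch.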

Sorry, I completely missed replying to this issue.

TBH, I don't know why this error occurs. I've spent quite a lot of time trying to fix it but couldn't; maybe I'm missing something trivial. I had a thought similar to what @dcasbol suggested, but couldn't get it to work. If you restart training, you shouldn't encounter the error, which to me suggests it has something to do with CUDA randomization.

@dcasbol Thank you, I'll try what you suggested.

@nitishgupta Thanks! I've already restarted the training, so there's hope that I won't see the same error again :)