mrqa/MRQA-Shared-Task-2019

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

mingzhu-wu opened this issue · 5 comments

Hi,
I got an error while training the MTBert baseline without the SQuAD dataset. All other settings are the same as in the example command, except for the value of t_total.

The error is as follows:
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/allennlp/commands/train.py", line 243, in train_model
metrics = trainer.train()
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/allennlp/training/trainer.py", line 480, in train
train_metrics = self._train_epoch(epoch)
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/allennlp/training/trainer.py", line 327, in _train_epoch
loss.backward()
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/torch/autograd/init.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

The same error happens when I try to add additional training examples.
I would really appreciate knowing the reason for this error and how to fix it.
Thanks in advance.

Hi xfwmz, you referred to ZHO9504's problem. Do you also experience this only on 2 GPUs? What is your running command? Which version of AllenNLP do you use? (Also, the baseline model code was updated in the last few weeks, so please make sure you are using the latest code.)

I got this error too.

Can you please send me the allennlp command you used here? Thanks.

Hi alontalmor, I reproduced this error with the following command:
python -m allennlp.run train https://multiqa.s3.amazonaws.com/config/MRQA_BERTbase.json -s MRQA-Shared-Task-2019/Models/MultiTrain -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': 'https://mrqa.s3.us-east-2.amazonaws.com/data/train/NewsQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/train/HotpotQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/train/SearchQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/train/TriviaQA-web.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/train/NaturalQuestionsShort.jsonl.gz', 'validation_data_path': 'https://mrqa.s3.us-east-2.amazonaws.com/data/dev/SQuAD.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/NewsQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/HotpotQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/SearchQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/TriviaQA-web.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/NaturalQuestionsShort.jsonl.gz', 'trainer': {'cuda_device': 0, 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '120000'}}}" --include-package mrqa_allennlp

I was running this command on 1 GPU with AllenNLP version 0.8.4. The error appeared after 8 hours of training, when the EM score reached 54, as shown below:
EM: 54.6440, f1: 65.0914, qas_used_fraction: 1.0000, loss: 4.2103 ||: : 59878it [7:06:57, 2.50it/s]
EM: 54.6462, f1: 65.0929, qas_used_fraction: 1.0000, loss: 4.2101 ||: : 59906it [7:07:08, 2.54it/s]
Traceback (most recent call last):
File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/run.py", line 21, in
run()
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/commands/init.py", line 102, in main
args.func(args)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
args.cache_prefix)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
cache_directory, cache_prefix)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/commands/train.py", line 243, in train_model
metrics = trainer.train()
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/training/trainer.py", line 480, in train
train_metrics = self._train_epoch(epoch)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/training/trainer.py", line 327, in _train_epoch
loss.backward()
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/torch/autograd/init.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I found the bug that causes this RuntimeError. It happens when a batch contains only one training example: the expression "len(np.argwhere(span_start.squeeze().cpu() >= 0)) > 0" in BERT_QA.py line 91 then evaluates to false even though the gold answer is given. In that case the output loss is zero, a constant with no grad_fn, and backward() fails with this error.
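To illustrate the shape problem (a minimal sketch with made-up span indices, not the repository's data): squeeze() with no arguments collapses a single-example batch to a 0-d scalar, which is what trips up the argwhere-based check.

import torch

# span_start is assumed to arrive with shape (batch_size, 1); -1 would mark "no gold answer"
multi = torch.tensor([[12], [7]])   # batch of two examples
single = torch.tensor([[12]])       # batch of one example

print(multi.squeeze().shape)   # torch.Size([2]) -> 1-d, the check behaves as expected
print(single.squeeze().shape)  # torch.Size([])  -> 0-d scalar, the check can evaluate to false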

I fixed this bug by changing the expression from "if span_start is not None and len(np.argwhere(span_start.squeeze().cpu() >= 0)) > 0:" to "if span_start is not None and len(np.argwhere(span_start.squeeze(-1).squeeze(-1).cpu() >= 0)) > 0:". It works fine now.
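An alternative that avoids the numpy round-trip altogether is a pure-torch check (a sketch only; has_gold_answer is a hypothetical helper, not the repository's code), which keeps the tensor 1-d regardless of batch size:

import torch

def has_gold_answer(span_start):
    # view(-1) keeps a 1-d tensor even for a single example,
    # so .any() behaves the same for every batch size
    return span_start is not None and bool((span_start.view(-1) >= 0).any())

print(has_gold_answer(torch.tensor([[12]])))       # True  (batch of one, answer present)
print(has_gold_answer(torch.tensor([[-1]])))       # False (batch of one, no answer)
print(has_gold_answer(torch.tensor([[3], [-1]])))  # True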