mrqa/MRQA-Shared-Task-2019

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

mingzhu-wu opened this issue · 5 comments

Hi,
I got an error while training the MTBert baseline without the SQuAD dataset. All other settings are the same as in the example command, except for the value of t_total.

The error is as follows:
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/allennlp/commands/train.py", line 243, in train_model
metrics = trainer.train()
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/allennlp/training/trainer.py", line 480, in train
train_metrics = self._train_epoch(epoch)
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/allennlp/training/trainer.py", line 327, in _train_epoch
loss.backward()
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ukp/mwu/MRQA-Shared-Task-2019/baseline/venv-3.6/lib/python3.6/site-packages/torch/autograd/init.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

The same error happens when I try to add additional training examples.
I would really appreciate knowing the reason for this error and how to fix it.
Thanks in advance.

Hi xfwmz, you referred to ZHO9504's problem. Do you also experience this only on 2 GPUs? What is your running command? Which version of AllenNLP do you use? (Also, the baseline model code was updated in the last few weeks, so please make sure you are using the latest code.)

I got this error too.

Can you please send me the allennlp command you used here? Thanks.

Hi alontalmor, I reproduced this error with the following command:
python -m allennlp.run train https://multiqa.s3.amazonaws.com/config/MRQA_BERTbase.json -s MRQA-Shared-Task-2019/Models/MultiTrain -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': 'https://mrqa.s3.us-east-2.amazonaws.com/data/train/NewsQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/train/HotpotQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/train/SearchQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/train/TriviaQA-web.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/train/NaturalQuestionsShort.jsonl.gz', 'validation_data_path': 'https://mrqa.s3.us-east-2.amazonaws.com/data/dev/SQuAD.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/NewsQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/HotpotQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/SearchQA.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/TriviaQA-web.jsonl.gz,https://mrqa.s3.us-east-2.amazonaws.com/data/dev/NaturalQuestionsShort.jsonl.gz', 'trainer': {'cuda_device': 0, 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '120000'}}}" --include-package mrqa_allennlp

I was running this command on 1 GPU with AllenNLP version 0.8.4. The error appeared after 8 hours of training, when the EM score reached 54, as shown below:
EM: 54.6440, f1: 65.0914, qas_used_fraction: 1.0000, loss: 4.2103 ||: : 59878it [7:06:57, 2.50it/s]
EM: 54.6462, f1: 65.0929, qas_used_fraction: 1.0000, loss: 4.2101 ||: : 59906it [7:07:08, 2.54it/s]
Traceback (most recent call last):
File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/run.py", line 21, in
run()
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/commands/init.py", line 102, in main
args.func(args)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
args.cache_prefix)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
cache_directory, cache_prefix)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/commands/train.py", line 243, in train_model
metrics = trainer.train()
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/training/trainer.py", line 480, in train
train_metrics = self._train_epoch(epoch)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/allennlp/training/trainer.py", line 327, in _train_epoch
loss.backward()
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/ukp-storage-1/mwu/venv-mrqa/lib64/python3.6/site-packages/torch/autograd/init.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I found the bug that causes this RuntimeError. It happens when a batch contains only one training example: the expression "len(np.argwhere(span_start.squeeze().cpu() >= 0)) > 0" in BERT_QA.py line 91 then evaluates to false even though the gold answer is given. In that case the output loss is zero, a constant with no grad_fn, and backward() fails with this error.
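To illustrate the shape problem (a minimal sketch with made-up span indices, not the repository's data): squeeze() with no arguments collapses a single-example batch to a 0-d scalar, which is what trips up the argwhere-based check.

import torch

# span_start is assumed to arrive with shape (batch_size, 1); -1 would mark "no gold answer"
multi = torch.tensor([[12], [7]])   # batch of two examples
single = torch.tensor([[12]])       # batch of one example

print(multi.squeeze().shape)   # torch.Size([2]) -> 1-d, the check behaves as expected
print(single.squeeze().shape)  # torch.Size([])  -> 0-d scalar, the check can evaluate to false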

I fixed this bug by changing the expression from "if span_start is not None and len(np.argwhere(span_start.squeeze().cpu() >= 0)) > 0:" to "if span_start is not None and len(np.argwhere(span_start.squeeze(-1).squeeze(-1).cpu() >= 0)) > 0:". It works fine now.
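An alternative that avoids the numpy round-trip altogether is a pure-torch check (a sketch only; has_gold_answer is a hypothetical helper, not the repository's code), which keeps the tensor 1-d regardless of batch size:

import torch

def has_gold_answer(span_start):
    # view(-1) keeps a 1-d tensor even for a single example,
    # so .any() behaves the same for every batch size
    return span_start is not None and bool((span_start.view(-1) >= 0).any())

print(has_gold_answer(torch.tensor([[12]])))       # True  (batch of one, answer present)
print(has_gold_answer(torch.tensor([[-1]])))       # False (batch of one, no answer)
print(has_gold_answer(torch.tensor([[3], [-1]])))  # True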