Question answering example throws an exception even if sanity check is skipped
Pointy-Hat opened this issue · 10 comments
🐛 Bug
Running the squad example python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0
throws an exception while finalizing training. This is not a duplicate of #218.
To Reproduce
Steps to reproduce the behavior:
- Run
python train.py task=nlp/question_answering dataset=nlp/question_answering/squad trainer.gpus=[1] training.batch_size=8 trainer.num_sanity_val_steps=0
- See error
Epoch 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 12442/12445 [44:35<00:00, 4.65it/s, loss=0.957
Error executing job with overrides: ['task=nlp/question_answering', 'dataset=nlp/question_answering/squad', 'trainer.gpus=[1]', 'training.batch_size=8', 'trainer.num_sanity_val_steps=0']
Traceback (most recent call last):
File "/home/vrt/lightning-transformers/train.py", line 10, in hydra_entry
main(cfg)
File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 69, in main
run(
File "/home/vrt/lightning-transformers/lightning_transformers/cli/train.py", line 60, in run
trainer.fit(model, datamodule=data_module)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 146, in run
self.on_advance_end()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
self._run_validation()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
self.val_loop.run()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 151, in run
output = self.on_run_end()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 134, in on_run_end
self._on_evaluation_epoch_end()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 241, in _on_evaluation_epoch_end
self.trainer.call_hook(hook_name)
File "/home/vrt/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1501, in call_hook
output = model_fx(*args, **kwargs)
File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/model.py", line 59, in on_validation_epoch_end
metric_dict = self.metric.compute()
File "/home/vrt/miniconda3/lib/python3.9/site-packages/torchmetrics/metric.py", line 380, in wrapped_func
value = compute(*args, **kwargs)
File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in compute
example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
File "/home/vrt/lightning-transformers/lightning_transformers/task/nlp/question_answering/datasets/squad/metric.py", line 23, in <listcomp>
example_ids = [reverse_lookup[i.item()] for i in self.example_ids]
KeyError: 0
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
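The failing line in metric.py builds a reverse lookup and then indexes it with the collected example ids; if the lookup is built from an empty mapping, the very first index raises the KeyError: 0 seen above. A minimal sketch of that failure mode (names are modeled on the traceback, and plain ints stand in for the tensor values that .item() would produce in the real metric):

```python
# No example id strings were recorded during validation, so the
# mapping is empty (assumed shape: example-id string -> integer index).
example_id_strings = {}

# The reverse lookup built from it is therefore empty too.
reverse_lookup = {index: string for string, index in example_id_strings.items()}

example_ids = [0, 1]  # ids collected during the validation epoch

try:
    ids = [reverse_lookup[i] for i in example_ids]
except KeyError as err:
    print(f"KeyError: {err}")  # matches the traceback: KeyError: 0
```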
Environment
- PyTorch Version: 1.6.0
- OS: Ubuntu 18.04.6 LTS
- How you installed PyTorch:
conda
- Python version: 3.9.7
- CUDA/cuDNN version: 11.4
- GPU models and configuration: 2x NVIDIA GeForce RTX 2080 Ti (First device not used)
- Any other relevant information: The same error occurs during the sanity check if trainer.num_sanity_val_steps=-1 is used, as in #184
Strangely, I got the KeyError: 0 at some point earlier today without using trainer.num_sanity_val_steps=0, but I haven't been able to reproduce it, nor do I get it when adding trainer.num_sanity_val_steps=0 as you say. Could caching be involved?
Ah, never mind: this happens at the evaluation step, so we have to let it finish training the epoch first. I can confirm I see this error too.
self.example_id_strings seems to be empty at the time we use it to create reverse_lookup, which will therefore also be empty.
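One defensive option (a sketch only; this is not the actual fix in PR #235, and compute_metric is a hypothetical stand-in for the metric's compute step) would be to fail loudly when the recorded state is empty instead of surfacing a bare KeyError:

```python
def compute_metric(example_id_strings, example_ids):
    """Sketch of a guarded compute step; names are modeled on the
    traceback, not taken from the repository."""
    if not example_id_strings:
        # Nothing was recorded during validation; raise a clearer
        # error than the KeyError: 0 from an empty reverse lookup.
        raise RuntimeError(
            "example_id_strings is empty; the metric state was never "
            "updated before compute() was called"
        )
    reverse_lookup = {index: string for string, index in example_id_strings.items()}
    return [reverse_lookup[i] for i in example_ids]
```

This would not fix the underlying problem (the state never being populated), but it would make the failure point obvious.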
I attempted to fix this issue in PR #235.
@SeanNaren ^^ 🐰
Bad bot.
Strangely, I can't close this issue myself?
The QA task is really broken... I don't have time to debug it, but if anyone can help, I would appreciate it!
@mariomeissner, would you be interested in diving in and debugging this issue? 🐰
I've been away for a while and don't know the current situation. Was PR #235 not enough? I'd be happy to dig into this again if you point me in some direction.