mrqa/MRQA-Shared-Task-2019

Error found during validation when using 2 GPUs (but it's OK when using one GPU)

ZHO9504 opened this issue · 9 comments

Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
EM: 61.2193, f1: 69.6262, qas_used_fraction: 1.0000, loss: 4.3453 ||: : 17502it [6:26:59, 1.33s/it]
2019-07-20 15:09:22,954 - INFO - allennlp.training.trainer - Validating
EM: 48.9301, f1: 59.0550, qas_used_fraction: 1.0000, loss: 5.1889 ||: : 94it [00:41, 2.15it/s]
Traceback (most recent call last):
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 21, in
run()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/init.py", line 102, in main
args.func(args)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
args.cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
cache_directory, cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 243, in train_model
metrics = trainer.train()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 493, in train
val_loss, num_batches = self._validation_loss()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
loss = self.batch_loss(batch_group, for_training=False)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/util.py", line 336, in data_parallel
losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: tensor.ndimension() == static_cast<int64_t>(expected_size.size()) ASSERT FAILED at /pytorch/torch/csrc/cuda/comm.cpp:232, please report a bug to PyTorch. (gather at /pytorch/torch/csrc/cuda/comm.cpp:232)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f6d3dad8441 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f6d3dad7d7a in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional<int>) + 0x962 (0x7f6d132be792 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #3: + 0x5a3d1c (0x7f6d33e0bd1c in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x130fac (0x7f6d33998fac in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5567e0e3c6e4 in python3.7)
frame #6: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #7: _PyEval_EvalFrameDefault + 0x4e8c (0x5567e0e982bc in python3.7)
frame #8: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #9: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #10: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #11: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #12: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #13: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7f6d33c1c301 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #14: PyCFunction_Call + 0xe7 (0x5567e0dffbe7 in python3.7)
frame #15: _PyEval_EvalFrameDefault + 0x5d21 (0x5567e0e99151 in python3.7)
frame #16: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #17: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #18: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #19: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #20: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #21: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #23: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #25: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #26: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #27: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #28: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #29: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #30: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #31: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #32: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #33: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #35: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #37: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #38: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #39: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #40: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #41: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #42: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #43: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #44: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #45: PyEval_EvalCodeEx + 0x44 (0x5567e0dda3c4 in python3.7)
frame #46: PyEval_EvalCode + 0x1c (0x5567e0dda3ec in python3.7)
frame #47: + 0x1e004d (0x5567e0ea304d in python3.7)
frame #48: _PyMethodDef_RawFastCallKeywords + 0xe9 (0x5567e0e3c569 in python3.7)
frame #49: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #50: _PyEval_EvalFrameDefault + 0x4755 (0x5567e0e97b85 in python3.7)
frame #51: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #52: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #53: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #54: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #55: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #56: + 0x222d77 (0x5567e0ee5d77 in python3.7)
frame #57: + 0x23ae95 (0x5567e0efde95 in python3.7)
frame #58: _Py_UnixMain + 0x3c (0x5567e0efdf7c in python3.7)
frame #59: __libc_start_main + 0xf0 (0x7f6d4ea12830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #60: + 0x1e0122 (0x5567e0ea3122 in python3.7)

I don't know why....

My running script is:
python3.7 -m allennlp.run train /home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/baseline/MRQA_BERTLarge.jsonnet -s Models/large_f5/ -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/train/TriviaQA-web.jsonl.gz', 'validation_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/dev-indomain/TriviaQA-web.jsonl.gz', 'trainer': {'cuda_device': [0,1], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '50000'}}}" --include-package mrqa_allennlp

The error occurs whatever the train_data_path is.
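
The assertion itself seems to say that torch.cuda.comm.gather() received per-device tensors with different numbers of dimensions; I'm not sure yet why that happens here. Since data_parallel() unsqueezes every replica's loss, the mismatch presumably means one replica returned a loss that was not a plain scalar. Below is my own minimal sketch (an illustration only, assuming a machine with at least 2 CUDA devices) of what trips this kind of check:

import torch
from torch.cuda import comm

# Hypothetical illustration: mimic two replicas handing back losses with
# different numbers of dimensions.
loss_a = torch.tensor([1.0], device="cuda:0")  # shape [1], e.g. a scalar loss after unsqueeze(0)
loss_b = torch.tensor(1.0, device="cuda:1")    # shape [],  a bare 0-dim scalar

# gather() takes the expected shape from the first tensor, so the second,
# 0-dim tensor fails the ndimension() check in torch/csrc/cuda/comm.cpp.
comm.gather([loss_a, loss_b], dim=0, destination=0)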

Hi ZHO9504, I will try to reproduce this, but since it does not happen on 1 GPU it's likely an allennlp problem with multi-GPU. Which version of allennlp are you using? Thanks

  1. Thank you for your reply. The version of allennlp I use:
    $ allennlp --version
    allennlp 0.8.5-unreleased
    I had the same issue with v0.8.4, and I'm on torch 1.1.0.
  2. Validation is OK on HotpotQA/SearchQA using one or two GPUs,
    but the issue appears when validating TriviaQA/NaturalQuestionsShort/SearchQA with 2 GPUs.
    A little strange...

It sounds like some edge case that's a bit difficult to reproduce...
Does it happen when you evaluate only on TriviaQA or NaturalQuestionsShort?

Yes, I evaluated on each of them, but only HotpotQA and SearchQA went well.
As long as the evaluation data includes a dataset such as TriviaQA, the procedure fails with this error.

OK, I'm trying to recreate and solve this, but it may take a few days.

I also got this error during multi-GPU validation, but it works fine on a single GPU. Using allennlp v0.8.4 and torch 1.1.0.

+1.
I also got this error during the multi-GPU validation phase. Using allennlp v0.8.4 and torch 1.1.0.
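
While waiting for a proper fix, one workaround that may be worth trying (just a guess on my part, not a confirmed fix) is to normalize the loss to a 0-dim scalar at the end of the model's forward() in mrqa_allennlp, so that the unsqueeze(0) in AllenNLP's data_parallel() always yields shape [1] on every replica (output_dict below is whatever dictionary the model's forward() returns):

# Hypothetical workaround sketch, not a confirmed fix: placed at the end of
# the model's forward(), right before the output dictionary is returned.
if "loss" in output_dict:
    # .mean() collapses any stray extra dimension, so every replica hands
    # back a 0-dim scalar and the multi-GPU gather sees consistent shapes.
    output_dict["loss"] = output_dict["loss"].mean()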

I was able to train on every MRQA task with any number of GPUs using pytorch-lightning. I published the scripts here: https://github.com/lucadiliello/mrqa-lightning