mrqa/MRQA-Shared-Task-2019

Error found during validation when using 2 GPUs (but it's OK when using one GPU)

ZHO9504 opened this issue · 9 comments

Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
EM: 61.2193, f1: 69.6262, qas_used_fraction: 1.0000, loss: 4.3453 ||: : 17502it [6:26:59, 1.33s/it]
2019-07-20 15:09:22,954 - INFO - allennlp.training.trainer - Validating
EM: 48.9301, f1: 59.0550, qas_used_fraction: 1.0000, loss: 5.1889 ||: : 94it [00:41, 2.15it/s]
Traceback (most recent call last):
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/gpu245/anaconda3/envs/emnlp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 21, in
run()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/init.py", line 102, in main
args.func(args)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 116, in train_model_from_args
args.cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 160, in train_model_from_file
cache_directory, cache_prefix)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/commands/train.py", line 243, in train_model
metrics = trainer.train()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 493, in train
val_loss, num_batches = self._validation_loss()
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 430, in _validation_loss
loss = self.batch_loss(batch_group, for_training=False)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/trainer.py", line 258, in batch_loss
output_dict = training_util.data_parallel(batch_group, self.model, self._cuda_devices)
File "/home/gpu245/.local/lib/python3.7/site-packages/allennlp/training/util.py", line 336, in data_parallel
losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, outputs)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/gpu245/.local/lib/python3.7/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: tensor.ndimension() == static_cast<int64_t>(expected_size.size()) ASSERT FAILED at /pytorch/torch/csrc/cuda/comm.cpp:232, please report a bug to PyTorch. (gather at /pytorch/torch/csrc/cuda/comm.cpp:232)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f6d3dad8441 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f6d3dad7d7a in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::gather(c10::ArrayRef<at::Tensor>, long, c10::optional<int>) + 0x962 (0x7f6d132be792 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch.so.1)
frame #3: + 0x5a3d1c (0x7f6d33e0bd1c in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x130fac (0x7f6d33998fac in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5567e0e3c6e4 in python3.7)
frame #6: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #7: _PyEval_EvalFrameDefault + 0x4e8c (0x5567e0e982bc in python3.7)
frame #8: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #9: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #10: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #11: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #12: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #13: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7f6d33c1c301 in /home/gpu245/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #14: PyCFunction_Call + 0xe7 (0x5567e0dffbe7 in python3.7)
frame #15: _PyEval_EvalFrameDefault + 0x5d21 (0x5567e0e99151 in python3.7)
frame #16: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #17: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #18: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #19: _PyEval_EvalCodeWithName + 0xbb9 (0x5567e0dd9db9 in python3.7)
frame #20: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #21: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #23: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #25: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #26: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #27: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #28: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #29: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #30: _PyEval_EvalFrameDefault + 0x6a0 (0x5567e0e93ad0 in python3.7)
frame #31: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #32: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #33: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #35: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #36: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #37: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #38: _PyEval_EvalFrameDefault + 0x4aa9 (0x5567e0e97ed9 in python3.7)
frame #39: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #40: _PyFunction_FastCallKeywords + 0x387 (0x5567e0e3ba27 in python3.7)
frame #41: _PyEval_EvalFrameDefault + 0x14ce (0x5567e0e948fe in python3.7)
frame #42: _PyFunction_FastCallKeywords + 0xfb (0x5567e0e3b79b in python3.7)
frame #43: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #44: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #45: PyEval_EvalCodeEx + 0x44 (0x5567e0dda3c4 in python3.7)
frame #46: PyEval_EvalCode + 0x1c (0x5567e0dda3ec in python3.7)
frame #47: + 0x1e004d (0x5567e0ea304d in python3.7)
frame #48: _PyMethodDef_RawFastCallKeywords + 0xe9 (0x5567e0e3c569 in python3.7)
frame #49: _PyCFunction_FastCallKeywords + 0x21 (0x5567e0e3c801 in python3.7)
frame #50: _PyEval_EvalFrameDefault + 0x4755 (0x5567e0e97b85 in python3.7)
frame #51: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #52: _PyFunction_FastCallKeywords + 0x325 (0x5567e0e3b9c5 in python3.7)
frame #53: _PyEval_EvalFrameDefault + 0x416 (0x5567e0e93846 in python3.7)
frame #54: _PyEval_EvalCodeWithName + 0x2f9 (0x5567e0dd94f9 in python3.7)
frame #55: _PyFunction_FastCallDict + 0x1d5 (0x5567e0dda5d5 in python3.7)
frame #56: + 0x222d77 (0x5567e0ee5d77 in python3.7)
frame #57: + 0x23ae95 (0x5567e0efde95 in python3.7)
frame #58: _Py_UnixMain + 0x3c (0x5567e0efdf7c in python3.7)
frame #59: __libc_start_main + 0xf0 (0x7f6d4ea12830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #60: + 0x1e0122 (0x5567e0ea3122 in python3.7)

I don't know why....

My running script is:
python3.7 -m allennlp.run train /home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/baseline/MRQA_BERTLarge.jsonnet -s Models/large_f5/ -o "{'dataset_reader': {'sample_size': 75000}, 'validation_dataset_reader': {'sample_size': 1000}, 'train_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/train/TriviaQA-web.jsonl.gz', 'validation_data_path': '/home/gpu245/haiou/emnlpworkshop/MRQA-Shared-Task-2019/data/dev-indomain/TriviaQA-web.jsonl.gz', 'trainer': {'cuda_device': [0,1], 'num_epochs': '2', 'optimizer': {'type': 'bert_adam', 'lr': 3e-05, 'warmup': 0.1, 't_total': '50000'}}}" --include-package mrqa_allennlp

The error occurs whatever the train_data_path is.
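
The assertion itself seems to say that torch.cuda.comm.gather() received per-device tensors with different numbers of dimensions; I'm not sure yet why that happens here. Since data_parallel() unsqueezes every replica's loss, the mismatch presumably means one replica returned a loss that was not a plain scalar. Below is my own minimal sketch (an illustration only, assuming a machine with at least 2 CUDA devices) of what trips this kind of check:

import torch
from torch.cuda import comm

# Hypothetical illustration: mimic two replicas handing back losses with
# different numbers of dimensions.
loss_a = torch.tensor([1.0], device="cuda:0")  # shape [1], e.g. a scalar loss after unsqueeze(0)
loss_b = torch.tensor(1.0, device="cuda:1")    # shape [],  a bare 0-dim scalar

# gather() takes the expected shape from the first tensor, so the second,
# 0-dim tensor fails the ndimension() check in torch/csrc/cuda/comm.cpp.
comm.gather([loss_a, loss_b], dim=0, destination=0)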

Hi ZHO9504, I will try to reproduce this, but since it does not happen on 1 GPU it's likely an allennlp problem with multi-GPU. Which version of allennlp are you using? Thanks

  1. Thank you for your reply. The version of allennlp I use:
    $ allennlp --version
    allennlp 0.8.5-unreleased
    I had the same issue with v0.8.4, and I'm on torch 1.1.0.
  2. Validation is OK on HotpotQA/SearchQA using one or two GPUs,
    but the issue appears when validating TriviaQA/NaturalQuestionsShort/SearchQA with 2 GPUs.
    A little strange...

It sounds like some edge case that's a bit difficult to reproduce...
Does it happen when you evaluate only on TriviaQA or NaturalQuestionsShort?

Yes, I evaluated on each of them, but only HotpotQA and SearchQA went well.
As long as the evaluation data includes a dataset such as TriviaQA, the procedure fails with this error.

OK, I'm trying to recreate and solve this, but it may take a few days.

I also got this error during multi-GPU validation, but it works fine on a single GPU. Using allennlp v0.8.4 and torch 1.1.0.

+1.
I also got this error during the multi-GPU validation phase. Using allennlp v0.8.4 and torch 1.1.0.
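
While waiting for a proper fix, one workaround that may be worth trying (just a guess on my part, not a confirmed fix) is to normalize the loss to a 0-dim scalar at the end of the model's forward() in mrqa_allennlp, so that the unsqueeze(0) in AllenNLP's data_parallel() always yields shape [1] on every replica (output_dict below is whatever dictionary the model's forward() returns):

# Hypothetical workaround sketch, not a confirmed fix: placed at the end of
# the model's forward(), right before the output dictionary is returned.
if "loss" in output_dict:
    # .mean() collapses any stray extra dimension, so every replica hands
    # back a 0-dim scalar and the multi-GPU gather sees consistent shapes.
    output_dict["loss"] = output_dict["loss"].mean()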

I was able to train on every MRQA task with any number of GPUs using pytorch-lightning. I published the scripts here: https://github.com/lucadiliello/mrqa-lightning