Multi-GPU training is not working
dlazesz opened this issue · 6 comments
In a multi-GPU environment (e.g. at lambda), the training stops with the following error:
Traceback (most recent call last):
File "emBERT/scripts/train_embert.py", line 502, in <module>
main()
File "emBERT/scripts/train_embert.py", line 460, in main
trainer.train()
File "emBERT/scripts/train_embert.py", line 239, in train
self.train_step(stats)
File "emBERT/scripts/train_embert.py", line 260, in train_step
label_ids, valid_ids, l_mask)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/emBERT/embert/model.py", line 24, in forward
device=next(self.parameters()).device
StopIteration
self.parameters() seems to yield an empty iterator. The same setup runs flawlessly if only one GPU is used via CUDA_VISIBLE_DEVICES="1".
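For reference, the failure can be reproduced outside emBERT with a toy module that mimics the pattern in embert/model.py (a minimal sketch; DeviceProbe is a made-up name, this is not emBERT code):

import torch
import torch.nn as nn

class DeviceProbe(nn.Module):
    """Toy module that looks up its own device the same way the failing forward() does."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # On torch >= 1.5, DataParallel replicas expose no parameters, so
        # next() on the empty iterator raises StopIteration in the workers.
        device = next(self.parameters()).device
        return self.linear(x.to(device))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(DeviceProbe().cuda())
    model(torch.randn(8, 4).cuda())  # StopIteration on torch 1.5.x, fine on <= 1.4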
Did you manage to run it in such an environment? Do you have any idea what could be wrong and how to fix this error?
I trained all my models on all 3 GPUs of lambda. How did you invoke the training script?
The following Makefile contains the commands I used:
setup:
	rm -rf embert_venv/
	virtualenv -p python3 embert_venv
	cd embert_venv && git clone https://github.com/DavidNemeskey/emBERT.git  # Models not needed
	./embert_venv/bin/pip install wheel
	./embert_venv/bin/pip install -r embert_venv/emBERT/requirements.txt

train:
	cd embert_venv && PYTHONPATH=`pwd`/emBERT ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out --do_train

train-one-gpu:
	cd embert_venv && PYTHONPATH=`pwd`/emBERT CUDA_VISIBLE_DEVICES="1" ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out --do_train
I used the two commands above to set up and run the training. The ../corpus directory contains the corpus you supplied (train.txt, valid.txt, test.txt); the out dir is empty.
Did the underlying libraries change, or am I missing something?
Thank you for your help in advance!
I tried the command you posted (NOT the Makefile) and it works for me without a hitch, running on 3 GPUs. What environment do you use? On my side, I have
- python 3.7.4 (from miniconda)
- torch 1.3.0
- transformers 2.9.1
From the error, I would suspect the torch version first. Would you do me a favor and run the script by hand to see if the error manifests that way as well? I mean:
- create a virtualenv
- pip install -r requirements.txt
- run train_embert.py ...
If that doesn't work, would you install the whole package with pip install -e .
instead of just the requirements, and see if it fixes the issue? Thanks, looking forward to your results.
Pinning torch to 1.3.0-1.4.0 yields the following warning, but the training starts:
/home/dlazesz/bert_szeged_maxnp/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all
With torch 1.5.0 as well as 1.5.1, training stops prematurely at the very beginning with the aforementioned StopIteration exception.
I cannot judge whether the warning above is serious, or whether the StopIteration could be fixed somehow in emBERT itself.
The easiest solution (for the exception) would be to pin all versions in requirements.txt. (This would leave the warning untouched.)
Feel free to fix the issue in your way! BTW I really like this piece of software. :)
PS. In any case my environment is:
Python 3.6.9 (system, virtualenv)
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
dataclasses==0.7
Deprecated==1.2.10
filelock==3.0.12
future==0.18.2
idna==2.9
joblib==0.15.1
numpy==1.19.0
packaging==20.4
pkg-resources==0.0.0
progressbar==2.5
PyGithub==1.51
PyJWT==1.7.1
pyparsing==2.4.7
PyYAML==5.3.1
regex==2020.6.8
requests==2.24.0
sacremoses==0.0.43
sentencepiece==0.1.91
seqeval==0.0.5
six==1.15.0
tokenizers==0.7.0
torch==1.4.0
tqdm==4.46.1
transformers==2.11.0
urllib3==1.25.9
wrapt==1.12.1
The error is a known torch issue: pytorch/pytorch#40457
I am still not sure about the warning, though.
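One common workaround (I am not claiming this is what that torch issue ultimately recommends, and it is not emBERT's code) is to take the device from an input tensor instead of from self.parameters(); a hypothetical sketch mirroring the toy module above:

import torch
import torch.nn as nn

class SafeDeviceProbe(nn.Module):
    """Hypothetical variant of the toy module above that survives DataParallel replication."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Take the device from the input tensor instead of self.parameters();
        # input tensors are always present, even on torch 1.5 replicas.
        device = x.device
        return self.linear(x.to(device))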
@dlazesz Thanks for investigating the issue. I am locking torch < 1.5 in setup.py and requirements.txt.
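For reference, such a lock would look roughly like this in setup.py (a sketch only; the package name and the rest of install_requires here are placeholders, not the real setup.py):

from setuptools import setup, find_packages

setup(
    name='embert',            # placeholder package name
    packages=find_packages(),
    install_requires=[
        # torch 1.5.x breaks DataParallel + next(self.parameters())
        'torch>=1.3.0,<1.5',
        'transformers',
    ],
)

The same bound, torch<1.5, goes into requirements.txt.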
As for the warning, I think it's nothing to worry about. CrossEntropyLoss returns a scalar, and the Gather
function raises a warning in this case for some reason. But it still handles the data correctly.
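If someone wants to silence it anyway, the usual pattern is to average the gathered per-GPU losses before backward(); a self-contained toy sketch (ScalarLossModel is a made-up name, not emBERT code):

import torch
import torch.nn as nn

class ScalarLossModel(nn.Module):
    """Toy model that returns a scalar loss, like a forward() that applies CrossEntropyLoss."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x, y):
        return nn.functional.cross_entropy(self.linear(x), y)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(ScalarLossModel().cuda())
    x = torch.randn(8, 4).cuda()
    y = torch.randint(0, 2, (8,)).cuda()
    loss = model(x, y)   # 1-D tensor with one loss per GPU (this is what triggers the warning)
    loss = loss.mean()   # reduce to a single scalar before backward()
    loss.backward()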