Multi-GPU training is not working
dlazesz opened this issue · 6 comments
In a multi-GPU environment (e.g. at lambda), the training stops with the following error:
Traceback (most recent call last):
File "emBERT/scripts/train_embert.py", line 502, in <module>
main()
File "emBERT/scripts/train_embert.py", line 460, in main
trainer.train()
File "emBERT/scripts/train_embert.py", line 239, in train
self.train_step(stats)
File "emBERT/scripts/train_embert.py", line 260, in train_step
label_ids, valid_ids, l_mask)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/dlazesz/bert_szeged_maxNP/embert_venv/emBERT/embert/model.py", line 24, in forward
device=next(self.parameters()).device
StopIteration
self.parameters() seems to yield an empty iterator. The same setup runs flawlessly if only one GPU is used via CUDA_VISIBLE_DEVICES="1".
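For reference, the failure can be reproduced outside emBERT with a toy module that mimics the pattern in embert/model.py (a minimal sketch; DeviceProbe is a made-up name, this is not emBERT code):

import torch
import torch.nn as nn

class DeviceProbe(nn.Module):
    """Toy module that looks up its own device the same way the failing forward() does."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # On torch >= 1.5, DataParallel replicas expose no parameters, so
        # next() on the empty iterator raises StopIteration in the workers.
        device = next(self.parameters()).device
        return self.linear(x.to(device))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(DeviceProbe().cuda())
    model(torch.randn(8, 4).cuda())  # StopIteration on torch 1.5.x, fine on <= 1.4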
Did you manage to run it in such an environment? Do you have any idea what could be wrong and how to fix this error?
I trained all my models on all 3 GPUs of lambda. How did you invoke the training script?
The following Makefile contains the commands I used:
setup:
	rm -rf embert_venv/
	virtualenv -p python3 embert_venv
	cd embert_venv && git clone https://github.com/DavidNemeskey/emBERT.git  # Models not needed
	./embert_venv/bin/pip install wheel
	./embert_venv/bin/pip install -r embert_venv/emBERT/requirements.txt

train:
	cd embert_venv && PYTHONPATH=`pwd`/emBERT ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out --do_train

train-one-gpu:
	cd embert_venv && PYTHONPATH=`pwd`/emBERT CUDA_VISIBLE_DEVICES="1" ./bin/python3 emBERT/scripts/train_embert.py --data_dir ../corpus --bert_model bert-base-multilingual-cased --task_name szeged_chunk --data_format tsv --output_dir out --do_train
I used the two commands above to set up and run the training. The ../corpus directory contains the corpus you supplied (train.txt, valid.txt, test.txt); the out dir is empty.
Did the underlying libraries change, or am I missing something?
Thank you for your help in advance!
I tried the command you posted (NOT the Makefile) and it works for me without a hitch, running on 3 GPUs. What environment do you use? On my side, I have
- python 3.7.4 (from miniconda)
- torch 1.3.0
- transformers 2.9.1
From the error, I would suspect the torch version first. Would you do me a favor and run the script by hand to see if the error manifests that way as well? I mean:
- create a virtualenv
- pip install -r requirements.txt
- run train_embert.py ...
If that doesn't work, would you install the whole package with pip install -e .
instead of just the requirements, and see if it fixes the issue? Thanks, looking forward to your results.
Pinning torch to 1.3.0-1.4.0 yields the following warning, but the training starts:
/home/dlazesz/bert_szeged_maxnp/embert_venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all
With torch 1.5.0 as well as 1.5.1, training stops prematurely at the very beginning with the aforementioned StopIteration exception.
I cannot judge whether the warning above is serious, or whether the StopIteration could be fixed somehow in emBERT itself.
The easiest solution (for the exception) would be to pin all versions in requirements.txt. (This would leave the warning untouched.)
Feel free to fix the issue in your way! BTW I really like this piece of software. :)
PS. In any case my environment is:
Python 3.6.9 (system, virtualenv)
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
dataclasses==0.7
Deprecated==1.2.10
filelock==3.0.12
future==0.18.2
idna==2.9
joblib==0.15.1
numpy==1.19.0
packaging==20.4
pkg-resources==0.0.0
progressbar==2.5
PyGithub==1.51
PyJWT==1.7.1
pyparsing==2.4.7
PyYAML==5.3.1
regex==2020.6.8
requests==2.24.0
sacremoses==0.0.43
sentencepiece==0.1.91
seqeval==0.0.5
six==1.15.0
tokenizers==0.7.0
torch==1.4.0
tqdm==4.46.1
transformers==2.11.0
urllib3==1.25.9
wrapt==1.12.1
The error is a known torch issue: pytorch/pytorch#40457
I am still not sure about the warning, though.
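One common workaround (I am not claiming this is what that torch issue ultimately recommends, and it is not emBERT's code) is to take the device from an input tensor instead of from self.parameters(); a hypothetical sketch mirroring the toy module above:

import torch
import torch.nn as nn

class SafeDeviceProbe(nn.Module):
    """Hypothetical variant of the toy module above that survives DataParallel replication."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Take the device from the input tensor instead of self.parameters();
        # input tensors are always present, even on torch 1.5 replicas.
        device = x.device
        return self.linear(x.to(device))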
@dlazesz Thanks for investigating the issue. I am locking torch < 1.5 in setup.py and requirements.txt.
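For reference, such a lock would look roughly like this in setup.py (a sketch only; the package name and the rest of install_requires here are placeholders, not the real setup.py):

from setuptools import setup, find_packages

setup(
    name='embert',            # placeholder package name
    packages=find_packages(),
    install_requires=[
        # torch 1.5.x breaks DataParallel + next(self.parameters())
        'torch>=1.3.0,<1.5',
        'transformers',
    ],
)

The same bound, torch<1.5, goes into requirements.txt.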
As for the warning, I think it's nothing to worry about. CrossEntropyLoss returns a scalar, and the Gather
function raises a warning in this case for some reason. But it still handles the data correctly.
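If someone wants to silence it anyway, the usual pattern is to average the gathered per-GPU losses before backward(); a self-contained toy sketch (ScalarLossModel is a made-up name, not emBERT code):

import torch
import torch.nn as nn

class ScalarLossModel(nn.Module):
    """Toy model that returns a scalar loss, like a forward() that applies CrossEntropyLoss."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x, y):
        return nn.functional.cross_entropy(self.linear(x), y)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(ScalarLossModel().cuda())
    x = torch.randn(8, 4).cuda()
    y = torch.randint(0, 2, (8,)).cuda()
    loss = model(x, y)   # 1-D tensor with one loss per GPU (this is what triggers the warning)
    loss = loss.mean()   # reduce to a single scalar before backward()
    loss.backward()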