jxzhanggg/nonparaSeq2seqVC_code

Multi-GPU training

ivancarapinha opened this issue · 3 comments

Hello,
Could you please specify the steps to enable multi-GPU training?
I set distributed_run=True in hparams.py, then set --n_gpus=2 and CUDA_VISIBLE_DEVICES=0,3 in run.sh to select GPUs 0 and 3, respectively. After doing this, the code seems to enter some kind of deadlock: training never starts.
Thank you.

The use of multi-GPU training is basically the same as in https://github.com/NVIDIA/tacotron2.
First create a directory named "logs", then run
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=2 --hparams=distributed_run=True
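If you also want to pin the run to specific GPUs, as in the question, prepending the CUDA_VISIBLE_DEVICES setting to the same command should work (GPUs 0 and 3 from your setup are used here only as an example):

CUDA_VISIBLE_DEVICES=0,3 python -m multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=2 --hparams=distributed_run=True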

Thanks for your impressive work.

When I use multi-GPU training, e.g.
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=2 --hparams=distributed_run=True

I run into the following error:

Traceback (most recent call last):
  File "train.py", line 369, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 234, in train
    train_loader, valset, collate_fn = prepare_dataloaders(hparams)
  File "train.py", line 64, in prepare_dataloaders
    drop_last=True, collate_fn=collate_fn)
  File "/home/test/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 189, in __init__
    raise ValueError('sampler option is mutually exclusive with '
ValueError: sampler option is mutually exclusive with shuffle

Hi, as the error message says, when using multi-GPU training you need to set shuffle=False in the DataLoader, since the DistributedSampler already takes care of shuffling.
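For reference, here is a minimal sketch of what the distributed branch of prepare_dataloaders() typically looks like, following the NVIDIA/tacotron2 layout mentioned above; trainset, collate_fn, and hparams stand in for the objects this repo actually builds, so adapt the names to the code in train.py:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

if hparams.distributed_run:
    # DistributedSampler shards (and shuffles) the data across GPUs,
    # so the DataLoader itself must not shuffle on top of it.
    train_sampler = DistributedSampler(trainset)
    shuffle = False
else:
    train_sampler = None
    shuffle = True

train_loader = DataLoader(trainset,
                          num_workers=1,
                          shuffle=shuffle,
                          sampler=train_sampler,
                          batch_size=hparams.batch_size,
                          drop_last=True,
                          collate_fn=collate_fn)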