Multi-GPU training
ivancarapinha opened this issue · 3 comments
Hello,
Could you please specify the steps to enable multi-GPU training?
I set distributed_run=True in hparams.py, then set --n_gpus=2 and CUDA_VISIBLE_DEVICES=0,3 in run.sh to select GPUs 0 and 3. After this, training never starts; the process appears to enter some kind of deadlock.
Thank you.
The use of multi-GPU training is basically the same as in https://github.com/NVIDIA/tacotron2.
First create a directory named "logs", then run
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=2 --hparams=distributed_run=True
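Combining this with the GPU selection from the original question, run.sh might look like the sketch below (the flags are the ones from this thread; the directory names are illustrative):

```shell
# Expose only physical GPUs 0 and 3; inside the job they are seen as devices 0 and 1.
export CUDA_VISIBLE_DEVICES=0,3

# The training script expects a "logs" directory to exist.
mkdir -p logs outdir

python -m multiproc train.py --output_directory=outdir \
    --log_directory=logdir --n_gpus=2 --hparams=distributed_run=True
```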
Thanks for your impressive work.
When I use multi-GPU training, e.g.
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=2 --hparams=distributed_run=True
I run into the error shown below:
Traceback (most recent call last):
  File "train.py", line 369, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 234, in train
    train_loader, valset, collate_fn = prepare_dataloaders(hparams)
  File "train.py", line 64, in prepare_dataloaders
    drop_last=True, collate_fn=collate_fn)
  File "/home/test/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 189, in __init__
    raise ValueError('sampler option is mutually exclusive with '
ValueError: sampler option is mutually exclusive with shuffle
Hi, as the error message says: when using multi-GPU training you need to set shuffle=False in the DataLoader, because the distributed sampler that is passed in already controls the sample ordering.
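The fix can be sketched in isolation like this. It is a minimal standalone example, not the repo's actual prepare_dataloaders code: the dataset is a toy TensorDataset, and num_replicas/rank are passed explicitly so the sampler works without an initialized process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for the real one.
dataset = TensorDataset(torch.arange(8).float())

# DistributedSampler normally reads num_replicas/rank from the process
# group; they are passed explicitly here so this sketch runs standalone.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0)

# shuffle must stay False (its default) when a sampler is supplied;
# shuffle=True together with sampler raises the ValueError from the traceback.
loader = DataLoader(dataset, batch_size=2, sampler=sampler,
                    drop_last=True, shuffle=False)

print(len(loader))  # each of the 2 replicas sees 4 samples -> 2 batches
```

The DistributedSampler itself shuffles (per epoch) by default, so disabling shuffle in the DataLoader does not mean the training data is seen in a fixed order.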