jxzhanggg/nonparaSeq2seqVC_code

Multi-GPU training

ivancarapinha opened this issue · 3 comments

Hello,
Could you please specify the steps to enable multi-GPU training?
I set distributed_run=True in hparams.py, then set --n_gpus=2 and CUDA_VISIBLE_DEVICES=0,3 in run.sh to select GPUs 0 and 3, respectively. After doing this, the code seems to enter some kind of deadlock: training never starts.
Thank you.

The use of multi-GPU training is basically the same as in https://github.com/NVIDIA/tacotron2.
First create a directory named "logs", then run
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=2 --hparams=distributed_run=True
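If you also want to pin the run to specific GPUs, as in the question, prepending the CUDA_VISIBLE_DEVICES setting to the same command should work (GPUs 0 and 3 from your setup are used here only as an example):

CUDA_VISIBLE_DEVICES=0,3 python -m multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=2 --hparams=distributed_run=True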

Thanks for your impressive work.

When I use multi-GPU training, e.g.
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --n_gpus=2 --hparams=distributed_run=True

I run into the following error:

Traceback (most recent call last):
  File "train.py", line 369, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 234, in train
    train_loader, valset, collate_fn = prepare_dataloaders(hparams)
  File "train.py", line 64, in prepare_dataloaders
    drop_last=True, collate_fn=collate_fn)
  File "/home/test/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 189, in __init__
    raise ValueError('sampler option is mutually exclusive with '
ValueError: sampler option is mutually exclusive with shuffle

Hi, as the error message says, when using multi-GPU training you need to set shuffle=False in the DataLoader, since the DistributedSampler already takes care of shuffling.
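For reference, here is a minimal sketch of what the distributed branch of prepare_dataloaders() typically looks like, following the NVIDIA/tacotron2 layout mentioned above; trainset, collate_fn, and hparams stand in for the objects this repo actually builds, so adapt the names to the code in train.py:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

if hparams.distributed_run:
    # DistributedSampler shards (and shuffles) the data across GPUs,
    # so the DataLoader itself must not shuffle on top of it.
    train_sampler = DistributedSampler(trainset)
    shuffle = False
else:
    train_sampler = None
    shuffle = True

train_loader = DataLoader(trainset,
                          num_workers=1,
                          shuffle=shuffle,
                          sampler=train_sampler,
                          batch_size=hparams.batch_size,
                          drop_last=True,
                          collate_fn=collate_fn)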