manojpamk/pytorch_xvectors

Running speaker embedding training on multiple GPUs on a single node


Hello,
Thanks for sharing the PyTorch code for embedding training.
If we look at pytorch_xvectors/pytorch_run.sh:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1
train_xent.py exp/xvector_nnet_1a/egs/
From the above line, it seems like you are training the DNN on a single GPU. Is it possible to train using multiple GPUs?
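For reference, here is my guess at what a multi-GPU launch would look like (untested; it assumes train_xent.py handles the --local_rank argument that torch.distributed.launch passes to each process it spawns):

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2
train_xent.py exp/xvector_nnet_1a/egs/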

Further, if we look at the train_utils.py script:
def prepareModel(args):
    ...
    elif args.trainingMode == 'init':
        ...
        net.to(device)
        net = torch.nn.parallel.DistributedDataParallel(net,
            device_ids=[0],
            output_device=0)
        if torch.cuda.device_count() > 1:
            print("Using ", torch.cuda.device_count(), "GPUs!")
            net = nn.DataParallel(net)

Why are we using both torch.nn.parallel.DistributedDataParallel and nn.DataParallel here?
When I tried to train, it used only a single GPU. How does this need to be modified to train on multiple GPUs?
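For context, my understanding from the PyTorch documentation is that multi-GPU DistributedDataParallel runs one process per GPU, with each process pinned to its own device via the local rank. A rough, untested sketch of what I expected prepareModel to do (the argument parsing and the placeholder network here are mine, not from the repo):

import argparse
import torch
import torch.distributed as dist
import torch.nn as nn

# torch.distributed.launch passes --local_rank to each spawned process
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args, _ = parser.parse_known_args()

# one process per GPU: pin this process to its own device
dist.init_process_group(backend='nccl')
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)

net = nn.Linear(512, 512)  # placeholder for the x-vector network
net.to(device)
net = nn.parallel.DistributedDataParallel(net,
    device_ids=[args.local_rank],
    output_device=args.local_rank)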

I look forward to hearing from you.

Thanks.

K. Ahilan

Hello,

I think the code can be run on multiple GPUs using DataParallel, but I haven't figured out how to do this since I did not have access to a node with multiple GPUs in my university cluster.

I use DistributedDataParallel since it lets me run multiple training processes on a single GPU (spawned via torch.distributed.launch), which greatly improves training time. This was particularly useful since I had access to a single V100 node, and each process used ~4GB of GPU memory.
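Concretely, on the single V100 the launch looks something like the following, adjusting nproc_per_node to however many ~4GB processes fit on the card (the 4 here is just an example value, not what pytorch_run.sh ships with):

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=4
train_xent.py exp/xvector_nnet_1a/egs/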

I included the if statement that checks for multiple GPUs as a debug option, in case I ever got access to a multi-GPU node, but that never happened 😄

I'll leave this issue open in case someone figures out how to do this.