How to apply multi-gpu training?
lifuguan opened this issue · 3 comments
lifuguan commented
Hello, and thanks for the great work! I'm wondering how we can apply multi-GPU training.
I use the following command:
python train.py --config configs/gnt_ft_rffr.txt --distributed --local_rank 2
but it fails with the following error:
Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
The distributed training code in train.py is shown below:
if args.distributed:
    torch.distributed.init_process_group(backend="nccl", init_method="env://localhost:50000")
    args.local_rank = int(os.environ.get("LOCAL_RANK"))
    torch.cuda.set_device(args.local_rank)
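For context, here is a minimal sketch (not the repository's code) of what the env:// rendezvous expects. A distributed launcher sets these variables for every process it spawns, so running train.py by hand leaves them unset and triggers the error above:

import os
import torch
import torch.distributed as dist

# Set by the launcher (torch.distributed.launch --use_env, or torchrun);
# env:// also requires MASTER_ADDR and MASTER_PORT, which the launcher
# derives from --master_addr/--master_port.
rank = int(os.environ["RANK"])              # global process rank
local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(local_rank)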
MukundVarmaT commented
Hi,
Thank you for your interest in our work! To train on multiple GPUs, run:
python3 -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=<num-gpus> --use_env --master_port=21221 train.py ... (remaining args)
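If you are on a newer PyTorch (1.10+), torch.distributed.launch is deprecated in favor of torchrun, which always passes LOCAL_RANK through the environment (so --use_env is implied):

torchrun --nnodes=1 --node_rank=0 --nproc_per_node=<num-gpus> --master_port=21221 train.py ... (remaining args)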
lifuguan commented
Thanks! One follow-up question: if I train the model with 8 GPUs, should I change N_rand from 4096 to 512?
MukundVarmaT commented
Yes, that's correct!
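For reference, the arithmetic behind this (assuming N_rand is the per-process ray batch, so the effective global batch per step stays constant):

total_rays = 4096                      # N_rand used for single-GPU training
num_gpus = 8
per_gpu_rays = total_rays // num_gpus  # 4096 // 8 = 512
print(per_gpu_rays * num_gpus)         # 4096 rays per optimization step, unchanged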