How to apply multi-gpu training?
lifuguan opened this issue · 3 comments
lifuguan commented
Hello, and thanks for the great work! I'm wondering how we can apply multi-GPU training.
I use the following command:
python train.py --config configs/gnt_ft_rffr.txt --distributed --local_rank 2
but it fails with the following error:
Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
The distributed training code in train.py is shown below:
if args.distributed:
    torch.distributed.init_process_group(backend="nccl", init_method="env://localhost:50000")
    args.local_rank = int(os.environ.get("LOCAL_RANK"))
    torch.cuda.set_device(args.local_rank)
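For context, here is a minimal sketch (not the repository's code) of what the env:// rendezvous expects. A distributed launcher sets these variables for every process it spawns, so running train.py by hand leaves them unset and triggers the error above:

import os
import torch
import torch.distributed as dist

# Set by the launcher (torch.distributed.launch --use_env, or torchrun);
# env:// also requires MASTER_ADDR and MASTER_PORT, which the launcher
# derives from --master_addr/--master_port.
rank = int(os.environ["RANK"])              # global process rank
local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(local_rank)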
MukundVarmaT commented
Hi,
Thank you for your interest in our work! To train on multiple GPUs, run:
python3 -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=<num-gpus> --use_env --master_port=21221 train.py ... (remaining args)
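If you are on a newer PyTorch (1.10+), torch.distributed.launch is deprecated in favor of torchrun, which always passes LOCAL_RANK through the environment (so --use_env is implied):

torchrun --nnodes=1 --node_rank=0 --nproc_per_node=<num-gpus> --master_port=21221 train.py ... (remaining args)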
lifuguan commented
Thanks! One follow-up question: if I train the model with 8 GPUs, should I change N_rand from 4096 to 512?
MukundVarmaT commented
Yes, that's correct!
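For reference, the arithmetic behind this (assuming N_rand is the per-process ray batch, so the effective global batch per step stays constant):

total_rays = 4096                      # N_rand used for single-GPU training
num_gpus = 8
per_gpu_rays = total_rays // num_gpus  # 4096 // 8 = 512
print(per_gpu_rays * num_gpus)         # 4096 rays per optimization step, unchanged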