The issue of training occupying two GPUs
yaoyao674 opened this issue · 4 comments
Hello author, your work is very good.
What I would like to ask: when I execute python train.py --config configs/h36m/MotionAGFormer-xsmall.yaml, I observe that it trains on two GPUs. Is this normal? I did not find any place in the code where the number of GPUs is set. Can you point it out? Thank you.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = load_model(args)
if torch.cuda.is_available():
    model = torch.nn.DataParallel(model)
model.to(device)
Lines 256-260 in train.py. You could set the parameters of DataParallel as follows:
if torch.cuda.is_available():
    model = torch.nn.DataParallel(model, device_ids=[0, 1])
You can set device_ids = [0, 1, 2, ...] to whichever subset of the GPUs on your machine you want; with no device_ids argument, DataParallel uses all visible GPUs, which is why you see training on two GPUs.
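For instance, restricting training to a single GPU is just a matter of passing device_ids=[0]. A minimal sketch (a small nn.Linear stands in for the repo's load_model(args), which is not reproduced here):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for load_model(args) in train.py

if torch.cuda.is_available():
    # Restrict DataParallel to the first GPU only, instead of all visible GPUs
    model = nn.DataParallel(model, device_ids=[0])
    model = model.to('cuda')
```

With device_ids=[0], all replicas and the output gather happen on GPU 0, so the second GPU stays idle.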
Thanks @AsukaCamellia for answering it. I'm closing the issue.
Thanks for your help @AsukaCamellia @SoroushMehraban
Before you execute train.py, you can specify which GPUs you want to use, e.g. export CUDA_VISIBLE_DEVICES=0,1, which will run the program on the first and second GPUs in the machine.
The other way to do the same thing:
CUDA_VISIBLE_DEVICES=0,1 python train.py --config configs/h36m/MotionAGFormer-xsmall.yaml
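You can verify the effect from inside Python as well. One caveat: the variable must be set before CUDA is initialized (ideally before importing torch), or it is ignored. A small sketch:

```python
import os

# Must be set before torch initializes CUDA (ideally before `import torch`)
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch

# With only one device visible, DataParallel has nothing to split across:
# device_count() reports 1 on a CUDA machine, 0 without CUDA
print(torch.cuda.device_count())
```

Since DataParallel defaults to all visible devices, masking the environment this way needs no code changes in train.py.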
I suggest you use torch.nn.parallel.DistributedDataParallel rather than torch.nn.DataParallel if you need to train the model on multiple GPUs; DDP runs one process per GPU and generally scales better than DataParallel's single-process replication.