[Finetuning OneFormer] How to use multiple GPUs
EricLe-dev opened this issue · 0 comments
EricLe-dev commented
Dear @NielsRogge. First and foremost, thank you so much for your fantastic work. I followed your tutorial and was able to finetune OneFormer. However, when I tried to finetune the model on multiple GPUs, it did not work.
I tried two approaches:
1. Using DataParallel
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader

# some code the same as your tutorial
processor.image_processor.num_text = model.config.num_queries - model.config.text_encoder_n_ctx
train_dataset = CustomDataset(processor)
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=16)

optimizer = AdamW(model.parameters(), lr=5e-5)

model = nn.DataParallel(model)
device = "cuda"
model.to(device)

model.train()
for epoch in range(20):  # loop over the dataset multiple times
    for batch in train_dataloader:
        # zero the parameter gradients
        optimizer.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}

        # forward pass
        outputs = model(**batch)

        # backward pass + optimize
        loss = outputs.loss
        print("Loss:", loss.item())
        loss.backward()
        optimizer.step()
This code runs without errors, but only GPU 0 is utilized; the other GPUs do not seem to do any work.
Here is the result from nvidia-smi while it's running:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.239.06 Driver Version: 470.239.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:3B:00.0 Off | N/A |
| 55% 58C P2 196W / 356W | 20651MiB / 24268MiB | 71% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:3C:00.0 Off | N/A |
| 59% 57C P2 121W / 356W | 8MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce ... Off | 00000000:5E:00.0 Off | N/A |
| 53% 54C P2 120W / 356W | 8MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce ... Off | 00000000:86:00.0 Off | N/A |
| 53% 47C P2 118W / 356W | 8MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce ... Off | 00000000:D8:00.0 Off | N/A |
| 60% 58C P2 137W / 356W | 8MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce ... Off | 00000000:D9:00.0 Off | N/A |
| 60% 58C P2 111W / 356W | 8MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2170 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2809467 C python 20643MiB |
| 1 N/A N/A 2170 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2170 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2170 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 2170 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 2170 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
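If it helps frame the question: my understanding is that `DataParallel` replicates the model and scatters the input along the batch dimension (dim 0), so with `batch_size=1` there is only one slice to hand out and the remaining GPUs stay idle. A toy sketch of that ceil-style split (mirroring what `torch.chunk` does during scatter; `split_batch` is my own illustrative helper, not a PyTorch API):

```python
def split_batch(batch_size, num_gpus):
    """Per-GPU shard sizes a ceil-division scatter (like torch.chunk) would produce."""
    per_gpu = -(-batch_size // num_gpus)  # ceiling division
    shards = []
    remaining = batch_size
    while remaining > 0:
        shards.append(min(per_gpu, remaining))
        remaining -= per_gpu
    return shards

print(split_batch(1, 6))   # [1] -> only one GPU receives data
print(split_batch(12, 6))  # [2, 2, 2, 2, 2, 2] -> all six GPUs get a shard
```

So if this is the explanation, would simply raising the batch size be the intended fix for `DataParallel`, or is something else going on?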
2. Using Accelerate
Following this tutorial, I modified the code as follows:
from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader

processor.image_processor.num_text = model.config.num_queries - model.config.text_encoder_n_ctx
train_dataset = CustomDataset(processor)
# val_dataset = CustomDataset(processor)
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=16)

optimizer = AdamW(model.parameters(), lr=5e-5)

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for epoch in range(20):  # loop over the dataset multiple times
    for batch in train_dataloader:
        # zero the parameter gradients
        optimizer.zero_grad()
        # batch = {k: v.to(device) for k, v in batch.items()}  # device placement handled by accelerator.prepare

        # forward pass
        outputs = model(**batch)

        # backward pass + optimize
        loss = outputs.loss
        print("Loss:", loss.item())
        accelerator.backward(loss)
        optimizer.step()
This code also runs without errors, but again only GPU 0 does any work.
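One thing I may be missing: as far as I understand, Accelerate only spawns multiple processes when the script is started with `accelerate launch` (or `torchrun`), which sets distributed environment variables such as `WORLD_SIZE`; running plain `python script.py` gives a single process and hence a single GPU. A toy sketch of that inference (`inferred_world_size` is a hypothetical helper for illustration, not an Accelerate API):

```python
def inferred_world_size(env):
    """Number of processes a distributed run would see from its environment.

    Launchers like `accelerate launch` export WORLD_SIZE; a bare `python`
    invocation does not, so the run falls back to a single process.
    """
    return int(env.get("WORLD_SIZE", "1"))

print(inferred_world_size({}))                   # 1 -> single process, GPU 0 only
print(inferred_world_size({"WORLD_SIZE": "6"}))  # 6 -> set by the launcher
```

Is launching the script this way all that is needed, or does the training loop itself also need changes?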
I'm quite sure I'm missing something here. Could you please point me in the right direction? Thank you so much!