
How to run training with a single GPU

Dmytro-Shvetsov opened this issue · 1 comments

I am trying to launch training of any of the YOLOF models. However, when I run
pods_train --num-gpus 1 --num-machines 1
I get the following error:

Traceback (most recent call last):
  File "/cyclists/lib/YOLOF/tools/", line 109, in <module>
  File "/cyclists/lib/YOLOF/cvpods/engine/", line 56, in launch
  File "/cyclists/lib/YOLOF/tools/", line 95, in main
  File "/cyclists/lib/YOLOF/cvpods/engine/", line 270, in train
    super().train(self.start_iter, self.start_epoch, self.max_iter)
  File "/cyclists/lib/YOLOF/cvpods/engine/", line 84, in train
  File "/cyclists/lib/YOLOF/cvpods/engine/", line 185, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../yolof_base/", line 134, in forward
    pred_logits, pred_anchor_deltas)
  File "../yolof_base/", line 210, in losses
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/", line 935, in all_reduce
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/", line 210, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

Could you tell me what I am doing wrong?
My setup is:

| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 1070    Off  | 00000000:00:10.0 Off |                  N/A |
|  0%   46C    P8     8W / 180W |     20MiB /  8119MiB |      0%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |

CUDA 10.1

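If you want to run a job on a single GPU, please make sure that the distributed parts of the code are handled properly. The traceback above shows all_reduce being called from losses while no default process group is initialized, so such calls have to be skipped (or a process group has to be initialized) when only one process is running.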
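
One way to handle this is to guard the collective call, as in the rough sketch below. It is not the actual YOLOF code: the helper names `is_dist_ready`, `safe_world_size`, and `all_reduce_mean`, and the variable `num_foreground`, are hypothetical and only illustrate the pattern. The sketch assumes the value being reduced in `losses` is a tensor that is averaged across processes. Only `torch.distributed.is_available`, `is_initialized`, `get_world_size`, and `all_reduce` are real PyTorch calls.

```python
import torch
import torch.distributed as dist


def is_dist_ready() -> bool:
    """Return True only if torch.distributed is available and a
    default process group has been initialized."""
    return dist.is_available() and dist.is_initialized()


def safe_world_size() -> int:
    """World size, falling back to 1 for a plain single-process run."""
    return dist.get_world_size() if is_dist_ready() else 1


def all_reduce_mean(value: torch.Tensor) -> torch.Tensor:
    """Average a tensor across processes.

    With no initialized process group (the single-GPU case from the
    traceback), the tensor is returned unchanged instead of calling
    all_reduce, which would raise the AssertionError.
    """
    if not is_dist_ready():
        return value
    value = value.clone()  # all_reduce works in place, so keep the input intact
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value / dist.get_world_size()


# Hypothetical usage inside a loss function: normalize by a count
# that is averaged over all workers when training is distributed.
num_foreground = torch.tensor(12.0)
num_foreground = all_reduce_mean(num_foreground)
```

With a guard like this, the same code path should work for both a single-process run and a multi-GPU run, because the collective operation is only issued when a process group actually exists.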