SeungjunNah/DeepDeblur-PyTorch

DistributedEvalSampler hangs at the end of the script when using DDP

Closed this issue · 5 comments

Dear author,
First of all, thank you for your great work!

I am trying to use your implementation of DistributedEvalSampler for evaluation, jointly with DDP.
(with shuffle=False and without calling set_epoch(); after DistributedEvalSampler yields the test samples for evaluating the model, my program should finish)

At the end of the script, my program hangs at 100% GPU utilization on 2 of my 3 GPUs.
(the last device terminates cleanly with no errors)
When I replace it with DistributedSampler, this does not occur.

I suspected the logging (e.g., wandb) performed on the rank 0 device,
but it is not the root cause, as the hang still occurs when I turn off the logging tool.

Could you point out any conditions I might have missed?
Thank you in advance.

Best,
Adam

Hi @vaseline555,

  1. Is your dataset size divisible by the number of GPUs?
    If so, there should be no difference in the behavior of DistributedSampler and DistributedEvalSampler.

  2. Are you using any kind of communication between processes that requires synchronization, e.g., back-propagation?
    DistributedEvalSampler does not require any communication between processes, so I don't think it is the source of the hang.
    However, other synchronization-based operations may expect the same dataset length on every process.
    For example, if your total dataset size is 5 and you are using 3 processes, GPUs 0 and 1 will be processing their 2nd item while GPU 2 is already done after its 1st iteration (see the sketch below).
    If you use a synchronization-based operation there, GPUs 0 and 1 will wait for a response from GPU 2 that will never come.
    When I need to do backpropagation at test time for each item, I turn off synchronization:

self.model.model.G.require_backward_grad_sync = False   # compute without DDP sync
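
To make the uneven split concrete, here is a minimal standalone sketch. The round-robin slicing is an illustration of the idea rather than the sampler's verbatim code, and the sizes follow the 5-samples / 3-processes example above:

# Hypothetical sizes from the example above: 5 samples shared by 3 processes.
dataset_size = 5
world_size = 3
for rank in range(world_size):
    shard = list(range(dataset_size))[rank::world_size]  # round-robin shard, no padding
    print(f"rank {rank}: indices {shard} -> {len(shard)} iterations")
# rank 0: indices [0, 3] -> 2 iterations
# rank 1: indices [1, 4] -> 2 iterations
# rank 2: indices [2]    -> 1 iteration
# DistributedSampler would pad the index list so all ranks run 2 iterations;
# DistributedEvalSampler keeps the true length, so a collective call placed
# inside the per-batch loop (dist.barrier(), dist.all_reduce(), ...) hangs:
# rank 2 exits the loop after 1 iteration while ranks 0 and 1 wait for it.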

Best,
Seungjun

Dear @SeungjunNah,

Thank you for your detailed answers.
As you presumed, it is exactly case 2 that I was facing: uneven inputs across different ranks.

Though there is no typical synchronization operation like backward() (only .item() or .detach().cpu()),
the main problem was where I called torch.distributed.barrier().

I called it at the end of every iteration, not at the end of the epoch.
Thus, when the rank with fewer inputs runs out of samples (it has fewer iterations than the others),
it exits the evaluation loop earlier than the rest, and the other ranks hang at the barrier.

I fixed it by moving the barrier to a different position (i.e., the end of the epoch), and now things are working well.
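
For anyone hitting the same issue, here is a rough sketch of what the change looks like (the function and names are illustrative, assuming the usual one-process-per-GPU setup, not my actual code):

import torch
import torch.distributed as dist

@torch.no_grad()
def evaluate_epoch(model, eval_loader, device):
    model.eval()
    # A dist.barrier() *inside* this loop hangs with uneven shards: the rank
    # with fewer batches stops calling it while the other ranks keep waiting.
    for inputs, _ in eval_loader:
        outputs = model(inputs.to(device))
        # ... accumulate per-rank metrics locally here ...
    # Synchronize once per epoch, after every rank has drained its own shard.
    dist.barrier()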
While Googling, I found that many people have trouble handling uneven inputs with DDP
(FYI: pytorch/pytorch#38174; Lightning-AI/pytorch-lightning#3325; pytorch/pytorch#72423). Although I also tried the DDP join() context manager, your sampler is what finally worked as a solution. 👍
I would like to thank you again for sharing your implementation of DistributedEvalSampler.

Have a nice day!
Thank you.

Sincerely,
Adam

DaoD commented

How can I use DistributedEvalSampler when I have to use dist.all_gather() to collect results? Many thx!

@DaoD
I don't know where you want to call all_gather but I do all_reduce outside the loop.
In my case, all processes are independent and the communications are done after the loop to collect loss/metric statistics.

In train.py, I compute loss/metrics from the outputs here.

self.criterion(output, target)

Outside the loop, here, I call

self.criterion.normalize()

which is defined here with dist.all_reduce inside.

If you call all_gather inside the for loop, I think it will hang.
But then, that would be a case where you need all processes to work together, which is not an intended use case of DistributedEvalSampler.
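
In case it helps, here is a rough sketch of collecting statistics after the loop. The accumulator names and the one-GPU-per-process assumption are illustrative, not the repository's actual code; gathering raw per-sample results is shown as a commented alternative since per-rank list lengths can differ:

import torch
import torch.distributed as dist

device = torch.device('cuda', dist.get_rank())  # assuming one GPU per process
loss_sum = torch.zeros(1, device=device)        # filled inside the eval loop
sample_count = torch.zeros(1, device=device)    # number of samples on this rank

# ... evaluation loop: loss_sum += batch_loss * batch_size; sample_count += batch_size ...

# After the loop: each collective is reached exactly once per rank, so it cannot hang.
dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
dist.all_reduce(sample_count, op=dist.ReduceOp.SUM)
global_mean_loss = (loss_sum / sample_count).item()

# If you need every per-sample result rather than a reduced statistic,
# dist.all_gather_object() after the loop also handles uneven list lengths:
# gathered = [None] * dist.get_world_size()
# dist.all_gather_object(gathered, per_rank_results)  # per_rank_results: this rank's own list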

DaoD commented

@SeungjunNah Thanks for your reply! I will try to use all_gather outside the data loop.