
Timed out when calculating FID

shlee625 opened this issue · 1 comments

Hi. First of all, thanks for your great work.

I have a problem when calculating FID during training. Every args.save_checkpoint_frequency iterations, the model is evaluated by calculating the FID score. However, this phase times out. Here is the error log.

[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802801 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802801 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 4414) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

Before this log appeared, there was no new output at the prompt, but the GPUs were still working. As I'm not that familiar with distributed training, I'd like to ask how to fix this problem. I've tried lowering args.fid_samples, but this doesn't help. Thank you.

Best Wishes,

I solved this problem. For those who encounter the same problem, please refer to my comment.

First, calculating FID takes about an hour and a half, but the default timeout argument of torch.distributed.init_process_group is 30 minutes (the Timeout(ms)=1800000 in the log). That's why I got this error.
All you need to do is pass a longer timeout to the torch.distributed.init_process_group call, e.g., timeout=datetime.timedelta(0, 7200) for 2 hours. Note that it takes a datetime.timedelta object, and the second positional argument of timedelta is in seconds.
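
For anyone who wants to see the timeout change spelled out, here is a minimal sketch. The helper name init_distributed and the two constants are hypothetical names chosen for illustration, not names from this repository; the sketch also assumes the script is started with a launcher such as torchrun, so the rank and world size come from environment variables.

```python
import datetime

# The 30-minute default seen in the log: Timeout(ms)=1800000.
DEFAULT_TIMEOUT = datetime.timedelta(milliseconds=1_800_000)

# Two hours; the same value as datetime.timedelta(0, 7200) (days=0, seconds=7200).
FID_TIMEOUT = datetime.timedelta(hours=2)


def init_distributed(backend="nccl", timeout=FID_TIMEOUT):
    """Hypothetical helper: initialise the process group with a longer timeout."""
    # Imported lazily so the constants above can be inspected without torch installed.
    import torch.distributed as dist

    # With no init_method given, PyTorch reads RANK, WORLD_SIZE, MASTER_ADDR and
    # MASTER_PORT from the environment (set by the launcher).
    dist.init_process_group(backend=backend, timeout=timeout)
```

Pick a timeout comfortably longer than your slowest evaluation pass; two hours is only what worked for a run of about an hour and a half.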

Second, I installed torch-fidelity==0.3.0, which was the newest version and the one pip installs by default. Its argument names changed compared to the previous version (0.2.0), and calculate_metrics no longer takes positional arguments, which is why I got a TypeError at that call. So, if you installed 0.3.0, you should change the line to:

metrics_dict = calculate_metrics(input1=save_dir, input2=fid_dataset, cuda=True, isc=False, fid=True, kid=False, verbose=False)

or just install torch-fidelity==0.2.0. I tried the former, but I believe the latter will work as well.
Hope my comments can help you.