advimman/CIPS

Timed out when calculating FID

shlee625 opened this issue · 1 comments

Hi. First of all, thanks for your great work.

I have a problem when calculating FID during training. Every args.save_checkpoint_frequency iterations, the model is evaluated by computing the FID score, and at this stage the run times out. Here is the error log.

[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802801 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802801 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 4414) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

Before this log appeared, the terminal showed no new output, but the GPUs were still busy. As I'm not very familiar with distributed training, I'd like to ask how to fix this problem. I've tried lowering args.fid_samples, but that doesn't help. Thank you.

Best Wishes,
Lee

I solved this problem. For those who encounter the same problem, please refer to my comment.

First of all, it takes about one and a half hours to calculate FID, but the default timeout argument of torch.distributed.init_process_group is 30 minutes (the Timeout(ms)=1800000 in the log). That's why I got this error.
https://github.com/saic-mdal/CIPS/blob/eadae6e45d8c1f3657faa88a065b59990747cd16/train.py#L351
All you need to do is pass a longer timeout at the above line, e.g., timeout=datetime.timedelta(0, 7200) for 2 hours. Note that the argument takes a datetime.timedelta object, whose positional arguments are (days, seconds), so the 7200 here is in seconds.
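As a rough sketch of the timeout change described above: the helper name init_distributed and the constant FID_SAFE_TIMEOUT are hypothetical, and the backend and init_method values are assumptions about how the training script sets up its process group, so adapt them to the actual call in train.py.

```python
import datetime

# FID evaluation took about 1.5 hours in the report above, so 2 hours
# leaves some headroom. Adjust to your own evaluation time.
FID_SAFE_TIMEOUT = datetime.timedelta(hours=2)

# The default that produced the error: 30 minutes = 1,800,000 ms,
# matching Timeout(ms)=1800000 in the log.
DEFAULT_TIMEOUT_MS = datetime.timedelta(minutes=30).total_seconds() * 1000


def init_distributed(timeout=FID_SAFE_TIMEOUT):
    """Hypothetical helper: initialise the process group with a longer timeout."""
    # Imported lazily so this sketch can be read and checked without torch.
    import torch.distributed as dist

    # backend and init_method are assumptions; keep whatever train.py uses.
    dist.init_process_group(backend="nccl", init_method="env://", timeout=timeout)


# timedelta(0, 7200) means days=0, seconds=7200, i.e. the same 2 hours.
assert FID_SAFE_TIMEOUT == datetime.timedelta(0, 7200)
```

Writing the duration as timedelta(hours=2) avoids having to remember that the second positional argument is in seconds.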

Secondly, I installed torch-fidelity==0.3.0, which is the newest version and the one pip installs by default. Its argument names were changed compared to the previous version (0.2.0).
https://github.com/saic-mdal/CIPS/blob/eadae6e45d8c1f3657faa88a065b59990747cd16/calculate_fid.py#L25
So it no longer takes positional arguments, which is why I got a TypeError here. If you installed 0.3.0, you should change the line to:

metrics_dict = calculate_metrics(input1=save_dir, input2=fid_dataset, cuda=True, isc=False, fid=True, kid=False, verbose=False)

or just install torch-fidelity==0.2.0. I tried the former, but I believe the latter will work as well.
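If the same script has to run against either version, one option is to branch on the installed version. This is only a sketch: uses_keyword_inputs and run_fid are hypothetical names, and the positional call in the older branch is an assumption inferred from the TypeError described above, not something checked against the 0.2.0 source.

```python
from importlib import metadata


def uses_keyword_inputs(version: str) -> bool:
    """Return True for torch-fidelity versions 0.3 and later."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (0, 3)


def run_fid(save_dir, fid_dataset):
    """Hypothetical wrapper around calculate_metrics for both versions."""
    # Imported lazily so the version check above can be used on its own.
    from torch_fidelity import calculate_metrics

    opts = dict(cuda=True, isc=False, fid=True, kid=False, verbose=False)
    if uses_keyword_inputs(metadata.version("torch-fidelity")):
        # 0.3.0 style, as in the corrected line above.
        return calculate_metrics(input1=save_dir, input2=fid_dataset, **opts)
    # Assumed older style: the two inputs passed positionally.
    return calculate_metrics(save_dir, fid_dataset, **opts)
```

Pinning torch-fidelity==0.2.0 in the requirements is the simpler route if only one environment needs to be supported.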
Hope my comments can help you.