Timed out when calculating FID
shlee625 opened this issue · 1 comments
Hi. First of all, thanks for your great work.
I have a problem when calculating FID during training. Every args.save_checkpoint_frequency iterations, the model is evaluated by computing the FID score, and during this phase a collective operation times out. Here is the error log.
[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802801 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802801 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 4414) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
Before this log appeared, the terminal showed no new output, but the GPUs were still busy. As I'm not very familiar with distributed training, I'd like to ask how to fix this problem. I've tried lowering args.fid_samples, but that didn't help. Thank you.
Best Wishes,
Lee
I solved this problem. For those who encounter the same problem, please refer to my comment.
First of all, it takes about one and a half hours to calculate FID, but the default timeout argument of torch.distributed.init_process_group is 30 minutes (the Timeout(ms)=1800000 in the log). That's why I got this error.
https://github.com/saic-mdal/CIPS/blob/eadae6e45d8c1f3657faa88a065b59990747cd16/train.py#L351
All you need to do is pass a longer timeout at the line above, e.g., timeout=datetime.timedelta(0, 7200) for 2 hours. Note that it takes a datetime.timedelta object, whose second positional argument is in seconds.
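For anyone who wants to see the change spelled out, here is a minimal sketch. The helper name init_distributed and the backend/init_method values are my assumptions, not copied from train.py; only the timeout argument is the actual fix. The keyword form timedelta(hours=2) is equivalent to the positional timedelta(0, 7200) above and is a bit harder to misread.

```python
import datetime

# The thread reports roughly 1.5 hours per FID evaluation, so 2 hours leaves some margin.
FID_TIMEOUT = datetime.timedelta(hours=2)


def init_distributed(backend="nccl", timeout=FID_TIMEOUT):
    """Hypothetical helper: start the process group with a longer collective timeout.

    The backend and init_method values are assumptions; keep whatever train.py
    already passes and only add the timeout keyword.
    """
    # Imported lazily so the timedelta part of this sketch runs without PyTorch.
    import torch.distributed as dist

    dist.init_process_group(backend=backend, init_method="env://", timeout=timeout)


# timedelta's positional arguments are (days, seconds), so both spellings agree.
assert datetime.timedelta(0, 7200) == FID_TIMEOUT
```

Pick a timeout comfortably longer than your own FID run; a value that is too large only delays the report of a real hang.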
Secondly, I installed torch-fidelity==0.3.0, which is the newest version and the one pip installs by default. However, its argument names differ from those of the previous version (0.2.0).
https://github.com/saic-mdal/CIPS/blob/eadae6e45d8c1f3657faa88a065b59990747cd16/calculate_fid.py#L25
It no longer takes any positional arguments, which is why I got a TypeError at that line. So, if you installed 0.3.0, you should change the line to:
metrics_dict = calculate_metrics(input1=save_dir, input2=fid_dataset, cuda=True, isc=False, fid=True, kid=False, verbose=False)
or just install torch-fidelity==0.2.0. I tried the former, but I believe the latter will work as well.
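If you go with the 0.3.0 route, one way to keep the keyword-only call in a single place is a small wrapper like the sketch below. The function names fid_kwargs and compute_fid are hypothetical; the keyword arguments are exactly the ones from the line above, and save_dir / fid_dataset stand for whatever calculate_fid.py already passes.

```python
def fid_kwargs(save_dir, fid_dataset):
    """Build the keyword arguments for torch-fidelity 0.3.0 (names taken from the fix above)."""
    return dict(
        input1=save_dir,     # directory with generated samples
        input2=fid_dataset,  # reference dataset
        cuda=True,
        isc=False,
        fid=True,            # only FID is requested
        kid=False,
        verbose=False,
    )


def compute_fid(save_dir, fid_dataset):
    """Hypothetical wrapper: call calculate_metrics with keywords only."""
    # Imported lazily so fid_kwargs can be inspected without torch-fidelity installed.
    from torch_fidelity import calculate_metrics

    return calculate_metrics(**fid_kwargs(save_dir, fid_dataset))
```

Passing everything by keyword avoids the TypeError from positional arguments under 0.3.0.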
Hope my comments can help you.