POSTECH-CVLab/PyTorch-StudioGAN

G/D Losses go NaN (Not A Number) in WGAN-GP training

rl-max opened this issue · 1 comment

Hi,

When training WGAN-GP on CIFAR10, both the discriminator and generator losses become NaN (Not A Number) after training has run for a while.
The problem occurs with the src/configs/CIFAR10/WGAN-GP.yaml config; I have not checked whether it also happens with other datasets.

+) One interesting detail: I think this always happens at the same point, with NaN first showing up in the log at step 1500 after the start of training.

==================== This is the command I ran ====================

!python src/main.py -t -hdf5 -l -metrics is fid prdc -ref test --num_workers 2 --save_freq 5000 \
    -cfg src/configs/CIFAR10/WGAN-GP.yaml \
    -data ./cifar10 -save . -mpc --post_resizer friendly --eval_backbone InceptionV3_tf

==================== This is the log I got ====================

 [INFO] 2022-11-02 23:50:13 > Step:    900 Progress: 0.9% Elapsed: 0:07:59 Gen_loss: -2.982 Dis_loss: -1.749 
 [INFO] 2022-11-02 23:51:06 > Step:   1000 Progress: 1.0% Elapsed: 0:08:52 Gen_loss: 12.56 Dis_loss: -1.571 
 [INFO] 2022-11-02 23:52:00 > Step:   1100 Progress: 1.1% Elapsed: 0:09:46 Gen_loss: 6.941 Dis_loss: -1.659 
 [INFO] 2022-11-02 23:52:53 > Step:   1200 Progress: 1.2% Elapsed: 0:10:38 Gen_loss: 6.988 Dis_loss: -3.02 
 [INFO] 2022-11-02 23:53:46 > Step:   1300 Progress: 1.3% Elapsed: 0:11:32 Gen_loss: 9.656 Dis_loss: -1.676 
 [INFO] 2022-11-02 23:54:38 > Step:   1400 Progress: 1.4% Elapsed: 0:12:24 Gen_loss: 4.777 Dis_loss: -1.606 
 [INFO] 2022-11-02 23:55:29 > Step:   1500 Progress: 1.5% Elapsed: 0:13:14 Gen_loss: nan Dis_loss: nan 
 [INFO] 2022-11-02 23:56:17 > Step:   1600 Progress: 1.6% Elapsed: 0:14:03 Gen_loss: nan Dis_loss: nan 
 [INFO] 2022-11-02 23:57:05 > Step:   1700 Progress: 1.7% Elapsed: 0:14:51 Gen_loss: nan Dis_loss: nan 
 [INFO] 2022-11-02 23:57:54 > Step:   1800 Progress: 1.8% Elapsed: 0:15:39 Gen_loss: nan Dis_loss: nan 
 [INFO] 2022-11-02 23:58:42 > Step:   1900 Progress: 1.9% Elapsed: 0:16:28 Gen_loss: nan Dis_loss: nan 
 [INFO] 2022-11-02 23:59:31 > Step:   2000 Progress: 2.0% Elapsed: 0:17:17 Gen_loss: nan Dis_loss: nan
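
Because the log prints one line every 100 steps, the first NaN could have appeared anywhere after step 1400 up to step 1500. The helper below is a minimal sketch for pinning down that window from a saved log. It is not part of StudioGAN: `LOG_LINE` and `first_nonfinite_step` are hypothetical names, and the regular expression assumes the exact line format shown above.

```python
import math
import re

# Assumed format: "... Step:   1500 ... Gen_loss: nan Dis_loss: nan"
LOG_LINE = re.compile(
    r"Step:\s*(?P<step>\d+).*?Gen_loss:\s*(?P<gen>\S+)\s+Dis_loss:\s*(?P<dis>\S+)"
)


def first_nonfinite_step(lines):
    """Return (last_finite_step, first_bad_step); either may be None."""
    last_ok = None
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not progress records
        step = int(match.group("step"))
        gen = float(match.group("gen"))  # float("nan") parses fine
        dis = float(match.group("dis"))
        if math.isfinite(gen) and math.isfinite(dis):
            last_ok = step
        else:
            return last_ok, step
    return last_ok, None


sample = [
    " [INFO] 2022-11-02 23:54:38 > Step:   1400 Progress: 1.4% Elapsed: 0:12:24 Gen_loss: 4.777 Dis_loss: -1.606 ",
    " [INFO] 2022-11-02 23:55:29 > Step:   1500 Progress: 1.5% Elapsed: 0:13:14 Gen_loss: nan Dis_loss: nan ",
]
print(first_nonfinite_step(sample))  # (1400, 1500)
```

Running it over logs from several runs would show whether the NaN really lands in the same 100-step window every time.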

The issue was resolved by removing the -mpc (mixed precision) option from the command.
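
The thread does not say why -mpc triggers the NaNs. One plausible mechanism, offered here as an assumption and not as a diagnosis, is 16-bit overflow. Half-precision floats top out at 65504, so a large intermediate value becomes inf, and later arithmetic such as inf - inf yields NaN. The sketch below reproduces only that arithmetic using the Python standard library. `to_fp16` is a hypothetical helper, and nothing here is taken from StudioGAN or PyTorch.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects values outside the fp16 range; fp16 hardware
        # arithmetic would produce an infinity instead.
        return math.copysign(math.inf, x)


largest = to_fp16(65504.0)      # largest finite half-precision value
overflowed = to_fp16(70000.0)   # does not fit in half precision
diff = overflowed - overflowed  # inf - inf is undefined

print(largest, overflowed, diff)  # 65504.0 inf nan
```

Once a NaN reaches the weights through an optimizer step, every later loss is NaN as well, which would match the log staying at nan from step 1500 onward.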