G/D Losses go NaN (Not A Number) in WGAN-GP training
rl-max opened this issue · 1 comment
rl-max commented
Hi,
I found that when training WGAN-GP on CIFAR10, the Discriminator and Generator losses always become NaN (Not A Number) after some time.
The problem occurred when I ran the src/configs/CIFAR10/WGAN-GP.yaml config; I did not check whether it also happens with other datasets.
+) One thing I find interesting is that (I think) this always happens at exactly 1500 steps after the start of training.
==================== This is the code I ran =======================
!python src/main.py -t -hdf5 -l -metrics is fid prdc -ref test --num_workers 2 --save_freq 5000 \
-cfg src/configs/CIFAR10/WGAN-GP.yaml \
-data ./cifar10 -save . -mpc --post_resizer friendly --eval_backbone InceptionV3_tf
==================== This is the log I got ====================
[INFO] 2022-11-02 23:50:13 > Step: 900 Progress: 0.9% Elapsed: 0:07:59 Gen_loss: -2.982 Dis_loss: -1.749
[INFO] 2022-11-02 23:51:06 > Step: 1000 Progress: 1.0% Elapsed: 0:08:52 Gen_loss: 12.56 Dis_loss: -1.571
[INFO] 2022-11-02 23:52:00 > Step: 1100 Progress: 1.1% Elapsed: 0:09:46 Gen_loss: 6.941 Dis_loss: -1.659
[INFO] 2022-11-02 23:52:53 > Step: 1200 Progress: 1.2% Elapsed: 0:10:38 Gen_loss: 6.988 Dis_loss: -3.02
[INFO] 2022-11-02 23:53:46 > Step: 1300 Progress: 1.3% Elapsed: 0:11:32 Gen_loss: 9.656 Dis_loss: -1.676
[INFO] 2022-11-02 23:54:38 > Step: 1400 Progress: 1.4% Elapsed: 0:12:24 Gen_loss: 4.777 Dis_loss: -1.606
[INFO] 2022-11-02 23:55:29 > Step: 1500 Progress: 1.5% Elapsed: 0:13:14 Gen_loss: nan Dis_loss: nan
[INFO] 2022-11-02 23:56:17 > Step: 1600 Progress: 1.6% Elapsed: 0:14:03 Gen_loss: nan Dis_loss: nan
[INFO] 2022-11-02 23:57:05 > Step: 1700 Progress: 1.7% Elapsed: 0:14:51 Gen_loss: nan Dis_loss: nan
[INFO] 2022-11-02 23:57:54 > Step: 1800 Progress: 1.8% Elapsed: 0:15:39 Gen_loss: nan Dis_loss: nan
[INFO] 2022-11-02 23:58:42 > Step: 1900 Progress: 1.9% Elapsed: 0:16:28 Gen_loss: nan Dis_loss: nan
[INFO] 2022-11-02 23:59:31 > Step: 2000 Progress: 2.0% Elapsed: 0:17:17 Gen_loss: nan Dis_loss: nan
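To confirm where the losses first break across repeated runs, the log can be scanned for the first non-finite value. The following is a minimal sketch, assuming log lines shaped like the ones above; `first_non_finite_step` and `LOG_PATTERN` are hypothetical names introduced here and are not part of the repository.

```python
import math
import re

# Matches lines shaped like the log above. The field names (Step, Gen_loss,
# Dis_loss) are copied from that log; the pattern itself is an assumption.
LOG_PATTERN = re.compile(
    r"Step:\s*(?P<step>\d+).*?Gen_loss:\s*(?P<gen>\S+)\s+Dis_loss:\s*(?P<dis>\S+)"
)


def first_non_finite_step(lines):
    """Return the first logged step whose G or D loss is NaN or inf, else None."""
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip lines that are not loss reports
        gen = float(match.group("gen"))  # float("nan") parses to NaN
        dis = float(match.group("dis"))
        if not (math.isfinite(gen) and math.isfinite(dis)):
            return int(match.group("step"))
    return None


sample = [
    "[INFO] 2022-11-02 23:54:38 > Step: 1400 Progress: 1.4% Elapsed: 0:12:24 Gen_loss: 4.777 Dis_loss: -1.606",
    "[INFO] 2022-11-02 23:55:29 > Step: 1500 Progress: 1.5% Elapsed: 0:13:14 Gen_loss: nan Dis_loss: nan",
]
print(first_non_finite_step(sample))
```

Because the log above reports only every 100 steps, a result of 1500 means the first NaN appeared somewhere after step 1400 and no later than step 1500, not necessarily at step 1500 itself.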
rl-max commented
The issue was resolved by removing the -mpc option.
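The thread does not state why removing -mpc helps. Assuming -mpc enables mixed-precision (float16) training in this repository, one plausible but unconfirmed explanation is numeric overflow: half precision cannot represent values above 65504, and once an intermediate value overflows to infinity, later arithmetic can produce NaN. The sketch below only illustrates that numeric limit with the Python standard library; `fits_in_fp16` is a hypothetical helper and says nothing about how the training code itself is implemented.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision (binary16) value


def fits_in_fp16(value):
    """Return True if `value` can be stored as a finite half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the binary16 format code
        return True
    except (OverflowError, struct.error):
        # Raised when the value is too large for half precision.
        return False


print(fits_in_fp16(FP16_MAX))  # within half-precision range
print(fits_in_fp16(1.0e5))     # exceeds half-precision range

# Once an overflow has produced infinities, ordinary arithmetic yields NaN.
print(float("inf") - float("inf"))
```

If this is the cause, training in full precision (as done here by dropping -mpc) avoids the overflow at the cost of more memory and slower steps.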