sczhou/CodeFormer

Why does the GAN loss train to NaN in stage 3, and why does the tensor of the network's output image become NaN?

create-li opened this issue · 6 comments

I modified the cross-layer connections of the networks in stages 2 and 3, and in stage 3 the GAN loss trains to NaN and the tensor of the network's output image becomes NaN.
In stages 2 and 3, the network_g modification is as follows: '512' is added to connect_list. The training loss and output tensor are normal in stage 2, but the problems appear in stage 3.
network_g:
  type: CodeFormer
  dim_embd: 512
  n_head: 8
  n_layers: 9
  codebook_size: 1024
  connect_list: ['32', '64', '128', '256', '512']
  fix_modules: ['quantize', 'generator']
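(For reference, one quick way to sanity-check a connect_list change like this before launching stage-3 training is a smoke test of the modified generator. The sketch below is only an illustration: it assumes the repo's codeformer_arch module, that the constructor keywords match the yml options above, and that forward(x, w=...) returns the restored image first, as the inference script uses it.)

```python
import torch
from basicsr.archs.codeformer_arch import CodeFormer  # arch module used by this repo

# Build network_g with the extra '512' skip connection and run one dummy forward pass.
net = CodeFormer(dim_embd=512, n_head=8, n_layers=9, codebook_size=1024,
                 connect_list=['32', '64', '128', '256', '512'],
                 fix_modules=['quantize', 'generator']).eval()

with torch.no_grad():
    x = torch.rand(1, 3, 512, 512)       # dummy 512x512 input (weights are random here)
    out = net(x, w=0.5, adain=True)[0]   # first return value is the output image
    print(out.shape, 'finite:', torch.isfinite(out).all().item())
```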
The training log is as follows:
, lr:(5.000e-05,)] [eta: 3 days, 19:31:13, time (data): 0.700 (0.005)] l_feat_encoder: 7.5822e-02 cross_entropy_loss: 2.4184e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:54:25,822 INFO: [20240..][epoch: 0, iter: 1,400, lr:(5.000e-05,)] [eta: 3 days, 19:14:07, time (data): 0.698 (0.005)] l_feat_encoder: 6.7985e-02 cross_entropy_loss: 2.2916e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:55:36,566 INFO: [20240..][epoch: 0, iter: 1,500, lr:(5.000e-05,)] [eta: 3 days, 19:00:36, time (data): 0.700 (0.005)] l_feat_encoder: 9.1210e-02 cross_entropy_loss: 2.6622e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:56:46,976 INFO: [20240..][epoch: 0, iter: 1,600, lr:(5.000e-05,)] [eta: 3 days, 18:47:02, time (data): 0.701 (0.005)] l_feat_encoder: 6.9488e-02 cross_entropy_loss: 2.2836e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:57:57,528 INFO: [20240..][epoch: 0, iter: 1,700, lr:(5.000e-05,)] [eta: 3 days, 18:35:34, time (data): 0.701 (0.005)] l_feat_encoder: 7.9443e-02 cross_entropy_loss: 2.5066e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:59:08,149 INFO: [20240..][epoch: 0, iter: 1,800, lr:(5.000e-05,)] [eta: 3 days, 18:25:32, time (data): 0.700 (0.005)] l_feat_encoder: 9.7638e-02 cross_entropy_loss: 2.6671e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:00:18,859 INFO: [20240..][epoch: 0, iter: 1,900, lr:(5.000e-05,)] [eta: 3 days, 18:16:46, time (data): 0.725 (0.005)] l_feat_encoder: 9.8836e-02 cross_entropy_loss: 2.6258e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:01:29,411 INFO: [20240..][epoch: 0, iter: 2,000, lr:(5.000e-05,)] [eta: 3 days, 18:08:11, time (data): 0.703 (0.006)] l_feat_encoder: 7.3431e-02 cross_entropy_loss: 2.4257e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:02:40,124 INFO: [20240..][epoch: 0, iter: 2,100, lr:(5.000e-05,)] [eta: 3 days, 18:00:52, time (data): 0.716 (0.006)] l_feat_encoder: 6.9448e-02 cross_entropy_loss: 2.3052e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:03:50,985 INFO: [20240..][epoch: 0, iter: 2,200, lr:(5.000e-05,)] [eta: 3 days, 17:54:37, time (data): 0.706 (0.007)] l_feat_encoder: 6.8970e-02 cross_entropy_loss: 2.2546e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:05:01,672 INFO: [20240..][epoch: 0, iter: 2,300, lr:(5.000e-05,)] [eta: 3 days, 17:48:14, time (data): 0.701 (0.005)] l_feat_encoder: 8.1416e-02 cross_entropy_loss: 2.3619e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:06:12,473 INFO: [20240..][epoch: 0, iter: 2,400, lr:(5.000e-05,)] [eta: 3 days, 17:42:39, time (data): 0.705 (0.006)] l_feat_encoder: 1.1806e-01 cross_entropy_loss: 2.8084e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:07:23,324 INFO: [20240..][epoch: 0, iter: 2,500, lr:(5.000e-05,)] [eta: 3 days, 17:37:34, time (data): 0.729 (0.005)] l_feat_encoder: 9.9321e-02 cross_entropy_loss: 2.8572e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:08:34,325 INFO: [20240..][epoch: 0, iter: 2,600, lr:(5.000e-05,)] [eta: 3 days, 17:33:12, time (data): 0.702 (0.005)] l_feat_encoder: 7.6522e-02 cross_entropy_loss: 2.3831e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:09:45,171 INFO: [20240..][epoch: 0, iter: 2,700, lr:(5.000e-05,)] [eta: 3 days, 17:28:39, time (data): 0.708 (0.005)] l_feat_encoder: 1.1199e-01 cross_entropy_loss: 2.7737e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:10:55,931 INFO: [20240..][epoch: 0, iter: 2,800, lr:(5.000e-05,)] [eta: 3 days, 17:24:07, time (data): 0.699 (0.005)] l_feat_encoder: 8.9060e-02 cross_entropy_loss: 2.6272e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:12:06,759 INFO: [20240..][epoch: 0, iter: 2,900, lr:(5.000e-05,)] [eta: 3 days, 17:19:59, time (data): 0.708 (0.008)] l_feat_encoder: 1.1682e-01 cross_entropy_loss: 2.6958e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:13:17,653 INFO: [20240..][epoch: 0, iter: 3,000, lr:(5.000e-05,)] [eta: 3 days, 17:16:13, time (data): 0.730 (0.005)] l_feat_encoder: 8.0471e-02 cross_entropy_loss: 2.4796e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:14:28,445 INFO: [20240..][epoch: 0, iter: 3,100, lr:(5.000e-05,)] [eta: 3 days, 17:12:22, time (data): 0.704 (0.006)] l_feat_encoder: 7.5083e-02 cross_entropy_loss: 2.2212e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:15:39,388 INFO: [20240..][epoch: 0, iter: 3,200, lr:(5.000e-05,)] [eta: 3 days, 17:09:02, time (data): 0.765 (0.006)] l_feat_encoder: 8.9087e-02 cross_entropy_loss: 2.5269e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:16:50,238 INFO: [20240..][epoch: 0, iter: 3,300, lr:(4.999e-05,)] [eta: 3 days, 17:05:37, time (data): 0.701 (0.005)] l_feat_encoder: 9.0204e-02 cross_entropy_loss: 2.6163e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:18:01,015 INFO: [20240..][epoch: 0, iter: 3,400, lr:(4.999e-05,)] [eta: 3 days, 17:02:11, time (data): 0.708 (0.006)] l_feat_encoder: 8.1606e-02 cross_entropy_loss: 2.2606e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:19:12,007 INFO: [20240..][epoch: 0, iter: 3,500, lr:(4.999e-05,)] [eta: 3 days, 16:59:20, time (data): 0.712 (0.006)] l_feat_encoder: 1.1253e-01 cross_entropy_loss: 2.5051e+00 l_g_pix: nan l_g_percep: nan l_identity: nan

I get the same NaN pix loss and perceptual loss when I train my 1024-resolution model, where I need to connect the 512-size feature in stage 3.

I found that the first NaN appears after optimizer.step(), not in the forward function and not in loss.backward(), so maybe the learning rate is too large and simply needs to be reduced (e.g., setting it to 5e-6 solved the problem for me).

You can check where the first NaN appears in the codeformer_joint_model.py file:
[Screenshot from 2024-04-11 18-10-36]

As for the training loss and output tensor being normal in stage 2: in my case it is because Fuse_sft_block produces NaN gradients, and the stage-2 network architecture has no such block. Maybe you can check whether the same is true for you.
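(If it helps anyone reproducing this, the check above can be scripted around the generator update, roughly like the sketch below. It assumes a basicsr-style optimize_parameters with self.net_g, self.optimizer_g and an accumulated l_g_total, as in codeformer_joint_model.py; the clipping value is only an example, not something the authors use.)

```python
import torch

def step_with_nan_checks(net_g, optimizer_g, l_g_total, max_norm=1.0):
    """Backward + step for the generator, reporting where non-finite values first appear."""
    l_g_total.backward()

    # 1) Non-finite gradients (e.g. coming out of the Fuse_sft_block path) will corrupt
    #    the weights on the next step even while the logged losses still look normal.
    for name, p in net_g.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f'non-finite gradient in {name}')

    # 2) Optional: clip gradients so a single bad batch cannot blow up the weights.
    torch.nn.utils.clip_grad_norm_(net_g.parameters(), max_norm=max_norm)

    optimizer_g.step()

    # 3) If NaN first shows up in the weights here, the update itself (learning rate times
    #    gradient magnitude) is the culprit, consistent with a lower lr fixing the problem.
    for name, p in net_g.named_parameters():
        if not torch.isfinite(p).all():
            print(f'non-finite weight in {name} after optimizer.step()')

# usage inside optimize_parameters(): replace the plain backward/step pair with
# step_with_nan_checks(self.net_g, self.optimizer_g, l_g_total)
```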


Hi, may I ask what GPU you used when training at 1024 resolution in stage 3? I have an NVIDIA 4090 with 24 GB but got CUDA out of memory when I tried to train the 1024-resolution model in stage 3. I kept the number of GPUs at 8 and set the batch size to 1, and it still doesn't work. The connect_list is ['64', '128', '256', '512']. Any suggestions would be much appreciated.


Yeah, I use a 1080 Ti to train the model. The official network architecture needs a lot of memory with 1024x1024 input at training time, so I compressed the architecture to train my model, though it will lose some detail in the restored face.
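(A generic way to compare how much a compressed variant actually saves, before a full training launch, is to measure peak GPU memory for one dummy training step. This is plain PyTorch rather than code from this repo, and build_modified_net() below is a placeholder for whatever 1024-capable architecture you construct.)

```python
import torch

def peak_step_memory_gib(net, size=1024, device='cuda'):
    """Run one dummy forward+backward pass and return peak GPU memory in GiB."""
    net = net.to(device).train()
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.rand(1, 3, size, size, device=device)
    out = net(x)
    out = out[0] if isinstance(out, (tuple, list)) else out  # take the image if a tuple is returned
    out.float().mean().backward()
    return torch.cuda.max_memory_allocated(device) / 1024**3

# e.g. print(peak_step_memory_gib(build_modified_net()))  # placeholder builder for your arch
```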


Yes, I find the 1024-resolution model loses some detail too after I modified the architecture to complete the training. It is hard to balance the memory problem against the restoration fidelity. Anyway, thanks for your reply :)

#367 (comment)
@SherryXieYuchen
I tried setting the learning rate to 5e-7, but l_g_gan, l_d_real, l_d_fake, etc. still became 0.
When I set the learning rate to 5e-8, the losses are no longer 0 or NaN, but the network barely updates, and the output image is brown when w=1, as shown in the figure below.
[image]

Yes, it is indeed a problem with Fuse_sft_block, but I don't know how to modify the fusion network.
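(One way to confirm which fusion block emits the first bad gradient, rather than only seeing NaN in the losses afterwards, is a backward hook on each fuse module. This is a debugging sketch that assumes the generator keeps the fuse blocks in a fuse_convs_dict ModuleDict keyed by feature size, as in the repo's codeformer_arch.py.)

```python
import torch

def watch_fuse_grads(net_g):
    """Log the first non-finite gradient flowing out of any Fuse_sft_block."""
    net_g = getattr(net_g, 'module', net_g)  # unwrap DataParallel/DDP if needed

    def make_hook(name):
        def hook(module, grad_input, grad_output):
            for g in grad_output:
                if g is not None and not torch.isfinite(g).all():
                    print(f'non-finite grad_output at fuse block {name}')
        return hook

    # one Fuse_sft_block per connect_list entry ('32' ... '512')
    for name, module in net_g.fuse_convs_dict.items():
        module.register_full_backward_hook(make_hook(name))

# call once after building the model, e.g. watch_fuse_grads(self.net_g)
```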


I haven't seen this output-image problem before, although setting the learning rate too small could indeed cause the network to barely update.