sczhou/CodeFormer

Why does the GAN loss train to NaN in stage 3, and why does the tensor of the network's output image become NaN?

create-li opened this issue · 6 comments

I modified the cross-layer connections of the networks in stages 2 and 3, and in stage 3 the GAN loss trains to NaN and the tensor of the network's output image becomes NaN.
In stages 2 and 3, the network_g modification is as follows: '512' is added to connect_list. The training loss and output tensor are normal in stage 2, but the problems appear in stage 3.
network_g:
  type: CodeFormer
  dim_embd: 512
  n_head: 8
  n_layers: 9
  codebook_size: 1024
  connect_list: ['32', '64', '128', '256', '512']
  fix_modules: ['quantize', 'generator']
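(For reference, one quick way to sanity-check a connect_list change like this before launching stage-3 training is a smoke test of the modified generator. The sketch below is only an illustration: it assumes the repo's codeformer_arch module, that the constructor keywords match the yml options above, and that forward(x, w=...) returns the restored image first, as the inference script uses it.)

```python
import torch
from basicsr.archs.codeformer_arch import CodeFormer  # arch module used by this repo

# Build network_g with the extra '512' skip connection and run one dummy forward pass.
net = CodeFormer(dim_embd=512, n_head=8, n_layers=9, codebook_size=1024,
                 connect_list=['32', '64', '128', '256', '512'],
                 fix_modules=['quantize', 'generator']).eval()

with torch.no_grad():
    x = torch.rand(1, 3, 512, 512)       # dummy 512x512 input (weights are random here)
    out = net(x, w=0.5, adain=True)[0]   # first return value is the output image
    print(out.shape, 'finite:', torch.isfinite(out).all().item())
```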
The training log is as follows:
, lr:(5.000e-05,)] [eta: 3 days, 19:31:13, time (data): 0.700 (0.005)] l_feat_encoder: 7.5822e-02 cross_entropy_loss: 2.4184e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:54:25,822 INFO: [20240..][epoch: 0, iter: 1,400, lr:(5.000e-05,)] [eta: 3 days, 19:14:07, time (data): 0.698 (0.005)] l_feat_encoder: 6.7985e-02 cross_entropy_loss: 2.2916e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:55:36,566 INFO: [20240..][epoch: 0, iter: 1,500, lr:(5.000e-05,)] [eta: 3 days, 19:00:36, time (data): 0.700 (0.005)] l_feat_encoder: 9.1210e-02 cross_entropy_loss: 2.6622e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:56:46,976 INFO: [20240..][epoch: 0, iter: 1,600, lr:(5.000e-05,)] [eta: 3 days, 18:47:02, time (data): 0.701 (0.005)] l_feat_encoder: 6.9488e-02 cross_entropy_loss: 2.2836e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:57:57,528 INFO: [20240..][epoch: 0, iter: 1,700, lr:(5.000e-05,)] [eta: 3 days, 18:35:34, time (data): 0.701 (0.005)] l_feat_encoder: 7.9443e-02 cross_entropy_loss: 2.5066e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 20:59:08,149 INFO: [20240..][epoch: 0, iter: 1,800, lr:(5.000e-05,)] [eta: 3 days, 18:25:32, time (data): 0.700 (0.005)] l_feat_encoder: 9.7638e-02 cross_entropy_loss: 2.6671e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:00:18,859 INFO: [20240..][epoch: 0, iter: 1,900, lr:(5.000e-05,)] [eta: 3 days, 18:16:46, time (data): 0.725 (0.005)] l_feat_encoder: 9.8836e-02 cross_entropy_loss: 2.6258e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:01:29,411 INFO: [20240..][epoch: 0, iter: 2,000, lr:(5.000e-05,)] [eta: 3 days, 18:08:11, time (data): 0.703 (0.006)] l_feat_encoder: 7.3431e-02 cross_entropy_loss: 2.4257e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:02:40,124 INFO: [20240..][epoch: 0, iter: 2,100, lr:(5.000e-05,)] [eta: 3 days, 18:00:52, time (data): 0.716 (0.006)] l_feat_encoder: 6.9448e-02 cross_entropy_loss: 2.3052e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:03:50,985 INFO: [20240..][epoch: 0, iter: 2,200, lr:(5.000e-05,)] [eta: 3 days, 17:54:37, time (data): 0.706 (0.007)] l_feat_encoder: 6.8970e-02 cross_entropy_loss: 2.2546e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:05:01,672 INFO: [20240..][epoch: 0, iter: 2,300, lr:(5.000e-05,)] [eta: 3 days, 17:48:14, time (data): 0.701 (0.005)] l_feat_encoder: 8.1416e-02 cross_entropy_loss: 2.3619e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:06:12,473 INFO: [20240..][epoch: 0, iter: 2,400, lr:(5.000e-05,)] [eta: 3 days, 17:42:39, time (data): 0.705 (0.006)] l_feat_encoder: 1.1806e-01 cross_entropy_loss: 2.8084e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:07:23,324 INFO: [20240..][epoch: 0, iter: 2,500, lr:(5.000e-05,)] [eta: 3 days, 17:37:34, time (data): 0.729 (0.005)] l_feat_encoder: 9.9321e-02 cross_entropy_loss: 2.8572e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:08:34,325 INFO: [20240..][epoch: 0, iter: 2,600, lr:(5.000e-05,)] [eta: 3 days, 17:33:12, time (data): 0.702 (0.005)] l_feat_encoder: 7.6522e-02 cross_entropy_loss: 2.3831e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:09:45,171 INFO: [20240..][epoch: 0, iter: 2,700, lr:(5.000e-05,)] [eta: 3 days, 17:28:39, time (data): 0.708 (0.005)] l_feat_encoder: 1.1199e-01 cross_entropy_loss: 2.7737e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:10:55,931 INFO: [20240..][epoch: 0, iter: 2,800, lr:(5.000e-05,)] [eta: 3 days, 17:24:07, time (data): 0.699 (0.005)] l_feat_encoder: 8.9060e-02 cross_entropy_loss: 2.6272e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:12:06,759 INFO: [20240..][epoch: 0, iter: 2,900, lr:(5.000e-05,)] [eta: 3 days, 17:19:59, time (data): 0.708 (0.008)] l_feat_encoder: 1.1682e-01 cross_entropy_loss: 2.6958e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:13:17,653 INFO: [20240..][epoch: 0, iter: 3,000, lr:(5.000e-05,)] [eta: 3 days, 17:16:13, time (data): 0.730 (0.005)] l_feat_encoder: 8.0471e-02 cross_entropy_loss: 2.4796e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:14:28,445 INFO: [20240..][epoch: 0, iter: 3,100, lr:(5.000e-05,)] [eta: 3 days, 17:12:22, time (data): 0.704 (0.006)] l_feat_encoder: 7.5083e-02 cross_entropy_loss: 2.2212e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:15:39,388 INFO: [20240..][epoch: 0, iter: 3,200, lr:(5.000e-05,)] [eta: 3 days, 17:09:02, time (data): 0.765 (0.006)] l_feat_encoder: 8.9087e-02 cross_entropy_loss: 2.5269e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:16:50,238 INFO: [20240..][epoch: 0, iter: 3,300, lr:(4.999e-05,)] [eta: 3 days, 17:05:37, time (data): 0.701 (0.005)] l_feat_encoder: 9.0204e-02 cross_entropy_loss: 2.6163e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:18:01,015 INFO: [20240..][epoch: 0, iter: 3,400, lr:(4.999e-05,)] [eta: 3 days, 17:02:11, time (data): 0.708 (0.006)] l_feat_encoder: 8.1606e-02 cross_entropy_loss: 2.2606e+00 l_g_pix: nan l_g_percep: nan l_identity: nan
2024-04-03 21:19:12,007 INFO: [20240..][epoch: 0, iter: 3,500, lr:(4.999e-05,)] [eta: 3 days, 16:59:20, time (data): 0.712 (0.006)] l_feat_encoder: 1.1253e-01 cross_entropy_loss: 2.5051e+00 l_g_pix: nan l_g_percep: nan l_identity: nan

I get the same NaN pix loss and perceptual loss when I train my 1024-resolution model, where I need to connect the 512-size feature in stage 3.

I found that the first NaN appears after optimizer.step(), not in the forward function and not in loss.backward(), so maybe the learning rate is too large and simply needs to be reduced (e.g., setting it to 5e-6 solved the problem for me).

You can check where the first NaN appears in the codeformer_joint_model.py file:
[Screenshot from 2024-04-11 18-10-36]

As for the training loss and output tensor being normal in stage 2: in my case it is because Fuse_sft_block produces NaN gradients, and the stage-2 network architecture has no such block. Maybe you can check whether the same is true for you.
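(If it helps anyone reproducing this, the check above can be scripted around the generator update, roughly like the sketch below. It assumes a basicsr-style optimize_parameters with self.net_g, self.optimizer_g and an accumulated l_g_total, as in codeformer_joint_model.py; the clipping value is only an example, not something the authors use.)

```python
import torch

def step_with_nan_checks(net_g, optimizer_g, l_g_total, max_norm=1.0):
    """Backward + step for the generator, reporting where non-finite values first appear."""
    l_g_total.backward()

    # 1) Non-finite gradients (e.g. coming out of the Fuse_sft_block path) will corrupt
    #    the weights on the next step even while the logged losses still look normal.
    for name, p in net_g.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f'non-finite gradient in {name}')

    # 2) Optional: clip gradients so a single bad batch cannot blow up the weights.
    torch.nn.utils.clip_grad_norm_(net_g.parameters(), max_norm=max_norm)

    optimizer_g.step()

    # 3) If NaN first shows up in the weights here, the update itself (learning rate times
    #    gradient magnitude) is the culprit, consistent with a lower lr fixing the problem.
    for name, p in net_g.named_parameters():
        if not torch.isfinite(p).all():
            print(f'non-finite weight in {name} after optimizer.step()')

# usage inside optimize_parameters(): replace the plain backward/step pair with
# step_with_nan_checks(self.net_g, self.optimizer_g, l_g_total)
```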


Hi, may I ask what GPU you used when training at 1024 resolution in stage 3? I have an NVIDIA 4090 with 24 GB but got CUDA out of memory when I tried to train the 1024-resolution model in stage 3. I kept the number of GPUs at 8 and set the batch size to 1, and it still doesn't work. The connect_list is ['64', '128', '256', '512']. Any suggestions would be much appreciated.


Yeah, I use a 1080 Ti to train the model. The official network architecture needs a lot of memory with 1024x1024 input at training time, so I compressed the architecture to train my model, though it will lose some detail in the restored face.
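(A generic way to compare how much a compressed variant actually saves, before a full training launch, is to measure peak GPU memory for one dummy training step. This is plain PyTorch rather than code from this repo, and build_modified_net() below is a placeholder for whatever 1024-capable architecture you construct.)

```python
import torch

def peak_step_memory_gib(net, size=1024, device='cuda'):
    """Run one dummy forward+backward pass and return peak GPU memory in GiB."""
    net = net.to(device).train()
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.rand(1, 3, size, size, device=device)
    out = net(x)
    out = out[0] if isinstance(out, (tuple, list)) else out  # take the image if a tuple is returned
    out.float().mean().backward()
    return torch.cuda.max_memory_allocated(device) / 1024**3

# e.g. print(peak_step_memory_gib(build_modified_net()))  # placeholder builder for your arch
```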


Yes, I find the 1024-resolution model loses some detail too after I modified the architecture to complete the training. It is hard to balance the memory problem against the restoration fidelity. Anyway, thanks for your reply :)

#367 (comment)
@SherryXieYuchen
I tried setting the learning rate to 5e-7, but l_g_gan, l_d_real, l_d_fake, etc. still became 0.
When I set the learning rate to 5e-8, the losses are no longer 0 or NaN, but the network barely updates, and the output image is brown when w=1, as shown in the figure below.
[image]

Yes, it is indeed a problem with Fuse_sft_block, but I don't know how to modify the fusion network.
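(One way to confirm which fusion block emits the first bad gradient, rather than only seeing NaN in the losses afterwards, is a backward hook on each fuse module. This is a debugging sketch that assumes the generator keeps the fuse blocks in a fuse_convs_dict ModuleDict keyed by feature size, as in the repo's codeformer_arch.py.)

```python
import torch

def watch_fuse_grads(net_g):
    """Log the first non-finite gradient flowing out of any Fuse_sft_block."""
    net_g = getattr(net_g, 'module', net_g)  # unwrap DataParallel/DDP if needed

    def make_hook(name):
        def hook(module, grad_input, grad_output):
            for g in grad_output:
                if g is not None and not torch.isfinite(g).all():
                    print(f'non-finite grad_output at fuse block {name}')
        return hook

    # one Fuse_sft_block per connect_list entry ('32' ... '512')
    for name, module in net_g.fuse_convs_dict.items():
        module.register_full_backward_hook(make_hook(name))

# call once after building the model, e.g. watch_fuse_grads(self.net_g)
```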


I haven't seen this output-image problem before, although setting the learning rate too small could indeed cause the network to barely update.