Loss is NaN while training with --fp16 (a lot of times)
ylwhxht opened this issue · 10 comments
I use:
batchsize_pergpu = 4, gpus = 2
lr = 0.002
I know that "If you encounter a gradient that becomes NaN during fp16 training, don't worry, it's normal. You can try a few more times."
But every time, after a short training period (within 100 iterations), the loss becomes NaN, and I have made sure to try many times (dozens or even over a hundred). I only modified the path in the configuration file, and I made sure that no training session loaded a previous last_model whose loss was NaN.
It always shows:
epochs: 0%| | 0/24 [00:25<?, ?it/s, loss_hm=nan, loss_loc=nan, loss=nan, lr=2e-5, d_time=0.00(0.02), f_time=0.68(0.70), b_tWARNING:tensorboardX.x2num:NaN or Inf found in input tensor. | 33/19761 [00:23<3:55:35, 1.40it/s, total_it=33]
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
And this is the log:
https://paste.imlgw.top/2277
First, you can consider the pillar setting, which rarely encounters this problem. By the way, lr = 0.003 is used with a batch_size of 3 and 8 GPUs. You may try experimenting with an lr of 0.001.
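One way to read that suggestion is the linear LR scaling rule: keep the learning rate proportional to the total (all-GPU) batch size. A minimal sketch of the arithmetic, assuming the reference setup above; the helper name is illustrative and not part of the repo:

```python
# Hypothetical helper illustrating the linear LR scaling rule; not from DSVT/OpenPCDet.
def scale_lr(base_lr: float, base_total_batch: int, new_total_batch: int) -> float:
    """Scale the learning rate linearly with the total (all-GPU) batch size."""
    return base_lr * new_total_batch / base_total_batch

# Reference: lr = 0.003 at batch_size 3 x 8 GPUs = 24 samples per step.
# This issue: batch_size 4 x 2 GPUs = 8 samples per step.
print(scale_lr(0.003, 24, 8))  # -> 0.001, matching the suggested value
```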
FP16 has a lot of training instability issues. If you want to reduce CUDA memory, you can also use torch checkpoint, which will cut memory consumption by about 50%.
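For reference, a minimal sketch of gradient checkpointing with torch.utils.checkpoint, applied block by block rather than to the whole model; the Backbone module here is a stand-in, not DSVT's own code:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Backbone(nn.Module):
    """Stand-in for a stack of transformer-like blocks (illustrative only)."""
    def __init__(self, dim=192, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_blocks)
        )

    def forward(self, x):
        for block in self.blocks:
            # Recompute this block's activations during backward instead of
            # storing them, trading extra compute for lower peak memory.
            x = checkpoint(block, x, use_reentrant=False)  # use_reentrant needs PyTorch >= 1.11
        return x

x = torch.randn(8, 192, requires_grad=True)
Backbone()(x).sum().backward()
```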
I checked your log, and I noticed that you changed a few things, like LOSS_SCALE_FP16. I recommend you use the standard pillar config.
Sorry, the uploaded log does indeed have a modified 'LOSS_SCALE_FP16'. Since I have been unable to train with the default value of 32, I wanted to try changing 'LOSS_SCALE_FP16' to see whether it would solve the problem. But from the beginning I used the standard pillar config, and it didn't work either.
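For context, in a plain PyTorch AMP loop this kind of setting usually corresponds to the initial scale of torch.cuda.amp.GradScaler, which skips optimizer steps and lowers the scale whenever it sees inf/NaN gradients. A hedged sketch, assuming LOSS_SCALE_FP16 maps to init_scale (the repo's trainer may wire it differently):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 1).to(device)                  # stand-in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Assumption: LOSS_SCALE_FP16 = 32 plays the role of init_scale here.
scaler = torch.cuda.amp.GradScaler(init_scale=32.0, enabled=(device == "cuda"))

for _ in range(10):
    x = torch.randn(4, 16, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped if inf/NaN gradients are detected
    scaler.update()          # scale is backed off after a skip, grown slowly otherwise
print(scaler.get_scale())
```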
Have you solved it?
Emmmm, no. I gave up on fp16 and tried another approach, "setting the DSVT dimension to 128", to train with batch_size=2 on a 3090.
I recommend you use torch checkpoint, which will reduce memory consumption by about 50%. If you need our support, we can add instructions on how to use torch checkpoint.
Thank you very much for your kind reply!!
I just tried it and realized that directly applying "checkpoint(model, batch_dict)" is not feasible. Of course, I am not familiar with it and may need some support from you.
But I noticed that it trades time for GPU memory. In fact, part of the reason I wanted to increase the batch size was to speed up training (with the standard cfg, without setting dimension 128, the batch size can only be set to 1, GPU memory usage stays low at about 14G/24G, and training is slower).
I would like to know whether "setting DSVT to dimension 128" has a significant impact on the results, because it is simple and effectively increases the batch size, thereby accelerating training.
Yes, you can try dim128 first, which will save about 30% of GPU memory. We will add torch.utils.checkpoint later. If you are in a hurry, you can refer to the official implementation of DSVT in PCDet (Line 128 in DSVT.py).
Okay, I got it, thank you very much!