foolwood/SiamMask

`WARNING:root:NaN or Inf found in input tensor` problem

wangzhiwei-python opened this issue · 12 comments

INFO:global:Progress: 110 / 83320 [0%], Speed: 1.173 s/iter, ETA 1:03:07 (D:H:M)

WARNING:root:NaN or Inf found in input tensor. (repeated 10×)
[2020-08-02 19:30:50,046-rk0-train_siammask_refine.py#290] Epoch: [1][120/4166] lr: 0.010000 batch_time: 0.435500 (1.143975) data_time: 0.000047 (0.692028) rpn_cls_loss: 0.226122 (0.179825) rpn_loc_loss: 0.189315 (0.190289) rpn_mask_loss: inf (inf) siammask_loss: inf (inf) mask_iou_mean: 0.000000 (0.000000) mask_iou_at_5: 0.000000 (0.000000) mask_iou_at_7: 0.000000 (0.000000)
INFO:global:Epoch: [1][120/4166] lr: 0.010000 batch_time: 0.435500 (1.143975) data_time: 0.000047 (0.692028) rpn_cls_loss: 0.226122 (0.179825) rpn_loc_loss: 0.189315 (0.190289) rpn_mask_loss: inf (inf) siammask_loss: inf (inf) mask_iou_mean: 0.000000 (0.000000) mask_iou_at_5: 0.000000 (0.000000) mask_iou_at_7: 0.000000 (0.000000)
[2020-08-02 19:30:50,046-rk0-log_helper.py# 97] Progress: 120 / 83320 [0%], Speed: 1.144 s/iter, ETA 1:02:26 (D:H:M)

INFO:global:Progress: 120 / 83320 [0%], Speed: 1.144 s/iter, ETA 1:02:26 (D:H:M)

WARNING:root:NaN or Inf found in input tensor. (repeated 9×)
[2020-08-02 19:31:00,345-rk0-train_siammask_refine.py#290] Epoch: [1][130/4166] lr: 0.010000 batch_time: 0.420345 (1.135102) data_time: 0.000030 (0.685210) rpn_cls_loss: 0.094134 (0.179464) rpn_loc_loss: 0.174147 (0.190233) rpn_mask_loss: inf (inf) siammask_loss: inf (inf) mask_iou_mean: 0.000000 (0.000000) mask_iou_at_5: 0.000000 (0.000000) mask_iou_at_7: 0.000000 (0.000000)

When I train the refine model, the problem above occurs. Even after I lower the lr from 0.01 to 0.001, the warning still appears after the third epoch. This is my loss curve. Can anyone help me solve this problem? Thanks!
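For context, this warning usually comes from the logging side rather than the model: the `tensorboard_logger`/`tensorboardX` scalar logger prints "NaN or Inf found in input tensor." whenever it is asked to log a non-finite value, which is exactly what happens once `rpn_mask_loss` goes to `inf`. A minimal sketch of that check, plus a guard one could use to skip the optimizer step for a bad batch (`log_value_guard` and `safe_step` are hypothetical helpers, not SiamMask code):

```python
import math

def log_value_guard(name, value):
    """Mimic the logger-side check that produces the warning above:
    refuse to log a non-finite scalar (hypothetical re-implementation)."""
    if not math.isfinite(value):
        print("WARNING:root:NaN or Inf found in input tensor.")
        return False
    return True

def safe_step(losses):
    """Return True only if every loss term is finite; in a training
    loop one would skip backward()/step() for this batch otherwise."""
    return all(math.isfinite(l) for l in losses)
```

Skipping the step only hides the symptom, but it can help confirm whether the `inf` comes from a few bad samples or from genuine divergence.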
[screenshot: loss curve (sendpix4.jpg, local path; image not available)]

@wangzhiwei-python Hi, I met the same problem. Can you tell me how you solved it? Thank you.

I am facing the same problem.


Hi, Did you solve this problem?


I ran into the same problem, I think.
First, I trained a SiamMask base model and got a best snapshot with accuracy 0.652 and robustness 0.308 on VOT-2016, which is lower than the accuracy reported in the paper, but I think that is good enough.
Then I used this weight as the pretrained weight to train a SiamMask_refine model, but got infinite siammask_loss and rpn_mask_loss in the early epochs. I kept reducing the learning rate from 0.01 (first to 0.001) and stopped at 0.000125, and that worked.
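The workaround above (shrinking the learning rate until the mask losses stay finite) can be sketched as a small helper that halves the lr whenever a non-finite loss shows up, with a floor near the value that worked here. `adjust_lr_on_nonfinite` is a hypothetical name, not part of the SiamMask training scripts:

```python
import math

def adjust_lr_on_nonfinite(loss, lr, factor=0.5, min_lr=1.25e-4):
    """Shrink the learning rate by `factor` when `loss` is NaN/Inf,
    never going below `min_lr` (hypothetical helper, not SiamMask code)."""
    if not math.isfinite(loss):
        return max(lr * factor, min_lr)
    return lr
```

In a real training loop one would also restart from the last finite checkpoint after lowering the lr, since a single `inf` backward pass can already corrupt the weights.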

@nanowhiter Hello, what version of torch do you have installed?

@xiaofengBian I use PyTorch 1.5.0.

@nanowhiter Thank you for your timely reply. My training code still can't run. Could you send me your revised training code? My email is 945414538@qq.com. Thank you very much.

@nanowhiter Or could you share a contact so we can discuss it? This problem has been bothering me for a long time. I hope you can help me.