facebookresearch/adaptive_teacher

FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

ouyang11111 opened this issue · 2 comments

The foggy (Cityscapes → Foggy Cityscapes) config trains fine, but when I run VOC → Clipart, my command is: `CUDA_VISIBLE_DEVICES=0,1,2,3 python train_net.py --num-gpus 4 --config configs/faster_rcnn_R101_cross_clipart_b4.yaml OUTPUT_DIR output/exp_clipart_test`
Many of the suggested solutions are to downgrade detectron2, but my compute platform/CUDA does not support an older detectron2 version.
Training crashes right at the beginning (the first iteration), with a very high loss.
Has anybody met the same issue or solved this problem?
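
For context, this error message is raised by detectron2's RPN proposal filtering, which aborts training as soon as any predicted box or objectness score becomes non-finite. A minimal sketch of that check (paraphrased from detectron2's `proposal_utils.py`; the exact code differs slightly across versions):

```python
import torch

# Paraphrase of the finiteness check in detectron2's
# modeling/proposal_generator/proposal_utils.py (version-dependent):
boxes = torch.tensor([[0.0, 0.0, 10.0, float("nan")]])  # one diverged box
scores = torch.tensor([0.9])

valid_mask = torch.isfinite(boxes).all(dim=1) & torch.isfinite(scores)
if not valid_mask.all():
    raise FloatingPointError(
        "Predicted boxes or scores contain Inf/NaN. Training has diverged."
    )
```

So the exception is just the messenger: the real problem is that the loss (and hence the predictions) blew up in the first iterations.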

I have tried lowering the loss weights, e.g. LOSS_WEIGHT: 0.01, BBOX_REG_LOSS_WEIGHT: 0.005,
CONTRASTIVE_LOSS_WEIGHT: 0.05, and WEIGHT_DECAY: 0.0001,
but the classification loss is still extremely high: loss_cls: 3.153e+05.
How can I fix this?
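
Besides the loss weights, I am also considering lowering the base learning rate and enabling gradient clipping, which are common mitigations for early divergence. A sketch using detectron2's standard solver keys (adaptive_teacher extends the base config, so in practice these would go in the project's YAML or as command-line overrides to train_net.py; the values are starting points, not a verified fix):

```python
from detectron2.config import get_cfg

cfg = get_cfg()
# Lower the base LR; a too-high LR is a common cause of immediate divergence.
cfg.SOLVER.BASE_LR = 0.001
# Enable detectron2's built-in gradient clipping to cap exploding gradients.
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True
cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE = "norm"  # clip by total gradient norm
cfg.SOLVER.CLIP_GRADIENTS.NORM_TYPE = 2.0
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0
```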

Hello, I encountered the same problem and solved it by downgrading detectron2 from 0.4 to 0.3. That fixed the Inf/NaN at the very beginning, although I still hit it again after a few thousand iterations. I guess there are many possible causes of this error, but that was my case. Hope this helps.
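
For reference, the downgrade uses detectron2's standard wheel index; the cu110/torch1.7 index below is just an example, so match it to your own CUDA/PyTorch versions, and then verify the installed version:

```python
# Downgrade via detectron2's wheel index (example index only; pick the one
# matching your CUDA/PyTorch versions):
#   python -m pip install detectron2==0.3 \
#     -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu110/torch1.7/index.html
import detectron2

print(detectron2.__version__)  # should print "0.3" after the downgrade
```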

On my RTX 3090 with CUDA 12.0 I still could not solve it; I then rented a different card (an RTX 3060) and training ran successfully.