NaN loss while training
Park-ing-lot opened this issue · 1 comment
Following your instructions, I trained on the COCO 2017 train and val data with this command:

CUDA_VISIBLE_DEVICES=2 python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_101_FPN_1x.yaml" --skip-test SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1 MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000
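For reference, these overrides look like the linear scaling rule applied to the default multi-GPU schedule. A rough sketch of that arithmetic, assuming the usual 8-GPU 1x reference values (IMS_PER_BATCH=16, BASE_LR=0.02, MAX_ITER=90000, STEPS=(60000, 80000)):

```python
# Sketch of the linear scaling rule: shrinking the batch by a factor k
# divides the learning rate by k and multiplies the schedule length by k.
# The reference values below are assumptions about the default 8-GPU config.
reference = {"ims_per_batch": 16, "base_lr": 0.02,
             "max_iter": 90000, "steps": (60000, 80000)}
target_batch = 2
scale = reference["ims_per_batch"] // target_batch  # 8

single_gpu = {
    "ims_per_batch": target_batch,
    "base_lr": reference["base_lr"] / scale,               # 0.0025
    "max_iter": reference["max_iter"] * scale,             # 720000
    "steps": tuple(s * scale for s in reference["steps"]), # (480000, 640000)
}
print(single_gpu)
```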
During training, the loss stays around 8 and does not drop.
After about 6000 steps, the model starts producing NaN losses.
Do you have any idea why the loss becomes NaN? What could be the problem?
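In case it helps with debugging, here is a minimal sketch (not part of the repo; check_losses and loss_dict are just illustrative names) of a check that could be added inside the training loop to catch the exact iteration and loss head where a value stops being finite:

```python
import math

def check_losses(loss_dict, iteration):
    # loss_dict: name -> scalar loss value (float or 0-dim tensor) for this step.
    # Raise as soon as any individual loss is NaN or Inf, so the offending
    # iteration and loss term are reported before the weights are corrupted.
    for name, value in loss_dict.items():
        v = float(value)
        if not math.isfinite(v):
            raise FloatingPointError(f"{name} = {v} at iteration {iteration}")
```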
My partner and I ran into the same problem. We tried training the R101 network after uncommenting the RPN, and it is working so far (we are currently at iteration 31K+). We realize this differs from the training procedure in the CVPR VCRCNN paper; our guess is that the backbone would not be trained well with the RPN removed, but we may be wrong.
Requesting @Wangt-CN to comment on this.