NaN loss while training
Park-ing-lot opened this issue · 1 comment
Following your instructions, I trained on the COCO 2017 train and val data with this command:

CUDA_VISIBLE_DEVICES=2 python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_101_FPN_1x.yaml" --skip-test SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1 MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000
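For reference, these overrides look like the linear scaling rule applied to the default multi-GPU schedule. A rough sketch of that arithmetic, assuming the usual 8-GPU 1x reference values (IMS_PER_BATCH=16, BASE_LR=0.02, MAX_ITER=90000, STEPS=(60000, 80000)):

```python
# Sketch of the linear scaling rule: shrinking the batch by a factor k
# divides the learning rate by k and multiplies the schedule length by k.
# The reference values below are assumptions about the default 8-GPU config.
reference = {"ims_per_batch": 16, "base_lr": 0.02,
             "max_iter": 90000, "steps": (60000, 80000)}
target_batch = 2
scale = reference["ims_per_batch"] // target_batch  # 8

single_gpu = {
    "ims_per_batch": target_batch,
    "base_lr": reference["base_lr"] / scale,               # 0.0025
    "max_iter": reference["max_iter"] * scale,             # 720000
    "steps": tuple(s * scale for s in reference["steps"]), # (480000, 640000)
}
print(single_gpu)
```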
During training, the loss stays around 8 and does not drop.
After about 6000 steps, the model starts producing NaN losses.
Do you have any idea why the loss becomes NaN? What could be the problem?
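In case it helps with debugging, here is a minimal sketch (not part of the repo; check_losses and loss_dict are just illustrative names) of a check that could be added inside the training loop to catch the exact iteration and loss head where a value stops being finite:

```python
import math

def check_losses(loss_dict, iteration):
    # loss_dict: name -> scalar loss value (float or 0-dim tensor) for this step.
    # Raise as soon as any individual loss is NaN or Inf, so the offending
    # iteration and loss term are reported before the weights are corrupted.
    for name, value in loss_dict.items():
        v = float(value)
        if not math.isfinite(v):
            raise FloatingPointError(f"{name} = {v} at iteration {iteration}")
```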
My partner and I ran into the same problem. We tried training the R101 network after uncommenting the RPN, and it is working so far (we are currently at iteration 31K+). We realize this differs from the training procedure in the CVPR VCRCNN paper; our guess is that the backbone would not be trained well with the RPN removed, but we may be wrong.
Requesting @Wangt-CN to comment on this.