NAN error: assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
Closed this issue · 10 comments
When training AnchorDETR with 8 images per batch and a learning rate of 0.0001, I keep getting a NaN error, such as:
generalized_box_iou
assert (boxes1[:, 2:] >= boxes1[:, :2]).all(), f"incorrect boxes, boxes1 {boxes1}"
AssertionError: incorrect boxes, boxes1 tensor([[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]], device='cuda:4')
Any idea how to fix this issue?
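One detail worth spelling out about the traceback: every comparison involving NaN evaluates to False, so the assertion fails on all-NaN boxes even though no box is actually mis-ordered. The failure is a symptom of non-finite predictions produced upstream, not of bad box coordinates as such. The following is a minimal pure-Python sketch of that behaviour; the helper names `boxes_are_ordered` and `count_non_finite` are hypothetical and are not part of AnchorDETR or DETR.

```python
import math

def boxes_are_ordered(boxes):
    """Same condition as the assertion: x2 >= x1 and y2 >= y1 for every box."""
    return all(x2 >= x1 and y2 >= y1 for x1, y1, x2, y2 in boxes)

def count_non_finite(boxes):
    """Number of boxes that contain at least one NaN or inf coordinate."""
    return sum(1 for box in boxes if not all(math.isfinite(v) for v in box))

good = [[0.1, 0.2, 0.5, 0.6]]   # a valid (x1, y1, x2, y2) box
bad = [[math.nan] * 4]          # what the traceback shows

print(boxes_are_ordered(good), count_non_finite(good))
print(boxes_are_ordered(bad), count_non_finite(bad))
```

Running a finiteness check like this on the model outputs before the IoU call would report the NaNs directly instead of a misleading "incorrect boxes" message.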
@yformer Hi, I do not know what your setting is, since our default setting is batch size 8 and learning rate 0.0001. Could you provide more details?
Yes, I used the original training setting on a machine with 8 V100 GPUs, e.g., images per batch = 8, learning rate = 0.0001, feature level = 1.
@yformer We have not met this problem. What modifications have you made?
@tangjiuqi097 The code is the same as anchor_detr.py, except that it uses d2 (detectron2) for the dataloader and evaluation. I also saw this issue in detr, facebookresearch/detr#101.
@yformer Can you push the code to your repo? Otherwise, I cannot help you find the problem.
@yformer I got this error today as well.
The reason may be that I trained for too long, about 300000 iterations, without stepping the lr. However, I don't know the root cause, because the error is only raised at a certain step. I have reproduced it at least twice, yet the last eval epoch gave a normal mAP, which is very strange.
@jinfagang, this error is quite weird. It only shows up sometimes. I have not figured out the real reason for it.
@yformer Hi, do you plan to open-source your AnchorDETR d2 implementation? I am currently not able to reproduce the training result, although I can train with the original version at bs=2. I don't know what I missed in my implementation.
However, I tried transferring all the weights from the official AnchorDETR to my d2 version, and the result looks normal. The evaluation shows some gap, but that might be caused by the 91 output -> 81 output problem.
Can you open-source your d2 version that reproduced the training result?
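For readers unfamiliar with the "91 output -> 81 output" remark: COCO category ids run from 1 to 90 with ten ids unused, so a head indexed by raw category id has 91 slots, while a d2-style head typically uses 80 contiguous class indices plus background. The sketch below shows one way the remapping could look, assuming the official checkpoint's class dimension is indexed by raw COCO id. That assumption, and the helper names `coco_id_to_contiguous` and `remap_scores`, are illustrative and not taken from either code base.

```python
# COCO category ids run from 1 to 90, but these ten ids are unused.
UNUSED_COCO_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}

def coco_id_to_contiguous():
    """Map each used COCO category id to a contiguous index in 0..79."""
    used = [i for i in range(1, 91) if i not in UNUSED_COCO_IDS]
    return {coco_id: idx for idx, coco_id in enumerate(used)}

def remap_scores(scores_91, mapping):
    """Pick the 80 used entries out of a 91-long score row (index = raw COCO id)."""
    out = [0.0] * len(mapping)
    for coco_id, idx in mapping.items():
        out[idx] = scores_91[coco_id]
    return out

mapping = coco_id_to_contiguous()
scores_80 = remap_scores(list(range(91)), mapping)
print(len(mapping), mapping[1], mapping[13], mapping[90])
```

If the transferred classification weights are not reordered in this way, per-class scores end up attached to the wrong labels, which would be consistent with a gap at evaluation time.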
This issue has not been active for a long time and will be closed in 5 days. Feel free to re-open it if you have further concerns.