NAN error: assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

Question

Closed this issue 3 years ago · 10 comments

When training AnchorDETR with 8 images per batch and a learning rate of 0.0001, I keep getting a NaN error such as:

generalized_box_iou
assert (boxes1[:, 2:] >= boxes1[:, :2]).all(), f"incorrect boxes, boxes1 {boxes1}"
AssertionError: incorrect boxes, boxes1 tensor([[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
...,
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]], device='cuda:4')

Any idea how to fix this issue?
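
A note on reading the traceback: the assertion does not necessarily mean the boxes are inverted. Any ordered comparison involving NaN evaluates to false, so an all-NaN box fails the `x1 >= x0` check and trips the assert. That suggests the NaN was likely produced upstream (in the predictions or an earlier update) rather than inside `generalized_box_iou` itself. A minimal pure-Python sketch of this behaviour follows; `box_is_valid` and `box_has_nan` are hypothetical helpers that mirror the check for a single `(x0, y0, x1, y1)` box and are not part of the AnchorDETR code:

```python
import math

def box_is_valid(box):
    """Same condition as the assert, for one (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return x1 >= x0 and y1 >= y0

def box_has_nan(box):
    """True if any coordinate is NaN."""
    return any(math.isnan(v) for v in box)

nan = float("nan")
good = (0.1, 0.2, 0.5, 0.6)
bad = (nan, nan, nan, nan)

print(box_is_valid(good), box_has_nan(good))  # True False
print(box_is_valid(bad), box_has_nan(bad))    # False True
```

Checking for NaN separately from the ordering test distinguishes "the model diverged" from "the box format is wrong", which call for different fixes.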

Answer 1 · 2021-11-02T03:32:12.000Z

@yformer Hi, I do not know what your setting is, since our default setting is also a batch size of 8 and a learning rate of 0.0001. Could you provide more details?

Answer 2 · 2021-11-02T16:16:54.000Z

Yes, I used the original training setting on a machine with 8 V100 GPUs, with images per batch = 8, learning rate = 0.0001, and feature level = 1.

Answer 3 · 2021-11-03T10:22:40.000Z

@yformer We have not encountered this problem. What modifications have you made?

Answer 4 · 2021-11-03T17:33:19.000Z

@tangjiuqi097 The code is the same as anchor_detr.py, except that it uses d2 for the dataloader and evaluation. I also saw this issue reported for DETR: facebookresearch/detr#101.

Answer 5 · 2021-11-05T06:21:38.000Z

@yformer Can you push the code to your repo? Otherwise, I cannot help you find the problem.

Answer 6 · 2021-11-12T06:15:11.000Z

@yformer I got this error today as well.

The reason may be that I trained for too long, about 300,000 iterations, without stepping the learning rate. However, I don't know the root cause: the error is raised only at a certain step (I have reproduced it at least twice), yet the last evaluation epoch gave a normal mAP, which is very strange.
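
Since the error appears only at a particular step, one way to narrow it down may be to check every loss term for NaN/Inf at each iteration and stop at the first non-finite one, so that the offending iteration can be inspected. The sketch below is a hypothetical helper: the name `assert_finite_losses` and the loss keys are illustrative assumptions, not taken from the AnchorDETR code. It also assumes the loss values have already been converted to Python floats (for example with `.item()`):

```python
import math

def assert_finite_losses(loss_dict, iteration):
    """Raise if any loss term is NaN or Inf, naming the term and the step."""
    bad = {name: value for name, value in loss_dict.items()
           if not math.isfinite(value)}
    if bad:
        raise FloatingPointError(
            f"non-finite losses at iteration {iteration}: {bad}"
        )

# Finite losses pass silently.
assert_finite_losses({"loss_ce": 0.71, "loss_bbox": 0.33, "loss_giou": 0.52}, iteration=1)

# A NaN term stops the run and reports the step and the offending term.
try:
    assert_finite_losses({"loss_ce": 0.71, "loss_giou": float("nan")}, iteration=2)
except FloatingPointError as err:
    print(err)
```

Logging the iteration number and current learning rate alongside this check would make it easier to tell whether the divergence is tied to the learning-rate schedule, as suspected above.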

Answer 7 · 2021-11-19T18:31:04.000Z

@jinfagang, this error is quite weird. It shows up only sometimes. I have not figured out the real cause.

Answer 8 · 2021-11-20T05:13:16.000Z

@yformer Hi, do you plan to open-source your AnchorDETR d2 implementation? I am currently unable to reproduce the training result with mine, although I can train the original version with bs=2. I don't know what I missed in my implementation.

However, I tried transferring all the weights from the official AnchorDETR to my d2 version, and the results look normal. There is some gap in the evaluation, but that might be caused by the 91-output -> 81-output mismatch.

Could you open-source your d2 version that reproduces the training result?

Answer 10 · 2022-03-09T17:41:55.000Z

This issue has been inactive for a long time and will be closed in 5 days. Feel free to re-open it if you have further concerns.