Model cannot achieve reasonable performance
luohao123 opened this issue · 8 comments
The loss details:
[11/09 18:06:08 d2.utils.events]: eta: 5 days, 21:02:18 iter: 269419 total_loss: 7.201 loss_ce: 0.5373 loss_bbox: 0.1547 loss_giou: 0.4278 loss_ce_0: 0.6729 loss_bbox_0: 0.1874 loss_giou_0: 0.5025 loss_ce_1: 0.6068 loss_bbox_1: 0.173 loss_giou_1: 0.4575 loss_ce_2: 0.5717 loss_bbox_2: 0.1626 loss_giou_2: 0.4458 loss_ce_3: 0.541 loss_bbox_3: 0.1573 loss_giou_3: 0.4347 loss_ce_4: 0.5384 loss_bbox_4: 0.1562 loss_giou_4: 0.4286 time: 1.0181 data_time: 1.5350 lr: 0.00025 max_mem: 14791M
[11/09 18:06:51 d2.utils.events]: eta: 5 days, 22:55:05 iter: 269439 total_loss: 7.124 loss_ce: 0.5293 loss_bbox: 0.1541 loss_giou: 0.437 loss_ce_0: 0.6552 loss_bbox_0: 0.1879 loss_giou_0: 0.506 loss_ce_1: 0.6087 loss_bbox_1: 0.1702 loss_giou_1: 0.4631 loss_ce_2: 0.561 loss_bbox_2: 0.1596 loss_giou_2: 0.4427 loss_ce_3: 0.5349 loss_bbox_3: 0.1559 loss_giou_3: 0.4441 loss_ce_4: 0.5312 loss_bbox_4: 0.154 loss_giou_4: 0.438 time: 1.0183 data_time: 1.5585 lr: 0.00025 max_mem: 14791M
[11/09 18:07:31 d2.utils.events]: eta: 6 days, 0:18:00 iter: 269459 total_loss: 7.087 loss_ce: 0.5369 loss_bbox: 0.1535 loss_giou: 0.4281 loss_ce_0: 0.6547 loss_bbox_0: 0.1863 loss_giou_0: 0.4903 loss_ce_1: 0.6212 loss_bbox_1: 0.1689 loss_giou_1: 0.4419 loss_ce_2: 0.5789 loss_bbox_2: 0.1613 loss_giou_2: 0.432 loss_ce_3: 0.5384 loss_bbox_3: 0.1571 loss_giou_3: 0.4318 loss_ce_4: 0.5363 loss_bbox_4: 0.1552 loss_giou_4: 0.4288 time: 1.0185 data_time: 1.3509 lr: 0.00025 max_mem: 14791M
[11/09 18:08:15 d2.utils.events]: eta: 6 days, 2:07:05 iter: 269479 total_loss: 7.144 loss_ce: 0.5288 loss_bbox: 0.1506 loss_giou: 0.4319 loss_ce_0: 0.6625 loss_bbox_0: 0.1837 loss_giou_0: 0.5021 loss_ce_1: 0.6254 loss_bbox_1: 0.1636 loss_giou_1: 0.4544 loss_ce_2: 0.5744 loss_bbox_2: 0.1575 loss_giou_2: 0.4443 loss_ce_3: 0.5387 loss_bbox_3: 0.1546 loss_giou_3: 0.4393 loss_ce_4: 0.5338 loss_bbox_4: 0.1506 loss_giou_4: 0.4333 time: 1.0187 data_time: 1.5636 lr: 0.00025 max_mem: 14791M
[11/09 18:08:53 d2.utils.events]: eta: 6 days, 3:29:08 iter: 269499 total_loss: 7.108 loss_ce: 0.5345 loss_bbox: 0.1494 loss_giou: 0.4192 loss_ce_0: 0.6598 loss_bbox_0: 0.1836 loss_giou_0: 0.493 loss_ce_1: 0.6207 loss_bbox_1: 0.1635 loss_giou_1: 0.4406 loss_ce_2: 0.5721 loss_bbox_2: 0.1562 loss_giou_2: 0.4258 loss_ce_3: 0.5431 loss_bbox_3: 0.1535 loss_giou_3: 0.4198 loss_ce_4: 0.5371 loss_bbox_4: 0.1493 loss_giou_4: 0.4208 time: 1.0188 data_time: 1.3193 lr: 0.00025 max_mem: 14791M
[11/09 18:09:35 d2.utils.events]: eta: 6 days, 5:01:11 iter: 269519 total_loss: 7.19 loss_ce: 0.5461 loss_bbox: 0.1545 loss_giou: 0.4287 loss_ce_0: 0.6717 loss_bbox_0: 0.1893 loss_giou_0: 0.4965 loss_ce_1: 0.6333 loss_bbox_1: 0.168 loss_giou_1: 0.446 loss_ce_2: 0.5802 loss_bbox_2: 0.1623 loss_giou_2: 0.4413 loss_ce_3: 0.5493 loss_bbox_3: 0.158 loss_giou_3: 0.4331 loss_ce_4: 0.5474 loss_bbox_4: 0.1548 loss_giou_4: 0.4282 time: 1.0190 data_time: 1.4416 lr: 0.00025 max_mem: 14791M
[11/09 18:10:15 d2.utils.events]: eta: 6 days, 6:02:00 iter: 269539 total_loss: 7.134 loss_ce: 0.5348 loss_bbox: 0.1551 loss_giou: 0.429 loss_ce_0: 0.675 loss_bbox_0: 0.1894 loss_giou_0: 0.5001 loss_ce_1: 0.6203 loss_bbox_1: 0.1686 loss_giou_1: 0.4485 loss_ce_2: 0.5788 loss_bbox_2: 0.1595 loss_giou_2: 0.4378 loss_ce_3: 0.541 loss_bbox_3: 0.1573 loss_giou_3: 0.4331 loss_ce_4: 0.5333 loss_bbox_4: 0.1555 loss_giou_4: 0.428 time: 1.0192 data_time: 1.4106 lr: 0.00025 max_mem: 14791M
[11/09 18:10:57 d2.utils.events]: eta: 6 days, 6:59:26 iter: 269559 total_loss: 7.307 loss_ce: 0.5538 loss_bbox: 0.1532 loss_giou: 0.4468 loss_ce_0: 0.6663 loss_bbox_0: 0.191 loss_giou_0: 0.5254 loss_ce_1: 0.6273 loss_bbox_1: 0.1713 loss_giou_1: 0.4765 loss_ce_2: 0.5875 loss_bbox_2: 0.1621 loss_giou_2: 0.4541 loss_ce_3: 0.5546 loss_bbox_3: 0.159 loss_giou_3: 0.4495 loss_ce_4: 0.5531 loss_bbox_4: 0.1542 loss_giou_4: 0.4497 time: 1.0194 data_time: 1.4533 lr: 0.00025 max_mem: 14791M
[11/09 18:11:40 d2.utils.events]: eta: 6 days, 8:15:17 iter: 269579 total_loss: 7.227 loss_ce: 0.5415 loss_bbox: 0.1573 loss_giou: 0.4409 loss_ce_0: 0.6725 loss_bbox_0: 0.1919 loss_giou_0: 0.5168 loss_ce_1: 0.6282 loss_bbox_1: 0.1735 loss_giou_1: 0.4678 loss_ce_2: 0.5838 loss_bbox_2: 0.1632 loss_giou_2: 0.4516 loss_ce_3: 0.5479 loss_bbox_3: 0.1602 loss_giou_3: 0.4449 loss_ce_4: 0.5413 loss_bbox_4: 0.1586 loss_giou_4: 0.4386 time: 1.0196 data_time: 1.5338 lr: 0.00025 max_mem: 14791M
[11/09 18:12:22 d2.utils.events]: eta: 6 days, 9:28:14 iter: 269599 total_loss: 7.328 loss_ce: 0.5701 loss_bbox: 0.1519 loss_giou: 0.4229 loss_ce_0: 0.6699 loss_bbox_0: 0.1927 loss_giou_0: 0.5077 loss_ce_1: 0.6325 loss_bbox_1: 0.1714 loss_giou_1: 0.462 loss_ce_2: 0.5905 loss_bbox_2: 0.1627 loss_giou_2: 0.4419 loss_ce_3: 0.573 loss_bbox_3: 0.1565 loss_giou_3: 0.4271 loss_ce_4: 0.5656 loss_bbox_4: 0.1539 loss_giou_4: 0.4261 time: 1.0198 data_time: 1.4658 lr: 0.00025 max_mem: 14791M
[11/09 18:12:58 d2.utils.events]: eta: 6 days, 10:37:18 iter: 269619 total_loss: 7.19 loss_ce: 0.5339 loss_bbox: 0.151 loss_giou: 0.4358 loss_ce_0: 0.6533 loss_bbox_0: 0.1901 loss_giou_0: 0.5237 loss_ce_1: 0.6136 loss_bbox_1: 0.1697 loss_giou_1: 0.474 loss_ce_2: 0.5661 loss_bbox_2: 0.1593 loss_giou_2: 0.4502 loss_ce_3: 0.536 loss_bbox_3: 0.1549 loss_giou_3: 0.4385 loss_ce_4: 0.5346 loss_bbox_4: 0.152 loss_giou_4: 0.437 time: 1.0200 data_time: 1.2044 lr: 0.00025 max_mem: 14791M
[11/09 18:13:34 d2.utils.events]: eta: 6 days, 11:24:26 iter: 269639 total_loss: 7.066 loss_ce: 0.5231 loss_bbox: 0.1525 loss_giou: 0.4221 loss_ce_0: 0.6647 loss_bbox_0: 0.1865 loss_giou_0: 0.4883 loss_ce_1: 0.621 loss_bbox_1: 0.1694 loss_giou_1: 0.4476 loss_ce_2: 0.5678 loss_bbox_2: 0.1581 loss_giou_2: 0.4252 loss_ce_3: 0.5282 loss_bbox_3: 0.1568 loss_giou_3: 0.4222 loss_ce_4: 0.5208 loss_bbox_4: 0.1535 loss_giou_4: 0.4208 time: 1.0201 data_time: 1.1394 lr: 0.00025 max_mem: 14791M
[11/09 18:14:15 d2.utils.events]: eta: 6 days, 12:15:16 iter: 269659 total_loss: 7.009 loss_ce: 0.5305 loss_bbox: 0.1514 loss_giou: 0.4266 loss_ce_0: 0.6662 loss_bbox_0: 0.1827 loss_giou_0: 0.4943 loss_ce_1: 0.6239 loss_bbox_1: 0.1659 loss_giou_1: 0.4535 loss_ce_2: 0.5688 loss_bbox_2: 0.1579 loss_giou_2: 0.4381 loss_ce_3: 0.5381 loss_bbox_3: 0.1543 loss_giou_3: 0.4321 loss_ce_4: 0.5308 loss_bbox_4: 0.152 loss_giou_4: 0.4288 time: 1.0203 data_time: 1.4332 lr: 0.00025 max_mem: 14791M
[11/09 18:14:58 d2.utils.events]: eta: 6 days, 13:13:26 iter: 269679 total_loss: 7.208 loss_ce: 0.528 loss_bbox: 0.1505 loss_giou: 0.4317 loss_ce_0: 0.6555 loss_bbox_0: 0.188 loss_giou_0: 0.5105 loss_ce_1: 0.615 loss_bbox_1: 0.1687 loss_giou_1: 0.4612 loss_ce_2: 0.5632 loss_bbox_2: 0.1545 loss_giou_2: 0.4478 loss_ce_3: 0.5334 loss_bbox_3: 0.1547 loss_giou_3: 0.4404 loss_ce_4: 0.5309 loss_bbox_4: 0.1506 loss_giou_4: 0.4336 time: 1.0205 data_time: 1.5298 lr: 0.00025 max_mem: 14791M
[11/09 18:15:40 d2.utils.events]: eta: 6 days, 14:06:07 iter: 269699 total_loss: 7.038 loss_ce: 0.5201 loss_bbox: 0.1528 loss_giou: 0.4172 loss_ce_0: 0.6552 loss_bbox_0: 0.1873 loss_giou_0: 0.4905 loss_ce_1: 0.616 loss_bbox_1: 0.171 loss_giou_1: 0.4468 loss_ce_2: 0.5559 loss_bbox_2: 0.1616 loss_giou_2: 0.4358 loss_ce_3: 0.5275 loss_bbox_3: 0.156 loss_giou_3: 0.4246 loss_ce_4: 0.5207 loss_bbox_4: 0.1557 loss_giou_4: 0.4166 time: 1.0207 data_time: 1.4436 lr: 0.00025 max_mem: 14791M
[11/09 18:16:18 d2.utils.events]: eta: 6 days, 14:48:21 iter: 269719 total_loss: 7.189 loss_ce: 0.5444 loss_bbox: 0.1526 loss_giou: 0.4341 loss_ce_0: 0.6616 loss_bbox_0: 0.1879 loss_giou_0: 0.5051 loss_ce_1: 0.6222 loss_bbox_1: 0.1699 loss_giou_1: 0.4601 loss_ce_2: 0.5797 loss_bbox_2: 0.1613 loss_giou_2: 0.4451 loss_ce_3: 0.5507 loss_bbox_3: 0.1576 loss_giou_3: 0.4412 loss_ce_4: 0.5477 loss_bbox_4: 0.1532 loss_giou_4: 0.4354 time: 1.0208 data_time: 1.2946 lr: 0.00025 max_mem: 14791M
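As a quick sanity check on logs like these, one can parse the d2 event lines and compare early vs. late total_loss; over the window above it is essentially flat around 7.1. A minimal sketch (the log path and the quartile comparison are my own choices, not from this thread):

```python
import re

# Matches detectron2 event lines of the form shown above.
pattern = re.compile(r"iter: (\d+).*?total_loss: ([\d.]+)")

points = []
with open("training.log") as f:  # hypothetical path to the saved d2 log
    for line in f:
        m = pattern.search(line)
        if m:
            points.append((int(m.group(1)), float(m.group(2))))

# Compare the mean total_loss over the first and last quarters of the run.
q = max(1, len(points) // 4)
early = sum(loss for _, loss in points[:q]) / q
late = sum(loss for _, loss in points[-q:]) / q
print(f"first quarter: {early:.3f}, last quarter: {late:.3f}")
```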
I think you should first check your code. Can you try our official code, rather than your d2 version, to check whether it also has this problem?
@tangjiuqi097 I think reproducing with bs=8 on your code is meaningless, since your result must be right. But when I ran your code with bs=80, it core dumped with an OOM; I can only try bs=80 on d2.
Currently, it seems it cannot get a good result with a higher bs, which is the same conclusion @yformmer reached.
@tangjiuqi097 What's your default bs? 1 per GPU?
@luohao123 The default setting uses the DC5 feature. I believe it cannot handle 10 images on one device. Maybe your setting uses the C5 feature? But your result is still not right: the C5 feature should reach about 35 AP before the learning-rate drop.
The training memory will not differ significantly from your d2 version. You can try our official code with dilation set to False if you want the C5 feature instead of the DC5 feature.
BTW, you can also check the d2 version you reproduced by running it with the same batch size, to help us find out whether the problem is related to the reproduced code.
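For readers following along: the C5 vs. DC5 distinction is just whether the last ResNet stage downsamples (stride 32) or keeps stride 16 via dilation. A minimal illustration using torchvision's ResNet, which mirrors how DETR-style code builds these backbones (a sketch, not the official implementation):

```python
import torchvision

# C5: standard ResNet-50; the last stage (res5) downsamples to stride 32.
c5 = torchvision.models.resnet50(
    replace_stride_with_dilation=[False, False, False])

# DC5: replace the stride-2 downsampling in res5 with dilation, keeping
# stride 16. The res5 feature map is 4x larger, which is why DC5 runs out
# of memory much sooner as the per-GPU batch size grows.
dc5 = torchvision.models.resnet50(
    replace_stride_with_dilation=[False, False, True])
```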
@tangjiuqi097 Thank you. I think my version uses C5.
My config looks like:
```yaml
RESNETS:
  DEPTH: 50
  STRIDE_IN_1X1: False
  OUT_FEATURES: ["res2", "res3", "res4", "res5"]
```
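If it helps debugging, here is a minimal sketch for checking what the merged config actually resolves to before launching. It assumes a DETR-style d2 wrapper whose add_detr_config registers the non-stock keys (SOLVER.OPTIMIZER, SOLVER.BACKBONE_MULTIPLIER); the yaml path is hypothetical:

```python
from detectron2.config import get_cfg
from d2.detr import add_detr_config  # from the DETR repo's d2 wrapper

cfg = get_cfg()
add_detr_config(cfg)  # register DETR-specific keys before merging
cfg.merge_from_file("configs/detr_r50_c5.yaml")  # hypothetical path

# Verify what will actually be used at train time.
print(cfg.MODEL.RESNETS.OUT_FEATURES)  # expect ["res2", "res3", "res4", "res5"]
print(cfg.SOLVER.IMS_PER_BATCH, cfg.SOLVER.BASE_LR, cfg.SOLVER.STEPS)
```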
However, even with C5, the AP doesn't go up even at iteration 250000; this convergence speed is no different from DETR. How can I make it faster?
My current LR scheduler is the same as the DETR d2 version:
```yaml
SOLVER:
  AMP:
    ENABLED: true
  IMS_PER_BATCH: 80
  BASE_LR: 0.00025  # 0.00025 is better
  STEPS: (369600,)
  MAX_ITER: 554400
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 10
  WEIGHT_DECAY: 0.0001
  OPTIMIZER: "ADAMW"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    # CLIP_TYPE: "norm"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
```
Maybe I need to change STEPS?
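For what it's worth, STEPS: (369600,) with MAX_ITER: 554400 at IMS_PER_BATCH: 80 works out to roughly 375 epochs on COCO train2017, i.e. a DETR-length schedule. If the model is meant to converge in about 50 epochs with an LR drop at epoch 40 (treat that schedule as an assumption to check against the official repo), the iteration counts would be recomputed like this:

```python
# Back-of-the-envelope schedule conversion. COCO train2017 size and the
# 50-epoch / drop-at-40 split are assumptions to adapt to your setup.
num_images = 118287          # COCO train2017
ims_per_batch = 80           # from the SOLVER config above

iters_per_epoch = num_images / ims_per_batch   # ~1478.6
max_iter = round(50 * iters_per_epoch)         # ~73929
lr_drop = round(40 * iters_per_epoch)          # ~59144

print(f"MAX_ITER: {max_iter}, STEPS: ({lr_drop},)")
```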
@luohao123 Please do not use the d2 version you reproduced until it can reproduce the same performance as ours with the same batch size of 8. Otherwise, we cannot tell where the problem is. You can try our official code for a larger batch size.
This issue has not been active for a long time and will be closed in 5 days. Feel free to re-open it if you have further concerns.