megvii-research/AnchorDETR

Model cannot achieve a reasonable performance

luohao123 opened this issue · 8 comments

Using bs=80, the AP plateaus around 30 and does not go up anymore:


What could be the reason?

The model and the criterion are all the same as AnchorDETR, and all the loss weights are the same...

but the performance is still bad, even though it does converge fast in the first several iterations.

The loss details:

[11/09 18:06:08 d2.utils.events]:  eta: 5 days, 21:02:18  iter: 269419  total_loss: 7.201  loss_ce: 0.5373  loss_bbox: 0.1547  loss_giou: 0.4278  loss_ce_0: 0.6729  loss_bbox_0: 0.1874  loss_giou_0: 0.5025  loss_ce_1: 0.6068  loss_bbox_1: 0.173  loss_giou_1: 0.4575  loss_ce_2: 0.5717  loss_bbox_2: 0.1626  loss_giou_2: 0.4458  loss_ce_3: 0.541  loss_bbox_3: 0.1573  loss_giou_3: 0.4347  loss_ce_4: 0.5384  loss_bbox_4: 0.1562  loss_giou_4: 0.4286  time: 1.0181  data_time: 1.5350  lr: 0.00025  max_mem: 14791M
[11/09 18:06:51 d2.utils.events]:  eta: 5 days, 22:55:05  iter: 269439  total_loss: 7.124  loss_ce: 0.5293  loss_bbox: 0.1541  loss_giou: 0.437  loss_ce_0: 0.6552  loss_bbox_0: 0.1879  loss_giou_0: 0.506  loss_ce_1: 0.6087  loss_bbox_1: 0.1702  loss_giou_1: 0.4631  loss_ce_2: 0.561  loss_bbox_2: 0.1596  loss_giou_2: 0.4427  loss_ce_3: 0.5349  loss_bbox_3: 0.1559  loss_giou_3: 0.4441  loss_ce_4: 0.5312  loss_bbox_4: 0.154  loss_giou_4: 0.438  time: 1.0183  data_time: 1.5585  lr: 0.00025  max_mem: 14791M
[11/09 18:07:31 d2.utils.events]:  eta: 6 days, 0:18:00  iter: 269459  total_loss: 7.087  loss_ce: 0.5369  loss_bbox: 0.1535  loss_giou: 0.4281  loss_ce_0: 0.6547  loss_bbox_0: 0.1863  loss_giou_0: 0.4903  loss_ce_1: 0.6212  loss_bbox_1: 0.1689  loss_giou_1: 0.4419  loss_ce_2: 0.5789  loss_bbox_2: 0.1613  loss_giou_2: 0.432  loss_ce_3: 0.5384  loss_bbox_3: 0.1571  loss_giou_3: 0.4318  loss_ce_4: 0.5363  loss_bbox_4: 0.1552  loss_giou_4: 0.4288  time: 1.0185  data_time: 1.3509  lr: 0.00025  max_mem: 14791M
[11/09 18:08:15 d2.utils.events]:  eta: 6 days, 2:07:05  iter: 269479  total_loss: 7.144  loss_ce: 0.5288  loss_bbox: 0.1506  loss_giou: 0.4319  loss_ce_0: 0.6625  loss_bbox_0: 0.1837  loss_giou_0: 0.5021  loss_ce_1: 0.6254  loss_bbox_1: 0.1636  loss_giou_1: 0.4544  loss_ce_2: 0.5744  loss_bbox_2: 0.1575  loss_giou_2: 0.4443  loss_ce_3: 0.5387  loss_bbox_3: 0.1546  loss_giou_3: 0.4393  loss_ce_4: 0.5338  loss_bbox_4: 0.1506  loss_giou_4: 0.4333  time: 1.0187  data_time: 1.5636  lr: 0.00025  max_mem: 14791M
[11/09 18:08:53 d2.utils.events]:  eta: 6 days, 3:29:08  iter: 269499  total_loss: 7.108  loss_ce: 0.5345  loss_bbox: 0.1494  loss_giou: 0.4192  loss_ce_0: 0.6598  loss_bbox_0: 0.1836  loss_giou_0: 0.493  loss_ce_1: 0.6207  loss_bbox_1: 0.1635  loss_giou_1: 0.4406  loss_ce_2: 0.5721  loss_bbox_2: 0.1562  loss_giou_2: 0.4258  loss_ce_3: 0.5431  loss_bbox_3: 0.1535  loss_giou_3: 0.4198  loss_ce_4: 0.5371  loss_bbox_4: 0.1493  loss_giou_4: 0.4208  time: 1.0188  data_time: 1.3193  lr: 0.00025  max_mem: 14791M
[11/09 18:09:35 d2.utils.events]:  eta: 6 days, 5:01:11  iter: 269519  total_loss: 7.19  loss_ce: 0.5461  loss_bbox: 0.1545  loss_giou: 0.4287  loss_ce_0: 0.6717  loss_bbox_0: 0.1893  loss_giou_0: 0.4965  loss_ce_1: 0.6333  loss_bbox_1: 0.168  loss_giou_1: 0.446  loss_ce_2: 0.5802  loss_bbox_2: 0.1623  loss_giou_2: 0.4413  loss_ce_3: 0.5493  loss_bbox_3: 0.158  loss_giou_3: 0.4331  loss_ce_4: 0.5474  loss_bbox_4: 0.1548  loss_giou_4: 0.4282  time: 1.0190  data_time: 1.4416  lr: 0.00025  max_mem: 14791M
[11/09 18:10:15 d2.utils.events]:  eta: 6 days, 6:02:00  iter: 269539  total_loss: 7.134  loss_ce: 0.5348  loss_bbox: 0.1551  loss_giou: 0.429  loss_ce_0: 0.675  loss_bbox_0: 0.1894  loss_giou_0: 0.5001  loss_ce_1: 0.6203  loss_bbox_1: 0.1686  loss_giou_1: 0.4485  loss_ce_2: 0.5788  loss_bbox_2: 0.1595  loss_giou_2: 0.4378  loss_ce_3: 0.541  loss_bbox_3: 0.1573  loss_giou_3: 0.4331  loss_ce_4: 0.5333  loss_bbox_4: 0.1555  loss_giou_4: 0.428  time: 1.0192  data_time: 1.4106  lr: 0.00025  max_mem: 14791M
[11/09 18:10:57 d2.utils.events]:  eta: 6 days, 6:59:26  iter: 269559  total_loss: 7.307  loss_ce: 0.5538  loss_bbox: 0.1532  loss_giou: 0.4468  loss_ce_0: 0.6663  loss_bbox_0: 0.191  loss_giou_0: 0.5254  loss_ce_1: 0.6273  loss_bbox_1: 0.1713  loss_giou_1: 0.4765  loss_ce_2: 0.5875  loss_bbox_2: 0.1621  loss_giou_2: 0.4541  loss_ce_3: 0.5546  loss_bbox_3: 0.159  loss_giou_3: 0.4495  loss_ce_4: 0.5531  loss_bbox_4: 0.1542  loss_giou_4: 0.4497  time: 1.0194  data_time: 1.4533  lr: 0.00025  max_mem: 14791M
[11/09 18:11:40 d2.utils.events]:  eta: 6 days, 8:15:17  iter: 269579  total_loss: 7.227  loss_ce: 0.5415  loss_bbox: 0.1573  loss_giou: 0.4409  loss_ce_0: 0.6725  loss_bbox_0: 0.1919  loss_giou_0: 0.5168  loss_ce_1: 0.6282  loss_bbox_1: 0.1735  loss_giou_1: 0.4678  loss_ce_2: 0.5838  loss_bbox_2: 0.1632  loss_giou_2: 0.4516  loss_ce_3: 0.5479  loss_bbox_3: 0.1602  loss_giou_3: 0.4449  loss_ce_4: 0.5413  loss_bbox_4: 0.1586  loss_giou_4: 0.4386  time: 1.0196  data_time: 1.5338  lr: 0.00025  max_mem: 14791M
[11/09 18:12:22 d2.utils.events]:  eta: 6 days, 9:28:14  iter: 269599  total_loss: 7.328  loss_ce: 0.5701  loss_bbox: 0.1519  loss_giou: 0.4229  loss_ce_0: 0.6699  loss_bbox_0: 0.1927  loss_giou_0: 0.5077  loss_ce_1: 0.6325  loss_bbox_1: 0.1714  loss_giou_1: 0.462  loss_ce_2: 0.5905  loss_bbox_2: 0.1627  loss_giou_2: 0.4419  loss_ce_3: 0.573  loss_bbox_3: 0.1565  loss_giou_3: 0.4271  loss_ce_4: 0.5656  loss_bbox_4: 0.1539  loss_giou_4: 0.4261  time: 1.0198  data_time: 1.4658  lr: 0.00025  max_mem: 14791M
[11/09 18:12:58 d2.utils.events]:  eta: 6 days, 10:37:18  iter: 269619  total_loss: 7.19  loss_ce: 0.5339  loss_bbox: 0.151  loss_giou: 0.4358  loss_ce_0: 0.6533  loss_bbox_0: 0.1901  loss_giou_0: 0.5237  loss_ce_1: 0.6136  loss_bbox_1: 0.1697  loss_giou_1: 0.474  loss_ce_2: 0.5661  loss_bbox_2: 0.1593  loss_giou_2: 0.4502  loss_ce_3: 0.536  loss_bbox_3: 0.1549  loss_giou_3: 0.4385  loss_ce_4: 0.5346  loss_bbox_4: 0.152  loss_giou_4: 0.437  time: 1.0200  data_time: 1.2044  lr: 0.00025  max_mem: 14791M
[11/09 18:13:34 d2.utils.events]:  eta: 6 days, 11:24:26  iter: 269639  total_loss: 7.066  loss_ce: 0.5231  loss_bbox: 0.1525  loss_giou: 0.4221  loss_ce_0: 0.6647  loss_bbox_0: 0.1865  loss_giou_0: 0.4883  loss_ce_1: 0.621  loss_bbox_1: 0.1694  loss_giou_1: 0.4476  loss_ce_2: 0.5678  loss_bbox_2: 0.1581  loss_giou_2: 0.4252  loss_ce_3: 0.5282  loss_bbox_3: 0.1568  loss_giou_3: 0.4222  loss_ce_4: 0.5208  loss_bbox_4: 0.1535  loss_giou_4: 0.4208  time: 1.0201  data_time: 1.1394  lr: 0.00025  max_mem: 14791M
[11/09 18:14:15 d2.utils.events]:  eta: 6 days, 12:15:16  iter: 269659  total_loss: 7.009  loss_ce: 0.5305  loss_bbox: 0.1514  loss_giou: 0.4266  loss_ce_0: 0.6662  loss_bbox_0: 0.1827  loss_giou_0: 0.4943  loss_ce_1: 0.6239  loss_bbox_1: 0.1659  loss_giou_1: 0.4535  loss_ce_2: 0.5688  loss_bbox_2: 0.1579  loss_giou_2: 0.4381  loss_ce_3: 0.5381  loss_bbox_3: 0.1543  loss_giou_3: 0.4321  loss_ce_4: 0.5308  loss_bbox_4: 0.152  loss_giou_4: 0.4288  time: 1.0203  data_time: 1.4332  lr: 0.00025  max_mem: 14791M
[11/09 18:14:58 d2.utils.events]:  eta: 6 days, 13:13:26  iter: 269679  total_loss: 7.208  loss_ce: 0.528  loss_bbox: 0.1505  loss_giou: 0.4317  loss_ce_0: 0.6555  loss_bbox_0: 0.188  loss_giou_0: 0.5105  loss_ce_1: 0.615  loss_bbox_1: 0.1687  loss_giou_1: 0.4612  loss_ce_2: 0.5632  loss_bbox_2: 0.1545  loss_giou_2: 0.4478  loss_ce_3: 0.5334  loss_bbox_3: 0.1547  loss_giou_3: 0.4404  loss_ce_4: 0.5309  loss_bbox_4: 0.1506  loss_giou_4: 0.4336  time: 1.0205  data_time: 1.5298  lr: 0.00025  max_mem: 14791M
[11/09 18:15:40 d2.utils.events]:  eta: 6 days, 14:06:07  iter: 269699  total_loss: 7.038  loss_ce: 0.5201  loss_bbox: 0.1528  loss_giou: 0.4172  loss_ce_0: 0.6552  loss_bbox_0: 0.1873  loss_giou_0: 0.4905  loss_ce_1: 0.616  loss_bbox_1: 0.171  loss_giou_1: 0.4468  loss_ce_2: 0.5559  loss_bbox_2: 0.1616  loss_giou_2: 0.4358  loss_ce_3: 0.5275  loss_bbox_3: 0.156  loss_giou_3: 0.4246  loss_ce_4: 0.5207  loss_bbox_4: 0.1557  loss_giou_4: 0.4166  time: 1.0207  data_time: 1.4436  lr: 0.00025  max_mem: 14791M
[11/09 18:16:18 d2.utils.events]:  eta: 6 days, 14:48:21  iter: 269719  total_loss: 7.189  loss_ce: 0.5444  loss_bbox: 0.1526  loss_giou: 0.4341  loss_ce_0: 0.6616  loss_bbox_0: 0.1879  loss_giou_0: 0.5051  loss_ce_1: 0.6222  loss_bbox_1: 0.1699  loss_giou_1: 0.4601  loss_ce_2: 0.5797  loss_bbox_2: 0.1613  loss_giou_2: 0.4451  loss_ce_3: 0.5507  loss_bbox_3: 0.1576  loss_giou_3: 0.4412  loss_ce_4: 0.5477  loss_bbox_4: 0.1532  loss_giou_4: 0.4354  time: 1.0208  data_time: 1.2946  lr: 0.00025  max_mem: 14791M


I think you should first check your code. Can you try our official code, rather than your d2 version, to check whether it also has this problem?

@tangjiuqi097 I think reproducing with bs=8 on your code is meaningless, since your result must be right. But when I ran your code with bs=80 it core dumped with OOM, so I can only try bs=80 on d2.
Currently it seems it cannot get a good result with a higher bs; same conclusion as @yformmer

@tangjiuqi097 what's your default bs? 1 per GPU?

@luohao123 The default setting uses the DC5 feature. I believe it cannot handle 10 images on one device. Maybe your setting uses the C5 feature? But your result is still not right: the C5 feature should reach about 35 AP before the learning-rate drop.
The training memory should not differ significantly from your d2 version. If you want to try the C5 feature instead of DC5, you can run our official code with dilation set to False.

BTW, you can also check the d2 version you reproduced by using the same batch size, to help us find out whether the problem is related to the reproduced code.

@tangjiuqi097 Thank you. I think my version was C5.

My config looks like:

RESNETS:
    DEPTH: 50
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]

However, even with C5, the AP doesn't go up even at iteration 250000; this convergence speed is no different from DETR. How can I make it faster?
The current lr scheduler is the same as the DETR d2 version:

SOLVER:
  AMP:
    ENABLED: true
  IMS_PER_BATCH: 80
  BASE_LR: 0.00025 # 0.00025 is better
  STEPS: (369600,)
  MAX_ITER: 554400
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 10
  WEIGHT_DECAY: 0.0001
  OPTIMIZER: "ADAMW"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    # CLIP_TYPE: "norm"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
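One thing worth double-checking in the config above: with CLIP_TYPE "full_model", d2 clips the global gradient norm over all parameters, so CLIP_VALUE 0.01 is a much tighter cap than the 0.1 max-norm the reference DETR training script uses. A minimal sketch of what that setting effectively does (plain PyTorch, not the actual d2 code):

```python
import torch

# Sketch: CLIP_TYPE "full_model" behaves like a single clip_grad_norm_
# call over every parameter of the model.
model = torch.nn.Linear(8, 8)
loss = model(torch.randn(4, 8)).pow(2).sum()
loss.backward()

# CLIP_VALUE: 0.01, NORM_TYPE: 2.0 from the config above.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.01, norm_type=2.0)

# After clipping, the global L2 norm of all gradients is at most ~0.01.
global_norm = torch.norm(
    torch.stack([p.grad.norm(2) for p in model.parameters()]), 2
)
```

If the unclipped norm is routinely far above 0.01, nearly every update is being scaled down hard, which can slow convergence.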

Maybe I need to change STEPS?
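The STEPS/MAX_ITER above look like DETR's 300-epoch d2 schedule (554400 iterations at bs=64); at bs=80 those same iteration counts correspond to roughly 375 COCO epochs, far longer than AnchorDETR's 50-epoch schedule with the lr drop at epoch 40. A small sketch of converting an epoch schedule into d2 iteration counts (the helper name is made up; COCO train2017 has 118,287 images):

```python
# Hypothetical helper: convert a DETR-style epoch schedule into d2's
# iteration-based STEPS / MAX_ITER for a given batch size.
COCO_TRAIN_IMAGES = 118_287  # size of coco train2017


def epoch_schedule_to_iters(total_epochs: int, drop_epoch: int, ims_per_batch: int):
    iters_per_epoch = COCO_TRAIN_IMAGES / ims_per_batch
    return int(drop_epoch * iters_per_epoch), int(total_epochs * iters_per_epoch)


# AnchorDETR's default 50-epoch schedule (lr drop at epoch 40), at bs=80:
step, max_iter = epoch_schedule_to_iters(50, 40, 80)
print(step, max_iter)  # STEPS: (59143,)  MAX_ITER: 73929
```

Shortening the schedule this way does not fix the underlying accuracy gap, but it makes the comparison against the official 50-epoch results meaningful.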

@luohao123 Please do not use the d2 version you reproduced until it can reproduce the same performance as ours with the same batch size of 8. Otherwise, we cannot tell what the problem is. You can try our official code for a larger batch size.

This issue has not been active for a long time and will be closed in 5 days. Feel free to re-open it if you have further concerns.