Higher evaluation scores than the ones in the paper
Hi,
I ran your evaluation scripts on the val2014 set of MS-COCO as per your instructions (I used your pretrained weights; I didn't train a model myself):
python test.py [gpu_id] [model] [--init_weights=xxx.caffemodel]
python evalCOCO.py [model]
and got the following results.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.204
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.347
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.213
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.035
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.229
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.450
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.086
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.277
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.361
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.493
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.659
These seem to be even better than the reported values in the technical report (27.7% vs 16.90% for AR@10 and 36.1% vs 31.30% for AR@100).
What am I missing? Is this the wrong way to reproduce your results?
More info:
- I set the useCats command line argument in evalCOCO.py to False (True by default, which naturally gives really low scores).
- I am using the fm-res39 model.
- I removed the useCats parameter from this call to gen_masks in test.py (useCats=args.useCats, vis=args.debug), since it doesn't appear in the definition of gen_masks. I also changed the last parameter name in the call from "vis" to "images". Is this the intended behaviour? If so, I can prepare a pull request in a few hours.
All in all, I must say you've done excellent work, both the residual neck and the attention module are simple yet elegant ideas; kudos.
Hi @PavlosMelissinos,
1. The default value of useCats should be False and that of useSegm should be True. The results you provided are based on box-level criteria. I'm very sorry, I swapped the two by mistake.
2. About point 3, you are right. I'd be very pleased if you could submit a PR.
More than happy to see your interest and participation.
It seems that you were using fm-res39 instead of fm-res39zoom. If you want to reproduce our results, you should use the zoom one with the same params.
Done, check your PR page!
Thanks for your help!
Hi, thanks to @PavlosMelissinos's pull request #12 I could run a test on the first 5k images of the MS-COCO validation set (val2014). However, I could not obtain results similar to those reported in your paper.
I ran the test as follows
$ python test.py 0 fm-res39zoom --init_weights=fm-res39_final_params.caffemodel
$ python evalCOCO.py fm-res39zoom --max_proposal 1000 --nms_threshold 1
and got these results
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.076
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 100 ] = 0.166
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 100 ] = 0.056
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.013
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.094
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.163
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.168
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.289
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.363
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.041
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.400
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.552
When I kept nms_threshold at its default, the AR@10, AR@100, and AR@1k results were worse
$ python evalCOCO.py fm-res39zoom --max_proposal 1000
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.133
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 100 ] = 0.308
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 100 ] = 0.088
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.017
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.150
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.304
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.203
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.287
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.317
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.049
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.403
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.525
What did I do wrong? Can you tell me the right way to reproduce your results?
P.S.: I changed the cocoEval.summarize() code so that it outputs AR@10, AR@100, and AR@1k:
def _summarizeDets():
    stats = np.zeros((12,))
    stats[0] = _summarize(1)
    stats[1] = _summarize(1, iouThr=.5, maxDets=100)
    stats[2] = _summarize(1, iouThr=.75, maxDets=100)
    stats[3] = _summarize(1, areaRng='small', maxDets=100)
    stats[4] = _summarize(1, areaRng='medium', maxDets=100)
    stats[5] = _summarize(1, areaRng='large', maxDets=100)
    stats[6] = _summarize(0, maxDets=10)
    stats[7] = _summarize(0, maxDets=100)
    stats[8] = _summarize(0, maxDets=1000)
    stats[9] = _summarize(0, areaRng='small', maxDets=100)
    stats[10] = _summarize(0, areaRng='medium', maxDets=100)
    stats[11] = _summarize(0, areaRng='large', maxDets=100)
    return stats
and set
cocoEval.params.maxDets = [10, 100, 1000]
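The intuition behind those three numbers can be sketched without pycocotools: AR@k measures how many ground-truth boxes are covered by the top-k scored proposals, so it can only grow with the budget. The boxes and matching rule below are a toy illustration (greedy IoU matching at a 0.5 threshold), not pycocotools' actual _summarize code.

```python
# Toy illustration of how AR changes with the proposal budget (maxDets),
# mirroring the [10, 100, 1000] setting above. Boxes are hypothetical.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def recall_at(proposals, gt_boxes, k, iou_thr=0.5):
    """Fraction of ground-truth boxes matched by any of the top-k proposals."""
    top_k = proposals[:k]  # proposals assumed sorted by score, best first
    matched = sum(1 for g in gt_boxes
                  if any(iou(p, g) >= iou_thr for p in top_k))
    return matched / float(len(gt_boxes))

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
# one good proposal first, nine junk ones, then a late good one
props = [(0, 0, 10, 10)] + [(50, 50, 60, 60)] * 9 + [(20, 20, 30, 30)]
print(recall_at(props, gt, 1))    # -> 0.5, only the first GT box is covered
print(recall_at(props, gt, 100))  # -> 1.0, larger budget covers both
```

This is also why evaluating with a larger maxDets list than the default [1, 10, 100] is needed to report AR@1k at all.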
Hi @shiranakatta @PavlosMelissinos
I feel very sorry about that. I'm occupied with some ridiculous coursework like philosophy and FPGAs: I was asked to write a paper discussing Rousseau's thought and to implement a MIPS ALU, simultaneously, within a month. I have been too exhausted these days to merge this PR and fix bugs. I hope you can solve the problem; I'll check the PR and merge it after my hard time.
@shiranakatta it seems that you've seen the PR provided by @PavlosMelissinos. But there is still a bug in the configuration fm-res39zoom.json: the value of TEST_SCALE should be 1300. Also, the default value of nms_threshold is 1 in test.py.
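A threshold of 1 effectively disables suppression in the usual greedy NMS, since no pair of distinct boxes can have IoU greater than 1, which is why passing --nms_threshold 1 keeps every proposal. The sketch below is the standard algorithm on made-up boxes, not test.py's exact code.

```python
# Minimal greedy NMS sketch: boxes are visited in score order and a box is
# dropped if it overlaps an already-kept box by more than `thresh`.
# With thresh = 1.0 nothing can be dropped, so all proposals survive.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def nms(boxes, thresh):
    """boxes sorted by score, highest first."""
    kept = []
    for b in boxes:
        if all(iou(b, k) <= thresh for k in kept):
            kept.append(b)
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(len(nms(boxes, 0.5)))  # -> 2, the heavily overlapping box is suppressed
print(len(nms(boxes, 1.0)))  # -> 3, threshold 1 keeps everything
```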
Darn, I was afraid I might have botched it!
It's good to know the problem; unfortunately I run out of memory on a 1070 for any TEST_SCALE above 1000 (apparently 8GB are not enough).
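For what it's worth, activation memory in a fully convolutional network grows roughly quadratically with the input scale (this is a rough rule of thumb, not a measurement of this model), which is consistent with 8 GB being enough at TEST_SCALE = 1000 but not at 1300:

```python
# Back-of-envelope: feature maps scale with the square of the test scale,
# so 1300 vs 1000 needs roughly 1.69x the activation memory.
ratio = (1300 / 1000.0) ** 2
print(round(ratio, 2))  # -> 1.69
```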
@voidrank Thank you, I'll report my test results later.
Take your time and good luck with your coursework.
@PavlosMelissinos With TEST_SCALE = 1300 I also ran out of memory on a TITAN X (12GB). Recompiling Caffe without USE_CUDNN := 1 appeared to solve the problem. However, just in case something goes wrong again during the test, I now save results after every 500 images.
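Saving partial results every N images can be sketched as below; run_eval and its file layout are hypothetical stand-ins for the repo's test loop, just to show the checkpointing pattern.

```python
# Periodic-checkpoint pattern like the "save every 500 images" above:
# accumulate results and flush them to a JSON file every `every` images,
# so an out-of-memory crash late in the run doesn't lose everything.
import json
import os
import tempfile

def run_eval(images, every=500, out_dir=None):
    out_dir = out_dir or tempfile.mkdtemp()
    results = []
    for i, img in enumerate(images, 1):
        results.append({"image": img})  # stand-in for the real per-image inference
        if i % every == 0:              # flush a partial dump
            path = os.path.join(out_dir, "results_%06d.json" % i)
            with open(path, "w") as f:
                json.dump(results, f)
    return results, out_dir

results, out = run_eval(["img_%d" % i for i in range(1200)], every=500)
print(len(results))             # -> 1200
print(len(os.listdir(out)))     # -> 2 partial dumps, at image 500 and 1000
```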
Hi, I just want to inform you that I've successfully reproduced your results.
Actually, I even got slightly better AR@1k.
Box proposals
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.120
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 100 ] = 0.195
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 100 ] = 0.128
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.047
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.142
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.221
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.225
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.431
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.576
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.195
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.540
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.675
Segmentation proposals
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.078
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 100 ] = 0.170
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 100 ] = 0.059
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.028
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.093
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.150
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.169
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.313
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.409
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.126
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.401
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.504
Many thanks for your excellent work.
I've managed to reproduce the same results as well. Thanks @shiranakatta for your contribution.
@voidrank I have updated the PR with the discussed modifications. Let me know if you need anything more when you have time to check it out. Good luck with your studies.