Higher evaluation scores than the ones in the paper
Hi,
I ran your evaluation scripts on the val2014 set of MS-COCO as per your instructions (I used your pretrained weights; I didn't train a model myself):
python test.py [gpu_id] [model] [--init_weights=xxx.caffemodel]
python evalCOCO.py [model]
and got the following results.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.204
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.347
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.213
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.035
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.229
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.450
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.086
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.277
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.361
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.493
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.659
These seem to be even better than the reported values in the technical report (27.7% vs 16.90% for AR@10 and 36.1% vs 31.30% for AR@100).
What am I missing? Is this the wrong way to reproduce your results?
More info:
- I set the useCats command line argument in evalCOCO.py to False (True by default, which naturally gives really low scores).
- I am using the fm-res39 model.
- I removed the useCats parameter from this call to gen_masks in test.py (useCats=args.useCats, vis=args.debug), since it doesn't appear in the definition of gen_masks. I also changed the last parameter name in the call from "vis" to "images". Is this the intended behaviour? If so, I can prepare a pull request in a few hours.
All in all, I must say you've done excellent work, both the residual neck and the attention module are simple yet elegant ideas; kudos.
Hi @PavlosMelissinos,
1. The default value of useCats should be False and that of useSegm should be True. The results you provided are based on box-level criteria. I'm very sorry, I swapped the two by mistake.
2. About point 3, you are right. I'd be very pleased if you could submit a PR.
More than happy to see your interest and participation.
It seems that you were using fm-res39 instead of fm-res39zoom. If you want to reproduce our results, you should use the zoom one with the same params.
Done, check your PR page!
Thanks for your help!
Hi, thanks to @PavlosMelissinos's pull request #12 I could run a test on the first 5k images of the MS-COCO validation set (val2014). However, I could not obtain results similar to those reported in your paper.
I ran the test as follows
$ python test.py 0 fm-res39zoom --init_weights=fm-res39_final_params.caffemodel
$ python evalCOCO.py fm-res39zoom --max_proposal 1000 --nms_threshold 1
and got these results
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.076
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 100 ] = 0.166
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 100 ] = 0.056
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.013
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.094
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.163
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.168
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.289
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.363
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.041
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.400
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.552
When I kept nms_threshold at its default, the AR@10, AR@100, and AR@1k results were worse
$ python evalCOCO.py fm-res39zoom --max_proposal 1000
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.133
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 100 ] = 0.308
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 100 ] = 0.088
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.017
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.150
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.304
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.203
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.287
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.317
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.049
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.403
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.525
What did I do wrong? Can you tell me the right way to reproduce your results?
P.S.: I changed the cocoEval.summarize() code so that it outputs AR@10, AR@100, and AR@1k:
def _summarizeDets():
    stats = np.zeros((12,))
    stats[0] = _summarize(1)
    stats[1] = _summarize(1, iouThr=.5, maxDets=100)
    stats[2] = _summarize(1, iouThr=.75, maxDets=100)
    stats[3] = _summarize(1, areaRng='small', maxDets=100)
    stats[4] = _summarize(1, areaRng='medium', maxDets=100)
    stats[5] = _summarize(1, areaRng='large', maxDets=100)
    stats[6] = _summarize(0, maxDets=10)
    stats[7] = _summarize(0, maxDets=100)
    stats[8] = _summarize(0, maxDets=1000)
    stats[9] = _summarize(0, areaRng='small', maxDets=100)
    stats[10] = _summarize(0, areaRng='medium', maxDets=100)
    stats[11] = _summarize(0, areaRng='large', maxDets=100)
    return stats
and set
cocoEval.params.maxDets = [10, 100, 1000]
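The intuition behind those three numbers can be sketched without pycocotools: AR@k measures how many ground-truth boxes are covered by the top-k scored proposals, so it can only grow with the budget. The boxes and matching rule below are a toy illustration (greedy IoU matching at a 0.5 threshold), not pycocotools' actual _summarize code.

```python
# Toy illustration of how AR changes with the proposal budget (maxDets),
# mirroring the [10, 100, 1000] setting above. Boxes are hypothetical.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def recall_at(proposals, gt_boxes, k, iou_thr=0.5):
    """Fraction of ground-truth boxes matched by any of the top-k proposals."""
    top_k = proposals[:k]  # proposals assumed sorted by score, best first
    matched = sum(1 for g in gt_boxes
                  if any(iou(p, g) >= iou_thr for p in top_k))
    return matched / float(len(gt_boxes))

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
# one good proposal first, nine junk ones, then a late good one
props = [(0, 0, 10, 10)] + [(50, 50, 60, 60)] * 9 + [(20, 20, 30, 30)]
print(recall_at(props, gt, 1))    # -> 0.5, only the first GT box is covered
print(recall_at(props, gt, 100))  # -> 1.0, larger budget covers both
```

This is also why evaluating with a larger maxDets list than the default [1, 10, 100] is needed to report AR@1k at all.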
Hi @shiranakatta @PavlosMelissinos
I feel very sorry about that. I'm occupied with some ridiculous coursework like philosophy and FPGAs: I was asked to write a paper discussing Rousseau's thought and to implement a MIPS ALU, simultaneously, within a month. I have been too exhausted these days to merge this PR and fix bugs. I hope you can solve the problem; I'll check the PR and merge it after my hard time.
@shiranakatta it seems that you've seen the PR provided by @PavlosMelissinos. But there is still a bug in the configuration fm-res39zoom.json: the value of TEST_SCALE should be 1300. Also, the default value of nms_threshold is 1 in test.py.
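A threshold of 1 effectively disables suppression in the usual greedy NMS, since no pair of distinct boxes can have IoU greater than 1, which is why passing --nms_threshold 1 keeps every proposal. The sketch below is the standard algorithm on made-up boxes, not test.py's exact code.

```python
# Minimal greedy NMS sketch: boxes are visited in score order and a box is
# dropped if it overlaps an already-kept box by more than `thresh`.
# With thresh = 1.0 nothing can be dropped, so all proposals survive.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def nms(boxes, thresh):
    """boxes sorted by score, highest first."""
    kept = []
    for b in boxes:
        if all(iou(b, k) <= thresh for k in kept):
            kept.append(b)
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(len(nms(boxes, 0.5)))  # -> 2, the heavily overlapping box is suppressed
print(len(nms(boxes, 1.0)))  # -> 3, threshold 1 keeps everything
```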
Darn, I was afraid I might have botched it!
It's good to know the problem; unfortunately I run out of memory on a 1070 for any TEST_SCALE above 1000 (apparently 8GB are not enough).
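For what it's worth, activation memory in a fully convolutional network grows roughly quadratically with the input scale (this is a rough rule of thumb, not a measurement of this model), which is consistent with 8 GB being enough at TEST_SCALE = 1000 but not at 1300:

```python
# Back-of-envelope: feature maps scale with the square of the test scale,
# so 1300 vs 1000 needs roughly 1.69x the activation memory.
ratio = (1300 / 1000.0) ** 2
print(round(ratio, 2))  # -> 1.69
```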
@voidrank Thank you, I'll report my test results later.
Take your time and good luck with your coursework.
@PavlosMelissinos With TEST_SCALE = 1300 I also ran out of memory on a TITAN X (12GB). Recompiling Caffe without USE_CUDNN := 1 appeared to solve the problem. However, just in case something goes wrong again during the test, I now save results after every 500 images.
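Saving partial results every N images can be sketched as below; run_eval and its file layout are hypothetical stand-ins for the repo's test loop, just to show the checkpointing pattern.

```python
# Periodic-checkpoint pattern like the "save every 500 images" above:
# accumulate results and flush them to a JSON file every `every` images,
# so an out-of-memory crash late in the run doesn't lose everything.
import json
import os
import tempfile

def run_eval(images, every=500, out_dir=None):
    out_dir = out_dir or tempfile.mkdtemp()
    results = []
    for i, img in enumerate(images, 1):
        results.append({"image": img})  # stand-in for the real per-image inference
        if i % every == 0:              # flush a partial dump
            path = os.path.join(out_dir, "results_%06d.json" % i)
            with open(path, "w") as f:
                json.dump(results, f)
    return results, out_dir

results, out = run_eval(["img_%d" % i for i in range(1200)], every=500)
print(len(results))             # -> 1200
print(len(os.listdir(out)))     # -> 2 partial dumps, at image 500 and 1000
```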
Hi, I just want to inform you that I've successfully reproduced your results.
Actually, I even got slightly better AR@1k.
Box proposals
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.120
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 100 ] = 0.195
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 100 ] = 0.128
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.047
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.142
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.221
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.225
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.431
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.576
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.195
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.540
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.675
Segmentation proposals
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.078
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 100 ] = 0.170
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 100 ] = 0.059
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.028
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.093
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.150
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.169
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 100 ] = 0.313
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.409
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets= 100 ] = 0.126
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 100 ] = 0.401
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 100 ] = 0.504
Many thanks for your excellent work.
I've managed to reproduce the same results as well. Thanks @shiranakatta for your contribution.
@voidrank I have updated the PR with the discussed modifications. Let me know if you need anything more when you have time to check it out. Good luck with your studies.