cocodataset/cocoapi

Observations on the calculations of COCO metrics

RSly opened this issue · 14 comments

RSly commented

Hi,

I have some observations on the COCO metrics, especially the precision metric, that I would like to share.
It would be great if someone could clarify these points :) /cc @pdollar @tylin

To get a feeling for the system's results, I am computing the COCO average precision. To better explain the issue, I will also compute these metrics over all observations taken as a whole (say, as one large stitched image rather than many separate images), which I call the overall recall/precision here.

Case 1. A system with perfect detection plus one false alarm: in this case, as detailed in the next figure, the COCO average precision comes out to 1.0, completely ignoring the existence of the false alarm!

[image: Case 1 example]

Case 2. A system with zero false alarms: here there are no false alarms, so the overall precision is a perfect 1.0; however, the COCO precision comes out as 0.5! This case is important because it could mean that the COCO average precision penalizes systems with no false alarms and favors the detection side of a system in evaluation. As you may know, systems with zero or few false alarms are of great importance in industrial applications.

[image: Case 2 example]

So I am not sure whether the above cases are bugs, are intentional design decisions for COCO, or whether I am missing something?

pdollar commented

The computation you are describing is not how average precision is computed. I recommend reading http://homepages.inf.ed.ac.uk/ckiw/postscript/ijcv_voc09.pdf, section 4.2, or you can find a number of references online. AP is the area under the precision-recall curve. For each detection you have a confidence. You then match detections (ordered by confidence) to ground truth, and for each recall value you get a precision. You then compute the area under this curve. There are a number of subtleties in this computation, but that's the overall idea. Take a look and I hope that answers your questions. Thanks!
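
For illustration, here is a minimal sketch of that area-under-the-PR-curve idea (this is not the cocoapi implementation; the average_precision helper and its inputs are made up for the example):

import numpy as np

def average_precision(is_tp_sorted, num_gt):
    """is_tp_sorted: True/False per detection, sorted by descending confidence;
    num_gt: number of ground-truth objects."""
    is_tp = np.asarray(is_tp_sorted, dtype=bool)
    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # area under the PR curve (simple rectangle sum, not COCO's 101-point interpolation)
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Case 1 above: one ground-truth object; the true positive outranks the false alarm.
print(average_precision([True, False], num_gt=1))   # -> 1.0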

RSly commented

Thanks @pdollar, I will read the references in depth then :)

I am mainly surprised to get a perfect metric of 1.0 for case 1, where we clearly have a large false alarm!

=> Can we say that the metrics calculated by the COCO API (av. recall = 1, av. precision = 1) do not represent our system well in case 1, with the large false alarm?

tylin commented

@RSly as Piotr mentioned, the detection score is needed to compute the precision/recall curve and average precision, and it is clearly missing from your description.
Case 1 might be confusing for people who are just starting to use the AP metric. Let me give more details below:
All detections are first sorted by score, and then hits and misses are computed.
From the sorted list of hits and misses, we can compute precision and recall at each detection.
The first detection gets (1.0 recall, 1.0 precision) and the second gets (1.0 recall, 0.5 precision).
You get an area under the curve of 1.0 when you plot the precision-recall curve for these two points.

AP is a metric that averages precision over recall.
In practice, your system will need to operate at a certain precision/recall point on the PR curve, and that determines the score threshold used to show detections.
For case 1, you can find a score threshold that shows only the true positive detection and ignores the false alarm.
In that sense, you can have a perfect detection system for this specific case.
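
To make the averaging step concrete, here is a small sketch of COCO-style 101-point interpolation applied to those two PR points (a simplification of what cocoeval's accumulate step does; the variable names are made up):

import numpy as np

# PR points from the walkthrough above: (recall, precision) = (1.0, 1.0) and (1.0, 0.5)
recall = np.array([1.0, 1.0])
precision = np.array([1.0, 0.5])

rec_thrs = np.linspace(0.0, 1.0, 101)   # COCO's 101 recall thresholds
# interpolated precision: max precision among points with recall >= threshold, else 0
interp = [precision[recall >= t].max() if np.any(recall >= t) else 0.0
          for t in rec_thrs]
print(np.mean(interp))   # -> 1.0: the lower-precision point at the same recall adds nothing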

RSly commented

Hi @tylin, thanks for the explanation. It is clear now.

However, before closing this issue, could you please test the attached JSON files with both the Python and Matlab APIs? coco_problem.zip
I get different results from Python and Matlab, as follows:

Python:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 1.000

Matlab:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.500
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.500
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.500
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = NaN
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = NaN
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.500
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = NaN
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = NaN
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.500

pdollar commented

Thanks @RSly for pointing this out. We'll take a look when/if we have more time. I suspect this is a divide-by-0 kind of situation (with errors then propagating to give different results elsewhere). If that's the case, it should hopefully never happen on real data (where there will be multiple errors and successes of every kind). Indeed, using real data we have extensively verified that the results of the Matlab and Python code are the same. Still, it is useful to have checks for this kind of degenerate case, so if we have more time we'll look into it. Thanks.

RSly commented

@pdollar, thanks for the answer.
Unit tests may be of interest here.

The results make sense according to the AP metric, and they depend heavily on how you rank your detections. If your most confident detection is a true positive and there is only one ground-truth object, then regardless of how many false positives you make, you will get an AP of 1, because that is the area under the precision-recall curve.
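
Using the hypothetical average_precision sketch from earlier in the thread, this is easy to check: false alarms ranked below the true positive only add PR points at recall 1.0 with lower precision, which contribute zero extra width to the area.

print(average_precision([True, False, False, False], num_gt=1))   # -> 1.0 despite three false alarms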

botcs commented

Hi guys,
I am implementing a unit test for COCOeval's Python API using a very simple task: I generate two white boxes on a single black image and feed the annotations back as predictions with a confidence score of 1.0.

However, I get results similar to what @RSly has reported.

I have compiled my results and observations here

In short: if I have N boxes, the precision will be 1-1/N if the recall threshold is <= 1-1/N, otherwise the precision will be 0.

@pdollar, @tylin, @RSly could you please help me figure out what the issue could be?

lewfish commented

After struggling with an issue similar to @botcs's above, I found this comment by @bhfs9999 at the bottom of this gist, https://gist.github.com/botcs/5d13a744104ab1fa9fdd9987ea7ff97a, which seems to solve the problem.

I wrote a unit test that just had a single image with a single ground truth box, and a single predicted box with perfect overlap and a score of 1.0. I expected the AP to be 1.0, but it was 0.0. After changing the id of the annotation from 0 to 1, the AP changed to 1.0.
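
For reference, a rough sketch of such a unit test with pycocotools (the field values here are illustrative, not taken from the issue; the key detail is that annotation ids start at 1):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Minimal in-memory ground truth: one image, one category, one box.
# Note the annotation id is 1, not 0.
gt = {
    "images": [{"id": 1, "width": 100, "height": 100}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [10, 10, 20, 20], "area": 400, "iscrowd": 0}],
    "categories": [{"id": 1, "name": "box"}],
}
coco_gt = COCO()
coco_gt.dataset = gt
coco_gt.createIndex()

# Feed the ground truth back as a detection with score 1.0.
dt = [{"image_id": 1, "category_id": 1, "bbox": [10, 10, 20, 20], "score": 1.0}]
coco_dt = coco_gt.loadRes(dt)

ev = COCOeval(coco_gt, coco_dt, "bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()   # expected AP/AR = 1.0; as reported above, with annotation id 0 the AP came out as 0.0

Presumably an id of 0 is problematic because the evaluation code uses 0 in its match arrays to mean "unmatched", so a ground-truth annotation with id 0 ends up being treated as if it were never matched.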

botcs commented

@lewfish Indeed!
@qizhuli helped me debug the issue, and after that things are working as expected. I just forgot to update the solution here...

ividal commented

@botcs Did that have such an effect on a full dataset? I can imagine it bringing down results for a tiny debugging dataset (2 images, 1 not being used because of the 0-index), but did it mess up a real dataset for you?

botcs commented

@ividal Only on the order of 1e-4.

Quoting @tylin above: "The first detection gets (1.0 recall, 1.0 precision) and the second gets (1.0 recall, 0.5 precision)."

@tylin The first detection will never get a recall of 1.0 unless there is only a single ground-truth object.

Quoting @botcs above: "if I have N boxes, the precision will be 1-1/N if the recall threshold is <= 1-1/N, otherwise the precision will be 0."

@botcs Yes, I also found that if you cannot reach 1.0 recall, the precision computed by the cocoeval code will be 0. And it will make AP decrease dramatically if you only have a small number of region proposals (even if all of them are correct).

In the PASCAL VOC paper [1], I found a passage confirming that AP does penalize methods with few false alarms that retrieve only a subset of examples with high precision:

The intention in interpolating the precision/recall curve in this way is to reduce the impact of the "wiggles" in the precision/recall curve, caused by small variations in the ranking of examples. It should be noted that to obtain a high score, a method must have precision at all levels of recall; this penalises methods which retrieve only a subset of examples with high precision (e.g. side views of cars).

[1] Everingham, Mark, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. “The Pascal Visual Object Classes (VOC) Challenge.” International Journal of Computer Vision 88, no. 2 (June 1, 2010): 303–38. https://doi.org/10.1007/s11263-009-0275-4.
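
To illustrate the effect described above (recall levels that cannot be reached contribute zero precision to the average), consider a hypothetical case with 2 ground-truth objects and a single correct detection, so the only PR point is (recall = 0.5, precision = 1.0):

# COCO averages interpolated precision over 101 recall thresholds (0.00, 0.01, ..., 1.00);
# precision is treated as 0 at thresholds above the highest achieved recall.
thresholds_reached = 51                     # thresholds 0.00 through 0.50
ap = (1.0 * thresholds_reached + 0.0 * (101 - thresholds_reached)) / 101
print(ap)                                   # ~0.505, even though the only detection was correct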