OpenNLPLab/AVSBench

IoU & F1-score computation

Devin-Pi opened this issue · 3 comments

Hi, thanks first of all for your wonderful work!

After reading your code, I found something interesting that I cannot understand.

In your IoU calculation, the totally black GTs are also included, so the final result is effectively the ratio of the predicted mask to the whole image. However, when calculating the F1-score, the totally black GTs are removed. Could you explain the reasoning behind the way the IoU and F1-score are calculated?

Thanks again for your amazing work.
Looking forward to your reply!

Hi, thanks for your attention. The F-score ignores the GTs that are entirely black and only reflects the predictions for video frames containing meaningful objects. You can definitely include the black GTs when calculating the F-score; just make sure that all comparison methods use the same evaluation strategy.

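For concreteness, here is a minimal sketch of the two averaging strategies. This is illustrative code rather than the repository's actual evaluation script; the names `f_score`, `mean_f_score`, the `beta2 = 0.3` weighting, and the `skip_black_gt` flag are my own assumptions.

```python
import numpy as np

def f_score(pred, gt, beta2=0.3, eps=1e-8):
    """F-measure for one pair of binary masks (beta^2 = 0.3 is a common
    choice in saliency/AVS evaluation, but treat it as an assumption here)."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)

def mean_f_score(preds, gts, skip_black_gt=True):
    """Average F-score over frames, optionally skipping all-black GTs."""
    scores = []
    for p, g in zip(preds, gts):
        if skip_black_gt and g.sum() == 0:
            continue  # frame has no sounding object; excluded from the average
        scores.append(f_score(p, g))
    return float(np.mean(scores)) if scores else 0.0
```

Note that for an all-black GT the recall, and hence the F-score, is zero for any prediction, so including those frames would pull the average down even for a model that correctly predicts an empty mask; skipping them is one reasonable way to avoid that, as long as all compared methods do the same.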

Thanks for your reply!

One more thing I want to ask: why are the totally black GTs included in the IoU calculation? If there are no sound-emitting objects in a frame, as the GT shows, the prediction mask is expected to be totally black, right? So why include the predicted masks in the IoU calculation even though the corresponding GT is totally black?
Thank you!
Looking forward to your reply!

A totally black GT indicates that no objects are making sounds in the video frame. However, an AVS model may incorrectly segment some objects during inference because it does not fully understand the matching between the objects and the current sound, i.e., the prediction is not black. Therefore, the all-black GTs should be kept in the IoU calculation to examine whether an AVS model over-segments pixels.
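To make that concrete, here is a minimal sketch (not the repository's actual metric code; the `mask_iou` helper and the `eps` smoothing are assumptions on my part) showing how keeping all-black GTs in a per-frame IoU average penalizes over-segmentation:

```python
import numpy as np

def mask_iou(pred, gt, eps=1e-8):
    """IoU for one pair of binary masks (illustrative only)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

# A 4x4 frame with no sounding object: the GT is entirely black.
gt = np.zeros((4, 4), dtype=bool)

# A model that correctly predicts an empty mask scores ~1.0 here ...
print(mask_iou(np.zeros((4, 4), dtype=bool), gt))  # ~1.0

# ... while an over-segmenting model is penalized: every false-positive
# pixel enters the union but not the intersection, driving IoU toward 0.
over_seg = np.zeros((4, 4), dtype=bool)
over_seg[:2, :2] = True
print(mask_iou(over_seg, gt))  # ~0.0
```

Whether an empty prediction on an all-black GT should count as IoU = 1 (as the `eps` smoothing does above) or be handled differently is a design choice; the point is that any falsely segmented pixel lowers the score for such frames.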