fartashf/vsepp

meanr and rsum seem inversely correlated

BigRedT opened this issue · 3 comments

Hi @fartashf,

Great code base, very easy to work with! I had two quick questions regarding evaluation metrics:

  • I noticed that meanr increases as rsum increases and was wondering if you had an explanation for this? (See plots below.) I should mention that these results are with a few modifications to your code: specifically, I used frozen GloVe embeddings that are not trained / backpropagated into.

  • Also, I was wondering what the reason was for choosing rsum for model selection instead of meanr?

[Screenshots, 2019-01-17: training plots of meanr and rsum]

Thanks!

Thanks for using the code.

This is not unexpected, but the difference between the min and max of meanr (~7 in rank) might be too large and worth further inspection. I suggest running a clean clone and comparing the plots. I'm not absolutely certain, but I don't think you would see a gap of more than 2.

It is expected because the MH loss in VSE++ optimizes for R@1. If you care about the average rank, the SH loss in VSE0 is more appropriate. meanr is the average rank of the ground truth over all queries, while rsum is the sum R@1 + R@5 + R@10. meanr is affected by outliers: if an item is retrieved at rank 1000 and an improved model brings it to rank 900, meanr improves slightly while rsum does not change at all.
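That outlier sensitivity is easy to see numerically. The sketch below (function names and the toy rank arrays are mine, not the repo's code) computes both metrics from 0-indexed ground-truth ranks and shows a large rank change that moves meanr but leaves rsum untouched:

```python
import numpy as np

def recall_at_k(ranks, k):
    """Percentage of queries whose ground truth is retrieved in the top k."""
    return 100.0 * np.mean(ranks < k)

def metrics(ranks):
    """ranks: 0-indexed rank of the ground truth for each query."""
    r1, r5, r10 = (recall_at_k(ranks, k) for k in (1, 5, 10))
    rsum = r1 + r5 + r10
    meanr = ranks.mean() + 1  # report the conventional 1-indexed mean rank
    return rsum, meanr

# Toy example: one outlier query moves from rank 1000 to rank 900.
before = np.array([0, 0, 3, 7, 999])
after = np.array([0, 0, 3, 7, 899])

# rsum is identical in both cases (the outlier never enters the top 10),
# while meanr improves by 20.
```

Conversely, moving a query from rank 1 to rank 0 changes R@1 (and hence rsum) but barely moves meanr, which is why the two curves need not track each other.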

In hindsight, it might have been more appropriate to use R@1 for early stopping instead of rsum. rsum was originally used for model selection in the UVS code.

For reference, the metrics are computed in the "# Compute metrics" block, and model selection uses this line in vsepp/train.py (line 202 in 226688a):

currscore = r1 + r5 + r10 + r1i + r5i + r10i

Thanks, that was very helpful!

Just to confirm though - meanr is the average rank of the ground truth in the retrieved list. In case there are multiple ground truths (e.g. 5 captions that match the same image) you take the one with the min rank. And the average is taken across all queries. Is that right?

Yes. Only the minimum rank, i.e. the ground-truth caption retrieved with the lowest rank, is used for evaluation.
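A minimal sketch of that min-rank convention for image-to-caption retrieval (the function name and the similarity-matrix layout are assumptions for illustration, not the repo's exact code):

```python
import numpy as np

def i2t_ranks(sims, captions_per_image=5):
    """sims: (n_images, n_images * captions_per_image) similarity matrix.
    Captions j*cpi ... (j+1)*cpi - 1 are the ground truths for image j."""
    cpi = captions_per_image
    n = sims.shape[0]
    ranks = np.empty(n, dtype=int)
    for i in range(n):
        order = np.argsort(sims[i])[::-1]  # best-matching caption first
        gt = range(i * cpi, (i + 1) * cpi)
        # Position of each ground-truth caption; keep only the best (minimum).
        ranks[i] = min(np.where(order == g)[0][0] for g in gt)
    meanr = ranks.mean() + 1  # 1-indexed mean rank over all queries
    return ranks, meanr
```

So with 5 captions per image, an image query counts as rank 0 (an R@1 hit) as long as any one of its 5 captions is retrieved first.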