meanr and rsum seem inversely correlated
BigRedT opened this issue · 3 comments
Hi @fartashf,
Great code base, very easy to work with! I had two quick questions regarding evaluation metrics:
- I noticed that meanr increases as rsum increases and was wondering if you had an explanation for this? (See plots below.) I should mention that these results are with a few modifications to your code: specifically, I used GloVe embeddings that are kept frozen (not trained / backpropagated into).
- I was also wondering why you chose rsum for model selection instead of meanr?
Thanks!
Thanks for using the code.
This is not unexpected, but the difference between the min and max of meanr (~7 in rank) might be too large and worth further inspection. I suggest running a clean clone and comparing the plots. I'm not absolutely certain, but I don't think you would see a gap of more than 2.

It is expected because the MH loss in VSE++ optimizes R@1. If you care about the average rank, the SH loss in VSE0 is more appropriate. meanr is an average rank over all retrieved items, while rsum is the sum R@1+R@5+R@10. meanr is affected by outliers: if we retrieve an item at rank 1000, then improve the model and get it to rank 900, meanr improves slightly while rsum does not change.
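As a toy illustration (not the repo's actual evaluation code), the snippet below sketches how R@K, rsum, and meanr are typically computed from a list of ground-truth ranks, and why moving a single outlier from rank 1000 to 900 changes meanr but leaves rsum untouched. The helper names and the example ranks are hypothetical:

```python
import numpy as np

def recall_at_k(ranks, k):
    """Percentage of queries whose ground truth lands in the top k (ranks are 0-indexed)."""
    return 100.0 * np.mean(np.asarray(ranks) < k)

def metrics(ranks):
    """Return (rsum, meanr) for a list of 0-indexed ground-truth ranks."""
    rsum = sum(recall_at_k(ranks, k) for k in (1, 5, 10))
    meanr = np.mean(np.asarray(ranks)) + 1  # report the mean rank 1-indexed
    return rsum, meanr

# Hypothetical ranks for 5 queries, one of them a far outlier.
before = [0, 2, 7, 4, 999]
after = [0, 2, 7, 4, 899]  # the outlier improves from rank 1000 to 900

print(metrics(before))
print(metrics(after))  # same rsum, smaller (better) meanr
```

Since the outlier never enters the top 10 in either case, rsum is identical for both runs, while meanr drops by 20.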
In hindsight, it might have been more appropriate to use R@1 for early stopping instead of rsum. rsum was originally used for model selection in the UVS code.
For reference, these lines compute the metrics:
- Line 246 in 226688a
- Line 202 in 226688a
Thanks that was very helpful!
Just to confirm though - meanr is the average rank of the ground truth in the retrieved list. In case there are multiple ground truths (e.g. 5 captions that match the same image) you take the one with the min rank. And the average is taken across all queries. Is that right?
Yes. Only the caption retrieved with the lowest rank (the minimum over the ground-truth captions) is used for evaluation.
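To make the min-rank convention concrete, here is a hedged sketch of image-to-text ranking where each image has several ground-truth captions and only the best-ranked one counts. The function name `i2t_ranks`, the caption layout (captions `[k*i, k*(i+1))` belong to image `i`), and the toy similarity matrix are all hypothetical, not taken from the repo:

```python
import numpy as np

def i2t_ranks(sim, caps_per_image):
    """Per-image 0-indexed rank of the best-ranked ground-truth caption.

    sim: (n_images, n_images * caps_per_image) similarity matrix where
    captions [caps_per_image*i, caps_per_image*(i+1)) match image i.
    """
    ranks = []
    for i, scores in enumerate(sim):
        order = np.argsort(scores)[::-1]  # caption indices, best first
        gt = range(caps_per_image * i, caps_per_image * (i + 1))
        # position of each ground-truth caption in the sorted list; keep the minimum
        ranks.append(min(int(np.where(order == c)[0][0]) for c in gt))
    return np.array(ranks)

# Toy example: 2 images, 2 captions each.
sim = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.0, 0.8, 0.7, 0.4],
])
print(i2t_ranks(sim, caps_per_image=2))  # one min-rank per image
```

meanr would then be the average of these per-query minimum ranks, as confirmed above.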