CIDEr not working with multiple reference strings (or am I not getting it?)
MaximusWhite opened this issue · 2 comments
I was trying to get the CIDEr metric to work, but its behaviour is rather confusing.
Here's my code:
```python
nlgeval.compute_individual_metrics(
    ["coffee and buns", "tea and croissants", "pop and snacks"],
    "tea and buns",
)
```
Here's the output:
```python
{
    'Bleu_1': 0.9999999993333338,
    'Bleu_2': 0.9999999992500005,
    'Bleu_3': 9.999999990555565e-06,
    'Bleu_4': 5.623413247451623e-06,
    'METEOR': 0.272954092584186,
    'ROUGE_L': 0.6666666666666666,
    'CIDEr': 0.0,
    'SkipThoughtCS': 0.92198694,
    'EmbeddingAverageCosineSimilarity': 0.895264,
    'EmbeddingAverageCosineSimilairty': 0.895264,
    'VectorExtremaCosineSimilarity': 0.843911,
    'GreedyMatchingScore': 0.889737
}
```
The CIDEr score is 0, even though according to the documentation:

> Object oriented API for repeated calls in a script - single example
>
> ```python
> from nlgeval import NLGEval
> nlgeval = NLGEval()  # loads the models
> metrics_dict = nlgeval.compute_metrics(references, hypothesis)
> ```
>
> where `references` is a list of ground truth reference text strings and `hypothesis` is the hypothesis text string.
So, to me it sounds like simply having multiple reference strings and a single hypothesis string should be enough...
Am I missing something?
Now, I did notice this other point in the documentation:

> CIDEr by default (with idf parameter set to "corpus" mode) computes IDF values using the reference sentences provided. Thus, CIDEr score for a reference dataset with only 1 image (or example for NLG) will be zero.

I'm not sure I follow what this is talking about. What does it mean by "dataset with only 1 image"? I thought CIDEr works with reference sentences, not images. Is "1 image" supposed to correspond to a particular number of reference sentences?

> When evaluating using one (or few) images, set idf to "coco-val-df" instead, which uses IDF from the MSCOCO Validation Dataset for reliable results.
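If I'm reading the "corpus" mode right, the IDF would collapse when there is only one example. Here's a rough unigram sketch of the IDF term from the CIDEr paper (my own toy code, not this library's implementation) that shows what I mean:

```python
import math

# Toy version of CIDEr's corpus IDF. Each "document" is the set of
# reference sentences for one example/image.
def idf(term, ref_sets):
    num_docs = len(ref_sets)
    # document frequency: in how many examples' references the term occurs
    df = sum(any(term in ref.split() for ref in refs) for refs in ref_sets)
    return math.log(num_docs / max(df, 1))

# With a single example, every term that occurs at all has
# df == num_docs == 1, so idf == log(1) == 0 and CIDEr collapses to 0.
one_example = [["coffee and buns", "tea and croissants", "pop and snacks"]]
print(idf("tea", one_example))    # 0.0

# With more examples, rarer terms get positive IDF weight.
more_examples = one_example + [["a dog runs in the park"]]
print(idf("tea", more_examples))  # log(2/1) ≈ 0.693
```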
Also, where can I set that idf parameter? Is that something that can be done via the `NLGEval` object?
I'd really appreciate a clarification on this as I might just not understand how CIDEr works or how to use it properly.
Looks like you are using it correctly. As you can see in our examples, the CIDEr score is also 0 with just 1 reference: https://github.com/Maluuba/nlg-eval/blob/master/nlgeval/tests/test_nlgeval.py
Sorry about that confusing note. I think it's because CIDEr was originally made for evaluating text that was generated based on images so this note was copied/adapted from something else. In our case, 1 image = 1 reference.
I'm not too sure about the IDF stuff. I think the note is implying that if you want to use the `compute_individual_metrics` method and still have CIDEr work with one or a few examples, you can take the files from that link and copy them into this repo: https://github.com/Maluuba/nlg-eval/tree/master/nlgeval/pycocoevalcap matches https://github.com/vrama91/coco-caption/tree/master/pycocoevalcap.
BTW, you're calling `compute_individual_metrics`, but your snippet "from the docs" says `compute_metrics`. Maybe that's not from the latest docs?
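In case it helps, here's how I'd call the two methods (a sketch assuming the current API; for `compute_metrics`, the references are transposed so that each inner list holds one reference per example, aligned with the hypothesis list):

```python
from nlgeval import NLGEval

nlgeval = NLGEval()  # loads the models

# Single example: a list of reference strings plus one hypothesis string.
metrics = nlgeval.compute_individual_metrics(
    ["coffee and buns", "tea and croissants", "pop and snacks"],
    "tea and buns",
)

# Whole corpus: one hypothesis per example; each element of ref_list is
# one reference "stream" aligned with hyp_list.
hyp_list = ["tea and buns", "a dog runs"]
ref_list = [
    ["coffee and buns", "a dog runs in the park"],  # first reference for each example
    ["tea and croissants", "the dog is running"],   # second reference for each example
]
corpus_metrics = nlgeval.compute_metrics(ref_list, hyp_list)
```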
The MS COCO evaluation code we build upon supports only CIDEr-D, and only in `corpus` mode. The score of 0 is expected behavior here. To try out other modes, you'll have to use https://github.com/vrama91/coco-caption/tree/master/pycocoevalcap.
Clarification: "1 image" here refers to 1 NLG sample (all references + the hypothesis for a single data point).
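To make that concrete, here's a minimal sketch against the bundled scorer (assuming `nlgeval.pycocoevalcap` is importable from an installed checkout): with one sample the corpus IDF is degenerate and the score is 0, while scoring several samples at once yields a nonzero CIDEr-D.

```python
from nlgeval.pycocoevalcap.cider.cider import Cider

# One NLG sample == one "image": the corpus IDF is degenerate, so the
# score is 0 regardless of how good the hypothesis is.
gts = {0: ["coffee and buns", "tea and croissants", "pop and snacks"]}
res = {0: ["tea and buns"]}
score, _ = Cider().compute_score(gts, res)
print(score)  # 0.0

# Several samples: IDF is estimated across the whole corpus, so the
# score becomes informative.
gts = {
    0: ["coffee and buns", "tea and croissants"],
    1: ["a dog runs in the park"],
    2: ["the cat sleeps on the mat"],
}
res = {0: ["tea and buns"], 1: ["a dog runs"], 2: ["a cat sleeps"]}
score, _ = Cider().compute_score(gts, res)
print(score)  # > 0
```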