CIDEr not working with multiple reference strings (or am I not getting it?)
MaximusWhite opened this issue · 2 comments
I was trying to get the CIDEr metric to work, but its behaviour is rather confusing.
Here's my code:
```python
nlgeval.compute_individual_metrics(
    ["coffee and buns", "tea and croissants", "pop and snacks"],
    "tea and buns",
)
```
Here's the output:
```python
{
    'Bleu_1': 0.9999999993333338,
    'Bleu_2': 0.9999999992500005,
    'Bleu_3': 9.999999990555565e-06,
    'Bleu_4': 5.623413247451623e-06,
    'METEOR': 0.272954092584186,
    'ROUGE_L': 0.6666666666666666,
    'CIDEr': 0.0,
    'SkipThoughtCS': 0.92198694,
    'EmbeddingAverageCosineSimilarity': 0.895264,
    'EmbeddingAverageCosineSimilairty': 0.895264,
    'VectorExtremaCosineSimilarity': 0.843911,
    'GreedyMatchingScore': 0.889737
}
```
The CIDEr score is 0, even though according to the documentation:

> Object oriented API for repeated calls in a script - single example
>
> ```python
> from nlgeval import NLGEval
> nlgeval = NLGEval()  # loads the models
> metrics_dict = nlgeval.compute_metrics(references, hypothesis)
> ```
>
> where `references` is a list of ground truth reference text strings and `hypothesis` is the hypothesis text string.
So, to me it sounds like simply having multiple reference strings and a single hypothesis string should be enough...
Am I missing something?
Now, I did notice this other point in the documentation:

> CIDEr by default (with idf parameter set to "corpus" mode) computes IDF values using the reference sentences provided. Thus, CIDEr score for a reference dataset with only 1 image (or example for NLG) will be zero.

I'm not sure I follow what this is talking about. What does it mean by "dataset with only 1 image"? I thought CIDEr works with reference sentences, not images. Is "1 image" supposed to correspond to a particular number of reference sentences?

> When evaluating using one (or few) images, set idf to "coco-val-df" instead, which uses IDF from the MSCOCO Validation Dataset for reliable results.
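If I'm reading the "corpus" mode right, the IDF would collapse when there is only one example. Here's a rough unigram sketch of the IDF term from the CIDEr paper (my own toy code, not this library's implementation) that shows what I mean:

```python
import math

# Toy version of CIDEr's corpus IDF. Each "document" is the set of
# reference sentences for one example/image.
def idf(term, ref_sets):
    num_docs = len(ref_sets)
    # document frequency: in how many examples' references the term occurs
    df = sum(any(term in ref.split() for ref in refs) for refs in ref_sets)
    return math.log(num_docs / max(df, 1))

# With a single example, every term that occurs at all has
# df == num_docs == 1, so idf == log(1) == 0 and CIDEr collapses to 0.
one_example = [["coffee and buns", "tea and croissants", "pop and snacks"]]
print(idf("tea", one_example))    # 0.0

# With more examples, rarer terms get positive IDF weight.
more_examples = one_example + [["a dog runs in the park"]]
print(idf("tea", more_examples))  # log(2/1) ≈ 0.693
```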
Also, where can I set that idf parameter? Is that something that can be done via the `NLGEval` object?
I'd really appreciate a clarification on this as I might just not understand how CIDEr works or how to use it properly.
Looks like you are using it correctly. As you can see in our examples, the CIDEr score is also 0 with just 1 reference: https://github.com/Maluuba/nlg-eval/blob/master/nlgeval/tests/test_nlgeval.py
Sorry about that confusing note. I think it's because CIDEr was originally made for evaluating text that was generated based on images so this note was copied/adapted from something else. In our case, 1 image = 1 reference.
I'm not too sure about the IDF stuff. I think the note is implying that if you want to use the `compute_individual_metrics` method and still have CIDEr work with one or a few examples, you can take the files from that link and copy them into this repo: https://github.com/Maluuba/nlg-eval/tree/master/nlgeval/pycocoevalcap matches https://github.com/vrama91/coco-caption/tree/master/pycocoevalcap.
BTW, you're calling `compute_individual_metrics`, but your snippet "from the docs" says `compute_metrics`. Maybe that's not from the latest docs?
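In case it helps, here's how I'd call the two methods (a sketch assuming the current API; for `compute_metrics`, the references are transposed so that each inner list holds one reference per example, aligned with the hypothesis list):

```python
from nlgeval import NLGEval

nlgeval = NLGEval()  # loads the models

# Single example: a list of reference strings plus one hypothesis string.
metrics = nlgeval.compute_individual_metrics(
    ["coffee and buns", "tea and croissants", "pop and snacks"],
    "tea and buns",
)

# Whole corpus: one hypothesis per example; each element of ref_list is
# one reference "stream" aligned with hyp_list.
hyp_list = ["tea and buns", "a dog runs"]
ref_list = [
    ["coffee and buns", "a dog runs in the park"],  # first reference for each example
    ["tea and croissants", "the dog is running"],   # second reference for each example
]
corpus_metrics = nlgeval.compute_metrics(ref_list, hyp_list)
```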
The MS COCO evaluation code we build upon supports only CIDEr-D, and only in `corpus` mode. The score of 0 is expected behavior here. To try out other modes, you'll have to use https://github.com/vrama91/coco-caption/tree/master/pycocoevalcap.
Clarification: "1 image" here refers to 1 NLG sample (all references + the hypothesis for a single data point).
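To make that concrete, here's a minimal sketch against the bundled scorer (assuming `nlgeval.pycocoevalcap` is importable from an installed checkout): with one sample the corpus IDF is degenerate and the score is 0, while scoring several samples at once yields a nonzero CIDEr-D.

```python
from nlgeval.pycocoevalcap.cider.cider import Cider

# One NLG sample == one "image": the corpus IDF is degenerate, so the
# score is 0 regardless of how good the hypothesis is.
gts = {0: ["coffee and buns", "tea and croissants", "pop and snacks"]}
res = {0: ["tea and buns"]}
score, _ = Cider().compute_score(gts, res)
print(score)  # 0.0

# Several samples: IDF is estimated across the whole corpus, so the
# score becomes informative.
gts = {
    0: ["coffee and buns", "tea and croissants"],
    1: ["a dog runs in the park"],
    2: ["the cat sleeps on the mat"],
}
res = {0: ["tea and buns"], 1: ["a dog runs"], 2: ["a cat sleeps"]}
score, _ = Cider().compute_score(gts, res)
print(score)  # > 0
```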