METEOR results do not match NLTK / Wikipedia implementation
jlim13 opened this issue · 5 comments
I ran one example from the Wikipedia page for METEOR through this library and got different results.
https://en.wikipedia.org/wiki/METEOR#Examples
# setup assumed from nlgeval's README (not shown in the original report):
from nlgeval import NLGEval
nlgeval = NLGEval()

gts = ['the cat sat on the mat']
predictions = ['the cat was sat on the mat']
metrics_dict = nlgeval.compute_metrics([gts], predictions)
print(metrics_dict)
yields an answer of
{'Bleu_1': 0.8571428570204083, 'Bleu_2': 0.7559289459014655, 'Bleu_3': 0.6114214173619154, 'Bleu_4': 0.4889230223420641, 'METEOR': 0.5119556177223324, 'ROUGE_L': 0.9360613810741688, 'CIDEr': 0.0}
The example on the wiki page is:
Score: 0.9654 = Fmean: 0.9836 × (1 − Penalty: 0.0185)
Fmean: 0.9836 = 10 × Precision: 0.8571 × Recall: 1.0000 / (Recall: 1.0000 + 9 × Precision: 0.8571)
Penalty: 0.0185 = 0.5 × (Fragmentation: 0.3333 ^3)
Fragmentation: 0.3333 = Chunks: 2.0000 / Matches: 6.0000
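For reference, the arithmetic in that breakdown can be reproduced directly in Python (a sketch of the scoring formula only; the alignment step that produces the match and chunk counts is not shown):

precision = 6 / 7  # 6 of the 7 hypothesis words match the reference
recall = 6 / 6     # all 6 reference words are matched
fmean = 10 * precision * recall / (recall + 9 * precision)  # 0.9836
penalty = 0.5 * (2 / 6) ** 3  # 2 chunks over 6 matches -> 0.0185
print(fmean * (1 - penalty))  # 0.9654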
The score using this library is 0.59, versus the score of 0.96 from the wiki page.
I tried your code and our library gives 0.5119556177223324, which is different from what you provided (0.59), so I am wondering if there is another issue here. That said, it is indeed much lower than what the Wikipedia page says.
The Wikipedia page also has the exact-match example producing a score of 0.9977, while our library produces 1 in that case. (Under the formula above, even a verbatim match keeps the fragmentation penalty: one chunk over six matches gives 1 − 0.5 × (1/6)³ ≈ 0.9977.) I think the Wikipedia page only shows a simplified version of the metric, because it is also supposed to do other things:

METEOR also includes some other features not found in other metrics, such as synonymy matching, where instead of matching only on the exact word form, the metric also matches on synonyms. For example, the word "good" in the reference rendering as "well" in the translation counts as a match. The metric also includes a stemmer, which lemmatises words and matches on the lemmatised forms. The implementation of the metric is modular insofar as the algorithms that match words are implemented as modules, and new modules that implement different matching strategies may easily be added.

Still, it is difficult to explain that large of a mismatch.
The METEOR implementation that we use comes from https://www.cs.cmu.edu/~alavie/METEOR/ and I just double-checked with their jar file that we get the same results as they do.
>>> import nltk
>>> # requires the WordNet corpus: nltk.download('wordnet')
>>> # note: recent NLTK versions expect pre-tokenized input,
>>> # e.g. "the cat sat on the mat".split()
>>> nltk.translate.meteor_score.meteor_score(["the cat sat on the mat"], "the cat was sat on the mat")
0.9653916211293262
>>> nltk.translate.meteor_score.meteor_score(["the cat sat on the mat"], "the cat sat on the mat")
0.9976851851851852
nltk's meteor seems to match Wikipedia. I'll have to look more into this.
Hi Shikhar,
This looks like a version mismatch.
The latest version of Meteor is 1.5, released in 2014 (download here). Its scoring process is described in the paper, "Meteor Universal: Language Specific Translation Evaluation for Any Target Language".
Wikipedia cites the 2005 release and NLTK cites the 2007 release, both of which use different scoring functions.
For the most accurate scores (best correlation with human judgments), use the current Meteor version. See the README for examples.
Best,
Michael
Thanks @mjdenkowski!
@jlim13 This should solve your issue. We use version 1.5, as the author notes above, which is why the results differ from Wikipedia and NLTK.
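For anyone comparing numbers across versions, here is a minimal sketch (not Meteor's actual code) of the parameterized surface scoring introduced with the 2007 release. With alpha=0.9, beta=3, and gamma=0.5 it reduces to the 2005 formula shown on Wikipedia; later releases tune these parameters per language and additionally weight stem, synonym, and paraphrase matches, which is why scores are not comparable across versions.

def meteor_surface_score(precision, recall, chunks, matches,
                         alpha=0.9, beta=3.0, gamma=0.5):
    # Parameterized Fmean and fragmentation penalty (Lavie & Agarwal, 2007).
    # With the defaults above this is exactly the 2005 formula:
    # Fmean = 10PR / (R + 9P), Penalty = 0.5 * (chunks / matches) ** 3.
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (chunks / matches) ** beta
    return fmean * (1 - penalty)

# Wikipedia's example: 6 of 7 hypothesis words match, all 6 reference
# words are covered, and the matches fall into 2 contiguous chunks.
print(meteor_surface_score(6 / 7, 6 / 6, chunks=2, matches=6))  # ~0.9654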