Aleph-Alpha/magma

How did you calculate the BLEU score?

Closed this issue · 1 comment

Hi, thanks for the awesome project.
I noticed that the reported zero-shot BLEU@4 and CIDEr scores in Table 1 are ~10 and ~50 on the MS COCO dataset (after fine-tuning the scores increase to ~31 and 90+). These fall far behind traditional baselines like AoA and CLIP-ViL, which usually achieve ~40 BLEU-4 and 120+ CIDEr.
I am wondering whether the difference is due to the evaluation setup: did you use the evaluation code from coco-caption, or did you calculate the scores yourself?

Hi,

We evaluated the scores ourselves, using https://github.com/Maluuba/nlg-eval to calculate the BLEU and CIDEr metrics.
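
For reference, a minimal sketch of how such an evaluation could look with nlg-eval, based on the library's documented `NLGEval` API. The captions below are made-up illustration data, and this is not necessarily the exact script used for the paper's numbers:

```python
from nlgeval import NLGEval

# Hypothetical example data: one generated caption per image and two
# human reference captions per image (MS COCO provides five per image).
hypotheses = [
    "a dog playing with a frisbee in the park",
    "a plate of food on a wooden table",
]
# ref_list is a list of reference "streams": references[j][i] is the
# j-th reference caption for the i-th hypothesis.
references = [
    ["a dog catches a frisbee on the grass", "a plate with pasta and vegetables"],
    ["a brown dog jumping for a frisbee", "food served on a wooden table"],
]

# Skip the embedding-based metrics to keep the run lightweight;
# BLEU-1..4, METEOR, ROUGE-L and CIDEr are still computed.
nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)
scores = nlgeval.compute_metrics(ref_list=references, hyp_list=hypotheses)
print(scores["Bleu_4"], scores["CIDEr"])
```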

Cheers,

Constantin