Reproducing Table 5: Sentence Infilling - CIDEr / BLEU-4 metrics

Question

yair-schiff opened this issue a year ago · 1 comment

Hi @XiangLi1999,

Thank you for open sourcing this work!

I am trying to reproduce the results in Table 5 (the infilling experiment). Specifically, where do the CIDEr and BLEU-4 scores come from, and how are they computed? I don't see those metrics reported on the aNLG leaderboard.

Any guidance you can provide here will be much appreciated.

Thanks!

Answer 1 · 2023-03-30T02:59:55.000Z

Hi Yair,

Thanks for reaching out!

We compute these two scores because they are also reported in https://arxiv.org/pdf/2202.11705.pdf (which is our primary baseline for comparison).

We compute them using the evaluation scripts released along with the E2E benchmark: https://github.com/tuetschek/e2e-metrics

Best,
Lisa
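
For anyone following along, here is a minimal sketch of preparing inputs for the e2e-metrics scripts mentioned above. It assumes the plain-text layout described in that repository's README: a reference file in which the references for one instance sit on consecutive lines and instances are separated by a blank line, and a system-output file with one hypothesis per line. The helper name `format_e2e_inputs`, the file names, and the example sentences are hypothetical and are not taken from the paper's code.

```python
from pathlib import Path
from typing import Sequence, Tuple


def format_e2e_inputs(
    references: Sequence[Sequence[str]],
    hypotheses: Sequence[str],
) -> Tuple[str, str]:
    """Return (reference_text, system_text) in the assumed e2e-metrics layout.

    references[i] holds all gold infills for instance i;
    hypotheses[i] is the single generated infill for instance i.
    """
    if len(references) != len(hypotheses):
        raise ValueError("need exactly one hypothesis per reference group")
    if any(len(group) == 0 for group in references):
        raise ValueError("every instance needs at least one reference")

    # References for one instance on consecutive lines, blank line between instances.
    ref_text = "\n\n".join(
        "\n".join(ref.strip() for ref in group) for group in references
    ) + "\n"
    # One system output per line, in the same instance order.
    sys_text = "\n".join(hyp.strip() for hyp in hypotheses) + "\n"
    return ref_text, sys_text


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    refs = [
        ["She forgot her umbrella.", "It started to rain."],
        ["He missed the bus."],
    ]
    hyps = ["It began raining.", "He overslept."]

    ref_text, sys_text = format_e2e_inputs(refs, hyps)
    Path("refs.txt").write_text(ref_text, encoding="utf-8")
    Path("hyps.txt").write_text(sys_text, encoding="utf-8")

    # Then, from a checkout of e2e-metrics (assumed invocation, check its README):
    #   ./measure_scores.py refs.txt hyps.txt
```

The script should then print BLEU and CIDEr along with its other metrics. Whether the reported BLEU corresponds exactly to the BLEU-4 column in Table 5, and which tokenization or casing was applied beforehand, is not stated in the thread, so treat any numbers obtained this way as approximate until confirmed.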