Reproducing Table 5: Sentence Infilling - CIDEr / BLEU-4 metrics
yair-schiff opened this issue · 1 comment
yair-schiff commented
Hi @XiangLi1999,
Thank you for open sourcing this work!
I am trying to reproduce the results from Table 5 (the infilling experiment). Specifically, where do the CIDEr and BLEU-4 scores come from, and how are they computed? I don't see those metrics reported on the aNLG leaderboard.
Any guidance you can provide here will be much appreciated.
Thanks!
XiangLi1999 commented
Hi Yair,
Thanks for reaching out!
We compute these two scores because they are also reported in https://arxiv.org/pdf/2202.11705.pdf (which is our primary baseline for comparison).
We compute them with the evaluation scripts released along with the E2E benchmark: https://github.com/tuetschek/e2e-metrics
Best,
Lisa
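
For anyone following along, here is a rough sketch of how inputs for those evaluation scripts might be prepared. It is not the authors' code. It assumes the plain-text layout I understand the e2e-metrics README to describe: one system output per line, and a reference file in which each instance's references sit on consecutive lines with a blank line between instances. The helper name `write_e2e_metrics_inputs` and the file paths are hypothetical, and the `./measure_scores.py` entry point should be checked against the repository before use.

```python
from pathlib import Path


def write_e2e_metrics_inputs(references, hypotheses, ref_path, sys_path):
    """Write reference groups and system outputs as plain-text files.

    Assumed layout (verify against the e2e-metrics README):
      - system file: one generated sentence per line
      - reference file: the references for one instance on consecutive
        lines, with a blank line separating instances
    """
    if len(references) != len(hypotheses):
        raise ValueError("need exactly one reference group per hypothesis")
    if any(len(group) == 0 for group in references):
        raise ValueError("every instance needs at least one reference")

    def one_line(text):
        # Collapse whitespace so each text occupies exactly one line.
        return " ".join(text.split())

    ref_blocks = ["\n".join(one_line(r) for r in group) for group in references]
    Path(ref_path).write_text("\n\n".join(ref_blocks) + "\n", encoding="utf-8")
    Path(sys_path).write_text(
        "\n".join(one_line(h) for h in hypotheses) + "\n", encoding="utf-8"
    )

    # Command to run from a checkout of the e2e-metrics repository
    # (assumed entry point; this function does not execute it).
    return ["./measure_scores.py", str(ref_path), str(sys_path)]
```

The returned list can be passed to `subprocess.run` from inside a checkout of the repository. Whether the paper used exactly this file layout, tokenization, or set of references for Table 5 is not stated in this thread.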