XiangLi1999/Diffusion-LM

Reproducing Table 5: Sentence Infilling - CIDEr / BLEU-4 metrics

yair-schiff opened this issue · 1 comment

Hi @XiangLi1999,

Thank you for open sourcing this work!

I am trying to reproduce the results in Table 5 (the infilling experiment). Specifically, where do the CIDEr and BLEU-4 scores come from, and how are they computed? I don't see those metrics reported on the aNLG leaderboard.

Any guidance you can provide here will be much appreciated.

Thanks!

Hi Yair,

Thanks for reaching out!

We compute these two scores because they are also reported in https://arxiv.org/pdf/2202.11705.pdf (which is our primary baseline for comparison).

We compute them with the evaluation scripts released along with the E2E benchmark: https://github.com/tuetschek/e2e-metrics
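For anyone preparing inputs for those scripts, here is a minimal sketch in Python. It is not the exact pipeline used for Table 5. It assumes the plain-text layout described in the e2e-metrics README: a reference file in which the references for one instance sit on consecutive lines and instances are separated by an empty line, plus a system-output file with one hypothesis per line. The function name `write_e2e_metric_inputs` and the default file names are hypothetical.

```python
from pathlib import Path
from typing import Sequence


def write_e2e_metric_inputs(
    references: Sequence[Sequence[str]],
    hypotheses: Sequence[str],
    ref_path: str = "refs.txt",   # hypothetical file name
    hyp_path: str = "hyps.txt",   # hypothetical file name
) -> None:
    """Write multi-reference and system-output text files.

    Assumed layout (per the e2e-metrics README, not confirmed in this thread):
    references for one instance on consecutive lines, instances separated by
    an empty line; one hypothesis per line, in the same instance order.
    """
    if len(references) != len(hypotheses):
        raise ValueError("need exactly one hypothesis per reference group")

    def clean(text: str) -> str:
        # Collapse whitespace so embedded newlines cannot split a sentence
        # or create a spurious empty separator line.
        return " ".join(text.split())

    ref_blocks = []
    for refs in references:
        cleaned = [clean(r) for r in refs]
        if not cleaned or any(not r for r in cleaned):
            raise ValueError("every instance needs non-empty references")
        ref_blocks.append("\n".join(cleaned))

    hyp_lines = [clean(h) for h in hypotheses]
    if any(not h for h in hyp_lines):
        raise ValueError("hypotheses must be non-empty")

    Path(ref_path).write_text("\n\n".join(ref_blocks) + "\n", encoding="utf-8")
    Path(hyp_path).write_text("\n".join(hyp_lines) + "\n", encoding="utf-8")
```

The two files can then be passed to the scoring script in that repository; check its README for the current invocation and for any tokenization it expects.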

Best,
Lisa