ratishsp/data2text-macro-plan-py

Best-worst scaling

WilliamsToTo opened this issue · 2 comments

Hi,
How exactly did you apply best-worst scaling? And how did you obtain Table 5 in the paper from the binary comparisons (Macro against RBF-2020, ED+CC, Gold, and Templ)?

Why do you think best-worst scaling is more reliable? I've never seen this method in NLP papers before.

Hi Tao,
Sorry for the late reply.
Best-worst scaling in combination with binary choice is generally considered more reliable than a Likert scale. One main reason is that a rater never rates a system in isolation, only relative to another system, so variance across raters is reduced. The approach is fairly prevalent in NLP. https://www.semanticscholar.org/paper/Best-Worst-Scaling%3A-Theory%2C-Methods-and-Louviere-Flynn/5f4398f0df93ddd548f244b75a49b97f51abd161?sort=pub-date contains a few NLP papers as well.

Let us take a simple example of best-worst scaling for three systems A, B and C. Assume each pair of systems is compared on 10 examples, so each system takes part in 20 comparisons. Assume the pairwise preference counts are 8 and 2 for A vs. B, 5 and 5 for B vs. C, and 6 and 4 for A vs. C. A system gets +1 credit each time it is chosen as best and -1 credit each time it is chosen as worst. The net scores per pair are therefore +6 and -6 for A and B (8 - 2 and 2 - 8), 0 and 0 for B and C, and +2 and -2 for A and C. Summing the net scores for each system gives +8 for A, -6 for B and -2 for C. In percentage terms (i.e., divided by the 20 comparisons per system), the scores are 40, -30 and -10. Hope it is clear.
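
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation in the example above. It is only an illustration, not code from this repository: the function name `best_worst_scores` and the layout of the `comparisons` dictionary are assumptions made for this example.

```python
from collections import defaultdict


def best_worst_scores(pairwise_wins):
    """Turn pairwise preference counts into best-worst percentages.

    pairwise_wins maps a pair (x, y) to (times x was preferred,
    times y was preferred). This input layout is hypothetical.
    """
    net = defaultdict(int)        # +1 per "best" vote, -1 per "worst" vote
    judgments = defaultdict(int)  # number of comparisons a system took part in
    for (x, y), (wins_x, wins_y) in pairwise_wins.items():
        net[x] += wins_x - wins_y
        net[y] += wins_y - wins_x
        judgments[x] += wins_x + wins_y
        judgments[y] += wins_x + wins_y
    # Normalise each net score by the number of comparisons for that system.
    return {system: 100.0 * net[system] / judgments[system] for system in net}


# Counts from the three-system example: 10 examples per pair.
comparisons = {
    ("A", "B"): (8, 2),
    ("B", "C"): (5, 5),
    ("A", "C"): (6, 4),
}
scores = best_worst_scores(comparisons)
print(scores)  # {'A': 40.0, 'B': -30.0, 'C': -10.0}
```

With this normalisation a score ranges from -100 (always chosen as worst) to +100 (always chosen as best).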

Thank you for your clear explanation. It helps a lot.