The shown average score differs from the manually calculated mean of the individual scores
BaohaoLiao opened this issue · 5 comments
Hi,
I found a bug in the leaderboard scores. When we submit a model, we get the average score plus the individual scores for each language pair. Even ignoring differences beyond the second decimal place, they don't match. For my model (https://dynabench.org/models/250), the shown average score for "Leaderboard Datasets" is 27.59, but when I calculate the average myself it is 28.21. The same holds for "Non-Leaderboard Datasets": the shown value is 27.89, while my calculation gives 28.50. Could you check this for me? If it is true, then all of the shown scores on the leaderboard are misleading.
Or perhaps you calculate some kind of weighted average?
Thanks for reporting,
could you share a screenshot of the result page? I don't have access to it until you make your model public.
@gwenzek I have made https://dynabench.org/models/250 public; the model name is task2-615m (baohao).
Hi @gwenzek, I find that all of the models from my submissions have the same problem. Have you found the reason?
Hi @BaohaoLiao, the score you're seeing is, IIUC, the "corpus BLEU", i.e. BLEU computed over all the datasets concatenated together. This is not the score we intended to show in the leaderboard, so thanks for catching this.
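For clarity, here is a minimal sketch (using sacrebleu, with hypothetical placeholder hypotheses and references) contrasting corpus BLEU over the concatenated datasets with the unweighted mean of the per-dataset BLEU scores; the two aggregates generally differ, which explains the mismatch:

```python
import sacrebleu

# Hypothetical per-language-pair data; in practice these come from the model outputs.
datasets = {
    "en-de": (["the cat sat on the mat"], ["the cat sat on the mat"]),
    "en-fr": (["a dog ran in the park"], ["the dog was running in a park"]),
}

# Per-dataset BLEU, then a simple unweighted (macro) average: what the user computed.
per_dataset = {
    name: sacrebleu.corpus_bleu(hyps, [refs]).score
    for name, (hyps, refs) in datasets.items()
}
macro_avg = sum(per_dataset.values()) / len(per_dataset)

# Corpus BLEU over everything concatenated: what the leaderboard was showing.
all_hyps = [h for hyps, _ in datasets.values() for h in hyps]
all_refs = [r for _, refs in datasets.values() for r in refs]
corpus = sacrebleu.corpus_bleu(all_hyps, [all_refs]).score

print(per_dataset)
print(f"macro average: {macro_avg:.2f}, corpus BLEU: {corpus:.2f}")
```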
I'll fix that later today.
Hi @BaohaoLiao, I just fixed it :-) The scores now match what you computed.