The shown average score differs from the manually calculated mean of the individual scores
BaohaoLiao opened this issue · 5 comments
Hi,
I found a bug in the leaderboard scores. When we submit a model, we get the average score plus the individual scores for each language pair. Even ignoring differences beyond the second decimal place, they don't match. For my model (https://dynabench.org/models/250), the shown average score for "Leaderboard Datasets" is 27.59, but when I calculate the average myself it is 28.21. The same holds for "Non-Leaderboard Datasets": the shown value is 27.89, while my calculation gives 28.50. Could you check this for me? If it is true, then all of the shown scores on the leaderboard are misleading.
Or perhaps you calculate some kind of weighted average?
Thanks for reporting,
could you share a screenshot of the result page? I don't have access to it until you make your model public.
@gwenzek I have made https://dynabench.org/models/250 public; the model name is task2-615m (baohao).
Hi @gwenzek, I find that all of the models from my submissions have the same problem. Have you found the reason?
Hi @BaohaoLiao, the score you're seeing is, IIUC, the "corpus BLEU", i.e. BLEU computed over all the datasets concatenated together. This is not the score we intended to show in the leaderboard, so thanks for catching this.
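For clarity, here is a minimal sketch (using sacrebleu, with hypothetical placeholder hypotheses and references) contrasting corpus BLEU over the concatenated datasets with the unweighted mean of the per-dataset BLEU scores; the two aggregates generally differ, which explains the mismatch:

```python
import sacrebleu

# Hypothetical per-language-pair data; in practice these come from the model outputs.
datasets = {
    "en-de": (["the cat sat on the mat"], ["the cat sat on the mat"]),
    "en-fr": (["a dog ran in the park"], ["the dog was running in a park"]),
}

# Per-dataset BLEU, then a simple unweighted (macro) average: what the user computed.
per_dataset = {
    name: sacrebleu.corpus_bleu(hyps, [refs]).score
    for name, (hyps, refs) in datasets.items()
}
macro_avg = sum(per_dataset.values()) / len(per_dataset)

# Corpus BLEU over everything concatenated: what the leaderboard was showing.
all_hyps = [h for hyps, _ in datasets.values() for h in hyps]
all_refs = [r for _, refs in datasets.values() for r in refs]
corpus = sacrebleu.corpus_bleu(all_hyps, [all_refs]).score

print(per_dataset)
print(f"macro average: {macro_avg:.2f}, corpus BLEU: {corpus:.2f}")
```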
I'll fix that later today.
Hi @BaohaoLiao, I just fixed it :-) The scores now match what you computed.