google-research/xtreme

Evaluation results of PANX task

sakuraimai opened this issue · 3 comments

Hi, thank you for sharing this amazing dataset.
I have a question about the evaluation results of the PANX task.
I used the en subset to train XLM-R, and all available language subsets for prediction and evaluation.
Although the model gets good results during training and the prediction files ('test_{lang}_prediction.txt') are produced without errors, the evaluation results on standard output show 'f1 = 0.0' for every language, even English, which I used for training.
Do you have any idea how to resolve this?

Evaluation results during training:

07/11/2022 16:56:11 - INFO - __main__ -   ***** Evaluation result best in en *****
07/11/2022 16:56:11 - INFO - __main__ -     f1 = 0.8004747277296844
07/11/2022 16:56:11 - INFO - __main__ -     loss = 0.27271114725369616
07/11/2022 16:56:11 - INFO - __main__ -     precision = 0.7906495655771618
07/11/2022 16:56:11 - INFO - __main__ -     recall = 0.8105471511381309

Evaluation results in en, fr:

07/11/2022 17:00:21 - INFO - __main__ -   ***** Evaluation result  in en *****
07/11/2022 17:00:21 - INFO - __main__ -     f1 = 0.0
07/11/2022 17:00:21 - INFO - __main__ -     loss = 3.371383772871365
07/11/2022 17:00:21 - INFO - __main__ -     precision = 0.0
07/11/2022 17:00:21 - INFO - __main__ -     recall = 0.0
07/11/2022 17:02:17 - INFO - __main__ -   ***** Evaluation result  in fr *****
07/11/2022 17:02:17 - INFO - __main__ -     f1 = 0.0
07/11/2022 17:02:17 - INFO - __main__ -     loss = 3.6016474927957067
07/11/2022 17:02:17 - INFO - __main__ -     precision = 0.0
07/11/2022 17:02:17 - INFO - __main__ -     recall = 0.0

Same question here.

Same issue here. Any update on this?

As described here, the labels in the test data are automatically removed during preprocessing to prevent accidental cheating, which is why evaluating on the test data reports 0 scores for all languages.
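If you want to confirm this on your own copy, a quick check like the sketch below works. The file path and the CoNLL-style token<TAB>label layout are assumptions, so adjust them to your local data directory.

# Minimal sketch: inspect the gold-label column of a preprocessed PANX test
# file. If preprocessing stripped the labels, every row carries the same
# placeholder (or no label column at all), which makes any entity-level
# precision/recall/f1 come out as 0.0. The path below is an assumption.
from collections import Counter

labels = Counter()
with open("download/panx/test-en.tsv", encoding="utf-8") as f:  # assumed path
    for line in f:
        line = line.rstrip("\n")
        if line:
            fields = line.split("\t")
            labels[fields[1] if len(fields) > 1 else "<missing>"] += 1

print(labels)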
We recommend evaluating only on the validation data, and uploading your test predictions via the submission form once you would like to submit your model's results.
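For local scoring on the validation split, a minimal sketch along the lines below should work; it recomputes the same kind of entity-level metrics shown in the training log. The file paths and the two-column CoNLL-style format are assumptions based on a typical run, so adjust them to your own output.

# Minimal sketch: score dev predictions locally with seqeval (entity-level
# precision/recall/F1). File names and the token<TAB>label format are
# assumptions; adjust them to your own run.
from seqeval.metrics import f1_score, precision_score, recall_score

def read_label_column(path, column=1):
    """Collect the label column of a CoNLL-style file into per-sentence lists."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
            else:
                current.append(line.split("\t")[column])
    if current:
        sentences.append(current)
    return sentences

gold = read_label_column("download/panx/dev-en.tsv")        # assumed path
pred = read_label_column("outputs/dev_en_predictions.txt")  # assumed path

print("precision =", precision_score(gold, pred))
print("recall    =", recall_score(gold, pred))
print("f1        =", f1_score(gold, pred))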