google-research/xtreme

Possible error in structure prediction tasks

tonytan48 opened this issue · 2 comments

Hi xtreme team,
Thank you for your work on proposing the leaderboard. However, the evaluation metric reported for UDPOS seems inconsistent with the current code release. According to the POS accuracy results in Table 20 of the paper (https://arxiv.org/pdf/2003.11080.pdf), the evaluation metric for POS is accuracy, and the average result for XLM-R is 73.8. However, in the code, third_party/run_tag.py only imports F1-related measurements from seqeval, and the default evaluation for UDPOS is actually the F1 score.

I reproduced the UDPOS experiment and evaluated the test set with both metrics (sorry, I used the leaked test set on my local machine for quicker evaluation). With the default script and XLM-R large, I get an average F1 score of 74.2, which is in line with the reported 73.8; for English, the F1 score is 96.15. However, if I evaluate with accuracy, I get 96.7 for English and 78.23 on average. Hence I suspect the evaluation on the leaderboard and in the paper for UDPOS is actually the F1 score. Could you help address this issue? My reproduced experiment results are here: https://docs.google.com/spreadsheets/d/16Cv0IIdZGOyx6xUawcKScb38Cl3ofy0tHJSdWrt07LI/edit?usp=sharing

Thanks for flagging this, @tonytan48. @JunjieHu, could you take a look?

@sebastianruder Thanks for the prompt reply. I noticed that in the main table (Table 2) the metric for POS is F1, so maybe it's just a typo in Table 20. Out of curiosity, since the evaluation metric for POS has mostly been accuracy in previous work, is there some intuition behind using F1?