Question regarding release of best current model
atreyasha opened this issue · 5 comments
Hello Google Research Team,
Thank you for this awesome repo and for the baseline code. As part of a downstream task in machine translation, I require a well-performing model on the PAWS-X dataset. I have been attempting to fine-tune some models using the code here, but my test accuracies on PAWS-X are still in the mid 50s.
I was wondering when the current best performing XLM-R model would be released for downstream usage?
Thank you.
Hi,
Have you been fine-tuning XLM-R, or which model have you been fine-tuning, that yields such comparatively low performance on PAWS-X?
We are not currently planning to release fine-tuned models, as this would mean we would need to release one model for each task. Even though fine-tuned models may be helpful to some extent for certain downstream tasks (see, for instance, this recent paper), we believe that the original pre-trained models should generally be used as the starting point for further experiments.
Hi @sebastianruder, thank you for the quick response.
Fine-tuning instances
As I just started fine-tuning recently, I have only managed to test a few instances. Essentially, I set up this repo (as per the readme) and ran the following commands with the corresponding best (train/dev/test) results:
bash scripts/train.sh bert-base-multilingual-cased pawsx
eval_results
acc_best = 0.9374687343671836
num_best = 1999
correct_best = 1874
acc_MaxLen128 = 0.9349674837418709
num_MaxLen128 = 1999
correct_MaxLen128 = 1869
...
eval_test_results
...
======= Predict using the model from checkpoint-400:
de=0.5757878939469735
en=0.5462731365682841
es=0.6068034017008505
fr=0.5927963981990996
ja=0.688344172086043
ko=0.742871435717859
zh=0.5957978989494748
total=0.6212391910240834
...
bash scripts/train.sh xlm-roberta-base pawsx
eval_results
acc_best = 0.9374687343671836
num_best = 1999
correct_best = 1874
acc_MaxLen128 = 0.9309654827413707
num_MaxLen128 = 1999
correct_MaxLen128 = 1861
...
eval_test_results
...
======= Predict using the model from checkpoint-2800:
de=0.575287643821911
en=0.545272636318159
es=0.5732866433216608
fr=0.5737868934467234
ja=0.6588294147073537
ko=0.7218609304652326
zh=0.6068034017008505
total=0.6078753662545558
...
bash scripts/train.sh xlm-roberta-large pawsx
I cut this training short because the training and dev losses were not changing, and the test accuracy stayed fixed at 1.0, which seems off. Not sure why this was so.
Do you think these results were due to an "unlucky" initial configuration, and that I should re-run these commands a few more times? I believe there is still some stochasticity here (e.g. the randomly initialized classification head, dropout, and data shuffling) despite the encoder being loaded from a fixed pre-trained checkpoint.
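To make the re-running idea concrete, here is a minimal sketch using the Hugging Face transformers API (not the repo's exact training script, and the fine-tuning loop itself is elided) of how I would vary the seed across runs:

```python
from transformers import XLMRobertaForSequenceClassification, set_seed

for seed in (1, 2, 3):
    set_seed(seed)  # seeds the Python, NumPy and PyTorch RNGs in one call
    # The encoder weights come from the fixed pre-trained checkpoint, but the
    # new classification head is randomly initialized under this seed, and
    # dropout and data shuffling also depend on it.
    model = XLMRobertaForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=2
    )
    # ... fine-tune on PAWS-X and record the dev accuracy for this seed ...
```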
Fine-tuning models
We are not currently planning to release fine-tuned models as this would mean we would need to release one model for each task.
I understand. Hmm, would it be possible to share the hyperparameters and training arguments that were used to fine-tune the best-performing model so far for PAWS-X (e.g. those listed on the leaderboard)? I could then try to reproduce the best model with those values.
Just to add more information for this issue:
I ran bash scripts/train.sh xlm-roberta-large pawsx for XLM-R (large) again, but this time reduced the default learning rate from LR=2e-5 to LR=1e-6 in train_pawsx.sh (in contrast to the learning rate of 3e-5 mentioned in the XTREME paper, Appendix B).
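For completeness, one way to apply that change from the shell (this assumes LR is assigned on its own line as LR=2e-5 in scripts/train_pawsx.sh, which is how I read the script; double-check the file before running):
sed -i 's/^LR=2e-5$/LR=1e-6/' scripts/train_pawsx.sh
bash scripts/train.sh xlm-roberta-large pawsx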
Based on this, here is a snippet of the current results in eval_test_results.
======= Predict using the model from checkpoint-200:
de=1.0
en=1.0
es=1.0
fr=1.0
ja=1.0
ko=1.0
zh=1.0
total=1.0
======= Predict using the model from checkpoint-400:
de=1.0
en=1.0
es=1.0
fr=1.0
ja=1.0
ko=1.0
zh=1.0
total=1.0
======= Predict using the model from checkpoint-600:
de=1.0
en=1.0
es=1.0
fr=1.0
ja=1.0
ko=1.0
zh=1.0
total=1.0
======= Predict using the model from checkpoint-800:
de=0.870935467733867
en=0.6323161580790395
es=0.7293646823411706
fr=0.840920460230115
ja=0.9354677338669335
ko=0.9459729864932466
zh=0.8099049524762382
total=0.8235546344600871
======= Predict using the model from checkpoint-1000:
de=0.6728364182091046
en=0.39419709854927465
es=0.542271135567784
fr=0.6188094047023511
ja=0.8344172086043021
ko=0.8659329664832416
zh=0.6953476738369184
total=0.6605445579932824
Observations and questions
At checkpoint-800, the highest test accuracy appears to have been achieved, and it is in the ballpark of the current highest score on the XTREME leaderboard for sentence classification. However, the model at checkpoint-800 would not have been considered the best model, since the dev accuracies keep rising after that.
Is there a reason why the first few checkpoints (above) all show an accuracy of 1.0? Strangely, in all the checkpoints where the test accuracy was 1.0, the dev accuracies were fixed at a value of 0.568784392196098. I am not sure if this is some error or if the test sets were classified completely correctly.
Hi @atreyasha, the test score is around 0.5 because we remove the true class label for each example in the test set and use a fake placeholder ("0") as the label for all test examples. We want to encourage users to generate their own predictions and submit them to our XTREME benchmark, so removing the labels is intentional in order to avoid trivial submissions. At the beginning of training, the model is not yet well trained and predicts all zeros for all examples; that is why you observe a test accuracy of 1.0 at the beginning.
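To illustrate with a toy example (made-up predictions, not our actual evaluation code): since every test label is the placeholder "0", a model that outputs all zeros scores 1.0 against the fake labels, while a trained model that predicts both classes scores roughly the fraction of zeros among its predictions.

```python
fake_labels = [0] * 8                      # placeholder label "0" for every test example

untrained_preds = [0] * 8                  # early model collapsing to a single class
trained_preds = [0, 1, 1, 0, 1, 0, 0, 1]   # trained model predicting both classes

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

print(accuracy(untrained_preds, fake_labels))  # 1.0 -> the spurious perfect score
print(accuracy(trained_preds, fake_labels))    # 0.5 -> roughly the share of 0-predictions
```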
If you want to pick the best system for PAWS-X, you may refer to the dev set accuracy, which is computed against the true labels; we also find that scores on the dev and test sets are well correlated.
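As a sketch of that selection step (the dictionary is illustrative; in practice you would collect the per-checkpoint dev accuracies that the training script writes out):

```python
# Hypothetical per-checkpoint dev accuracies; only 0.568784392196098 is a
# value actually reported in this thread, the others are made up.
dev_acc = {
    "checkpoint-200": 0.568784392196098,
    "checkpoint-800": 0.89,
    "checkpoint-1000": 0.91,
}
best = max(dev_acc, key=dev_acc.get)  # pick the checkpoint with the highest dev accuracy
print(best, dev_acc[best])
```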