google-research/xtreme

Unable to reproduce the reported numbers for XNLI and PAWS-X with mBERT.

somani-iitb opened this issue · 4 comments

Hi,

We are trying to reproduce the numbers reported for mBERT on the XNLI and PAWS-X tasks in Table 12 and Table 15 of the XTREME paper (https://arxiv.org/pdf/2003.11080.pdf).

The hyperparameter settings we used are the same as those reported in the paper:

07/09/2021 11:46:56 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='/home/ec2-user/xtreme/download//xnli', device=device(type='cuda'), do_eval=True, do_lower_case=False, do_predict=True, do_predict_dev=False, do_train=True, eval_all_checkpoints=True, eval_test_set=True, evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=4, init_checkpoint=None, learning_rate=2e-05, local_rank=-1, log_file='train', logging_steps=50, max_grad_norm=1.0, max_seq_length=128, max_steps=-1, model_name_or_path='bert-base-multilingual-cased', model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=2.0, output_dir='/home/ec2-user/xtreme/outputs-temp//xnli/bert-base-multilingual-cased-LR2e-5-epoch2-MaxLen128//', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, predict_languages='ar,bg,de,el,en,es,fr,hi,ru,sw,th,tr,ur,vi,zh', save_only_best_checkpoint=True, save_steps=100, seed=42, server_ip='', server_port='', task_name='xnli', test_split='test', tokenizer_name='', train_language='en', train_split='train', warmup_steps=0, weight_decay=0.0)

 ======= Predict using the model from /home/ec2-user/xtreme/outputs-temp//xnli/bert-base-multilingual-cased-LR2e-5-epoch2-MaxLen128/checkpoint-best for test:
ar=0.3854291417165669
bg=0.39500998003992016
de=0.39401197604790417
el=0.38582834331337323
en=0.36966067864271457
es=0.38862275449101796
fr=0.3852295409181637
hi=0.4688622754491018
ru=0.3972055888223553
sw=0.37684630738522956
th=0.5409181636726547
tr=0.418562874251497
ur=0.47345309381237527
vi=0.40738522954091816
zh=0.4405189620758483
total=0.4151696606786427

System configuration:

Linux machine with 1 GPU, x86_64 architecture, 250 GB storage, 61 GB RAM.

On XNLI, our average accuracy (41.5%) is off by ~37 points compared to the 79.2 reported in the paper. We saw a similar discrepancy in the PAWS-X results. Further, with the xlm-roberta model we get a test accuracy of 100% while the validation numbers are in the 50-60% range.

Can you please suggest what could be the reason for such a discrepancy?

I also ran into a similar issue. The XNLI number for de on the test set is 36.28% with the XLM-R encoder. I'm still not sure which step I got wrong.

niwic commented

I have also had similar issues and wasted a lot of time on this, until I found this closed issue: #24. Apparently the real labels are removed from the test data by the preprocessing script. I guess this is to prevent malicious submissions to the benchmark. It's easy enough to get your hands on the real labels, though (e.g. by modifying the script), so I don't really understand the rationale. This is probably why you are getting results close to 1/3, at least on XNLI: there are three labels, and the test data only has one fake label. Hope this helps.
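As a sanity check, here is a minimal sketch (not the actual preprocessing code, just an illustration under the assumption that every gold test label is replaced by a single placeholder) of why scores cluster around 1/3 on a 3-way task like XNLI:

```python
import random

# XNLI is a 3-way classification task.
LABELS = ["entailment", "neutral", "contradiction"]

# Assumption for illustration: the preprocessing script replaces every gold
# test label with one fixed placeholder, so the "gold" file no longer
# reflects the true answers.
n_examples = 5000
masked_gold = ["entailment"] * n_examples  # all labels replaced by one dummy value

# A trained model spreads its predictions across all three classes,
# so it matches the single dummy label only about 1/3 of the time.
random.seed(0)
predictions = [random.choice(LABELS) for _ in range(n_examples)]

accuracy = sum(p == g for p, g in zip(predictions, masked_gold)) / n_examples
print(f"accuracy against masked labels: {accuracy:.3f}")  # hovers near 0.333
```

This also explains the reported per-language scores of roughly 0.37-0.54: they measure agreement with the placeholder label, not true accuracy.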

Hi, @niwic is correct. We mention in this section of the README that we remove the test labels. Closing this for now.

niwic commented

Ah, my bad, I must have missed that in the README somehow.