Unable to replicate paper numbers on SQuAD using HF checkpoint

Question

yyw1999 opened this issue 3 years ago · 3 comments

yyw1999 commented 3 years ago

Hello! Thank you for releasing the paper and code for TaCL. I'm unable to replicate the numbers in the paper using the released HF checkpoint cambridgeltl/tacl-bert-base-uncased (to be exact, I did no pretraining on my side). I followed the exact package versions listed in requirements.txt, used the default HF QA scripts, and ran on 8 Tesla K80s.

Here are my results for SQuAD v1 and v2:

The EM/F1 scores (80.8/87.96 for v1 and 70.81/74.05 for v2) are lower than both the BERT and TaCL numbers in the paper, and I don't think this drop is due to a difference in hardware configurations. I wonder whether the HF repo has changed in the meantime. With the current version of the HF repo (obtained after git clone), maybe TaCL's performance is lower on the QA benchmark? Please advise. Thank you!

Answer 1 · 2022-01-03T14:38:41.000Z

Thank you for your interest in our work. The results you got are quite strange, and we are not sure why they differ from ours. We re-ran the experiment on our side and got the same results as listed in the paper. Our hardware configuration is:
(1) Ubuntu 16.04.4 LTS; (2) NVIDIA-SMI 430.26; (3) Driver Version: 430.26; (4) CUDA Version: 10.2; (5) GeForce GTX 1080 (12GB).

One potential reason for the discrepancy might be the number of GPUs. We use a single GPU for fine-tuning, which takes ~2 hours for SQuAD 1.1 and ~2.5 hours for SQuAD 2.0. We also did not use the fp16 approximation in our experiments. Please refer to the following figures of our results on SQuAD.

SQuAD 1.1:

SQuAD 2.0:

Could you try re-running the experiments for both TaCL and BERT on one GPU with the same scripts provided here? Let's see your updated results :)
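To illustrate why the GPU count can matter: with the HF Trainer the batch size is set per device, so the number of examples per optimizer step grows with the number of GPUs unless it is adjusted. A minimal sketch follows; the function name and the per-device batch size of 12 are hypothetical and not taken from this thread or the repo's scripts.

```python
def effective_batch_size(per_device_batch_size, num_gpus, grad_accum_steps=1):
    """Number of examples that contribute to each optimizer step."""
    return per_device_batch_size * num_gpus * grad_accum_steps


# Hypothetical values, for illustration only.
single_gpu = effective_batch_size(per_device_batch_size=12, num_gpus=1)
eight_gpus = effective_batch_size(per_device_batch_size=12, num_gpus=8)
print(single_gpu, eight_gpus)
```

If the two runs differ in this number, the learning rate and number of update steps are no longer directly comparable, which could plausibly shift EM/F1.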

Answer 2 · 2022-01-04T01:59:04.000Z

Thank you for the quick response! Rerunning the SQuAD V1 experiment on 1 GPU results in slightly worse numbers unfortunately (78.87 EM / 86.45 F1). I didn't use fp16 training at all. Did you rerun the experiments on your end with a fresh copy of transformers? Maybe the version change could account for the discrepancy. Thanks again!

Answer 3 · 2022-01-04T17:12:58.000Z

Hi,

I just reran the experiments after cloning the latest version of huggingface and using the same scripts here. I could not replicate your results; my results are still the same as reported in the paper. Please refer to the screenshots below and see the timestamps as a reference.

SQuAD 1.1:

SQuAD 2.0:

Maybe you could try installing the latest version of huggingface (do not use the requirements.txt provided in the repo) and see what results you get? (Please follow all the same steps as described here.)
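When comparing runs across environments, it may help to record the exact installed versions on both sides first. A small sketch using only the Python standard library (3.8+); the helper name and the list of packages are assumptions about what is relevant here, not something specified in the repo.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumed list of packages worth comparing between the two setups.
for name in ("transformers", "torch", "datasets"):
    print(name, installed_version(name) or "not installed")
```

Posting this output alongside the EM/F1 numbers would make it easier to tell whether a version change accounts for the gap.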

I hope my response helps :)