yxuansu/TaCL

Why is the reproduction result on English benchmark lower than that in the paper?

wpwpwpyo opened this issue · 2 comments

Why is the reproduction result on the English benchmarks lower than that in the paper? Especially for CoLA, STS-B, and QQP. Could you please share the parameter configuration of the .sh files for fine-tuning on GLUE and SQuAD?

Hi,

Thank you for your interest in our paper. For GLUE and SQuAD, we use the default scripts provided by the huggingface team here (https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification) and here (https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering). It should be noted that different machines or hardware configurations could lead to slightly different results. For your reference, the configuration of our machine for the English benchmarks is:
(1) Ubuntu 16.04.4 LTS; (2) NVIDIA-SMI 430.26; (3) Driver Version: 430.26; (4) CUDA Version: 10.2; (5) GeForce GTX 1080 (12GB);
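As a rough sketch, fine-tuning with the stock HuggingFace text-classification script could be launched as below. The checkpoint name (cambridgeltl/tacl-bert-base-uncased) and the hyperparameter values shown are illustrative assumptions based on the script's documented defaults, not our exact configuration:

```shell
# Hypothetical invocation of the unmodified HuggingFace run_glue.py example
# script on a single GLUE task (CoLA). The model checkpoint name and all
# hyperparameter values here are assumptions for illustration only.
python run_glue.py \
  --model_name_or_path cambridgeltl/tacl-bert-base-uncased \
  --task_name cola \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir ./tacl-cola-output/
```

The same pattern applies to the question-answering example script for SQuAD; only the script name, task-specific flags, and output directory change.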

Feel free to ask if you have further questions :)

SQuAD 1.1:
[screenshot: SQuAD 1.1 fine-tuning results]

SQuAD 2.0:
[screenshot: SQuAD 2.0 fine-tuning results]

We have rerun the experiments on SQuAD; please see the results above for reference.