yxuansu/TaCL

unable to reproduce

1024er opened this issue · 7 comments

Hi,
Thank you for sharing code and data.
I tried to reproduce your English pre-training experiment, but the results are not as expected. Could you please provide some support? Thank you!
[screenshot: GLUE dev results]
In glue-dev evaluations:
1. The original BERT achieves the best results.
2. Our trained TaCL underperforms the released checkpoint by a large margin.

Pre-training environment: 8x 32GB V100 / PyTorch 1.6, CUDA 10.2 / Wikipedia data downloaded and processed with the provided scripts
Fine-tuning on GLUE environment: 1x 2080 Ti / PyTorch 1.6, CUDA 10.2 / Hugging Face default settings

Best

Hi,

Thank you for your interest in our paper!

(1) The results seem quite different from ours. When you reproduced the experiments, did you use FP16 for training? For the BERT baseline, did you use the bert-base-uncased or the bert-base-cased model? We use bert-base-uncased to keep it consistent with our TaCL.
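
Just to make the comparison concrete, this is roughly how the two models should be loaded (a minimal sketch; I am assuming the released checkpoint's hub id here, please double-check it against our README):

```python
# Minimal sketch: load the uncased BERT baseline and the released TaCL checkpoint.
# The TaCL hub id below is an assumption -- please verify it against the README.
from transformers import AutoTokenizer, AutoModel

# Baseline: uncased BERT, to stay consistent with TaCL.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")

# Released TaCL checkpoint (assumed model id; adjust if it differs).
tacl_tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/tacl-bert-base-uncased")
tacl_model = AutoModel.from_pretrained("cambridgeltl/tacl-bert-base-uncased")
```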

(2) The performance difference might be caused by the hardware configuration. For your reference, our pre-training was conducted on an Amazon AWS p3dn.24xlarge instance (https://aws.amazon.com/cn/ec2/instance-types/). For fine-tuning, our machine's specification is as follows:
(1) Ubuntu 16.04.4 LTS; (2) NVIDIA-SMI 430.26; (3) Driver Version: 430.26; (4) CUDA Version: 10.2; (5) GeForce GTX 1080 (12GB).

(3) Did you try to validate the results on SQuAD? We reran the SQuAD experiments, and the results are perfectly reproduced. I wonder whether you can get the same results as shown in the figures below:

SQuAD 1.1: [results screenshot]

SQuAD 2.0: [results screenshot]

(4) Another thing I would like to mention is that, due to computational limitations, we only used the first 20 million lines of the pre-training corpus (~2 GB of raw text) when we performed pre-training. This could also be part of the reason, I guess?
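
In case it helps, keeping only the first 20 million lines of the processed corpus is as simple as the sketch below (the file names are placeholders, not the exact names produced by our scripts):

```python
# Minimal sketch: truncate the processed Wikipedia corpus to its first 20 million
# lines (~2 GB of raw text). The file names here are placeholders.
from itertools import islice

with open("wikipedia_corpus.txt", "r", encoding="utf-8") as src, \
     open("wikipedia_corpus_first20m.txt", "w", encoding="utf-8") as dst:
    for line in islice(src, 20_000_000):
        dst.write(line)
```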

Looking forward to your reply :)

Since you mentioned that "due to computational limitations, we only used the first 20 million lines of the pre-training corpus (~2 GB of raw text)", did you use the same hyperparameters as listed in the scripts?
[screenshot: pre-training hyperparameters from the provided script]

Yes, the hyperparameters are correct.

[screenshot: training curves]

I used the default hyperparameters and the complete enwiki data, and the training curves look a bit strange. I'm not sure what went wrong.

Looking forward to your reply :)

I tried the first 20 million lines of the pre-training corpus (~2 GB of raw text) and still cannot reproduce the results. :(

Hi,

Did you use exactly the same environment as described in the requirements? We have found that different transformers or PyTorch versions can cause some performance discrepancies.
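
A quick way to check is to print the installed versions and compare them against the pinned requirements, for example:

```python
# Print the versions that matter for reproducing the results, so they can be
# compared against the versions pinned in the repository's requirements.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
```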

The learning curves in your screenshot are not correct. At the start of training, the NSP accuracy should be above 90% and the MLM accuracy should be around 60%. Did you double-check your environment setup?
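
If it helps, here is a rough sanity check for the MLM side (a toy, single-sentence sketch, not part of our released code): since pre-training starts from bert-base-uncased, plain masked-token prediction with that checkpoint should already land roughly in that accuracy range on Wikipedia-like text.

```python
# Toy sanity check (not part of the released code): measure masked-token accuracy
# of bert-base-uncased on one sentence. On real Wikipedia batches this should be
# roughly around the 60% level expected at the start of training.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

text = "The quick brown fox jumps over the lazy dog near the old river bank."
enc = tokenizer(text, return_tensors="pt")
labels = enc["input_ids"].clone()

# Mask roughly 15% of the non-special tokens (here: every 7th one, deterministically).
special = tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)
positions = [i for i, s in enumerate(special) if s == 0][::7]

masked = labels.clone()
masked[0, positions] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=masked, attention_mask=enc["attention_mask"]).logits

# Accuracy on the masked positions only.
preds = logits.argmax(dim=-1)
hits = (preds[0, positions] == labels[0, positions]).float()
print(f"masked tokens: {len(positions)}, MLM accuracy: {hits.mean().item():.2%}")
```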