About parameters
I used the parameters shown in your paper:
pre-trained BERT: Chinese with Whole Word Masking (WWM)
maximum length: 128
batch size: 40 (4 × 10 GPU cards)
initial learning rate: 0.00005
warm-up steps: 1000
optimizer: AdamW
scheduler: WarmupLinearSchedule
epochs: 5 (num_train_steps about 80800)
Because my device is a single GTX 2080 Ti, I set train_batch_size = 6000 and num_train_steps to about 80800. The experiment runs for just 5 epochs, and the batch size is just about 40 examples.
But I cannot reach your accuracy; the picture below shows the accuracy from my experiment.
That's a difference of nearly 3~6%.
Here is my training config JSON:
{ "train_txt_db": "official_train.db", "val_txt_db": "official_dev.db", "test_txt_db": "official_test.db", "out_txt_db": "official_out.db", "sim_txt_db": "official_sim.db", "ran_txt_db": "official_ran.db", "pretrained_model_name_or_path": "hfl/chinese-bert-wwm-ext", "model": "chengyubert-dual", "dataset_cls": "chengyu-masked", "eval_dataset_cls": "chengyu-masked-eval", "output_dir": "storage", "candidates": "combined", "len_idiom_vocab": 3848, "max_txt_len": 128, "train_batch_size": 6000, "val_batch_size": 20000, "gradient_accumulation_steps": 1, "learning_rate": 0.00005, "valid_steps": 100, "num_train_steps": 80800, "optim": "adamw", "betas": [ 0.9, 0.98 ], "adam_epsilon": 1e-08, "dropout": 0.1, "weight_decay": 0.01, "grad_norm": 1.0, "warmup_steps": 1000, "seed": 77, "fp16": true, "n_workers": 0, "pin_mem": true, "location_only": false }
What's wrong with the parameters?
I am not sure if this is due to gradient gathering across multiple GPUs.
My suggestion for a single card is to feed the GPU with full memory and set gradient_accumulation_steps=5.
Let's see if that works.
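In case it helps, this is the accumulation pattern I have in mind, written as a bare PyTorch loop. It is only a sketch: `loader` stands in for your DataLoader, `compute_loss` is a made-up helper, and this is not copied from the repo's trainer.

```python
# Illustrative gradient-accumulation loop (not the repo's actual trainer).
# `model`, `optimizer`, `scheduler` and `loader` are assumed to exist already.
import torch

accumulation_steps = 5  # gradient_accumulation_steps

optimizer.zero_grad()
for i, batch in enumerate(loader):
    loss = compute_loss(model, batch)        # hypothetical helper returning a scalar loss
    (loss / accumulation_steps).backward()   # scale so 5 micro-batches behave like one big batch
    if (i + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # "grad_norm": 1.0
        optimizer.step()                     # one parameter update per 5 micro-batches
        scheduler.step()
        optimizer.zero_grad()
```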
What do you mean by "feed the GPU with full memory"?
Does it mean that 'train_batch_size' can be set to a larger value than the one I am using now?
Could you please provide the JSON parameter file you used for training? I would be very grateful.
I have been updating this repo for a while; the original JSON config has not changed much.
But I do recommend the following settings for a single GPU:
"train_batch_size": 11000,
"gradient_accumulation_steps": 5,
"num_train_steps": 18000,
I'm sorry about the reproduction issues you ran into; I hope we can find the cause through a few trials.
I appreciate your kind help. I'll try the experiment again. ♥
With num_train_steps = 18000 and train_batch_size = 11000, the training is only about 2 epochs, not 5. Does it matter?
If we use gradient_accumulation_steps, each step will use five times as many examples.
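To make the arithmetic concrete (assuming train_batch_size counts tokens, which is how your 6000-token setting became a batch of roughly 40 examples):

```python
# Back-of-the-envelope token count per recipe, assuming train_batch_size is measured in tokens.
old_tokens = 6000 * 1 * 80800    # your run: 6000 tokens/step, no accumulation  -> ~485M tokens
new_tokens = 11000 * 5 * 18000   # suggested: 11000 tokens x 5 accumulated micro-batches -> ~990M tokens
print(new_tokens / old_tokens)   # ~2.0: once accumulation is counted, the suggested recipe
                                 # sees roughly twice as many tokens, not fewer.
```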
OK, I see. Thanks, I will try again. Hopefully it will achieve the desired score.
Yes, I also feel nervous about reproduction, although I have run my code several times myself.
I hope we can reproduce without difficulty and I will update the parameters for the benefit of all.
Glad that works!
I think a larger batch size converges better; this is mainly because the accumulated stochastic gradient is closer to the full-batch gradient when fitting the dataset.
Well, that's amazing! I've learned something new.
What's more, the two-stage training that appears in your code isn't described in detail in your paper.
What is the two-stage training about? Will it get a higher score?
For two-stage training, you can directly try Stage-Two. If you are interested in the paper, here is the link: Two-Stage.
Glad to hear that!
I have another question.
You used 'valid_steps' to pick the best-scoring checkpoint, but in some cases that best score may be coincidental.
From my observations, the validation-set accuracy during training was mostly stable at around 79.
Would you consider using K-fold cross-validation to get a more convincing score?
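For what it's worth, by cross-validation I mean something like the scikit-learn style split below; this is purely illustrative and not from your repo, and `train_one_fold` is a made-up helper.

```python
# Illustrative 5-fold split (hypothetical, not part of this repo).
import numpy as np
from sklearn.model_selection import KFold

example_indices = np.arange(1000)   # placeholder for the training-example indices
scores = []
for fold, (train_idx, dev_idx) in enumerate(
    KFold(n_splits=5, shuffle=True, random_state=77).split(example_indices)
):
    print(f"fold {fold}: {len(train_idx)} train / {len(dev_idx)} dev examples")
    # scores.append(train_one_fold(train_idx, dev_idx))  # train and evaluate per fold
# the reported score would then be np.mean(scores) across the 5 folds
```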
This dataset is large, so cross-validation would need much more computation. If the goal is to get better performance, it is one way to do it.
In most cases, if the results are enough to show that the method works, we follow the train-dev-test split used in most large-scale QA tasks.
Biases of the dataset can be a separate research topic.
Oh, so that's it.
Due to my limited experience, this is the first time I have come across this kind of training setup, so it seemed surprising to me.
My teachers have always asked me to use cross-validation.
Thank you very much. I am learning a new training approach from your code.
I appreciate your work and kind help; I have learned a lot from it.
Thank you for saying so!
I also learned a lot from the acknowledged repos. I recommend you try their code as well.
I'm sorry to bother you again.
I want to know whether the code for the paper 'A BERT-based two-stage model for Chinese Chengyu recommendation' (the two-stage one) uses only 'train_pretrain.py' and 'train_official.py'.
What's the difference between the stage-1 pre-training and running 'train_pretrain.py'?
What's more, what's the difference among w/o Pre-Training, w/o Fine-Tuning, w/o 𝐿V, and w/o 𝐿A? (I don't quite understand what you're showing in your paper.)
Could you describe more details? Thanks very much.
How about starting a new issue for each question, so that I can answer them one by one?
I am suggesting this because this may help others who might have similar questions.
OK!