VisualJoyce/ChengyuBERT

About parameters

Closed this issue · 20 comments

I used the parameters shown in your paper.
[image]

pre-trained BERT: Chinese with Whole Word Masking (WWM)
maximum length: 128
batch size: 40 (4 × 10 GPU cards)
initial learning rate: 0.00005
warm-up steps: 1000
optimizer: AdamW
scheduler: WarmupLinearSchedule
epochs: 5 (num_train_steps about 80800)

Because of my device (1 × GTX2080Ti), I set train_batch_size = 6000 and num_train_steps to about 80800, so the experiment still runs for 5 epochs with a batch size of 40.

But I cannot reach your accuracy; the following picture shows the accuracy of my experiment.
[image]
That's a difference of nearly 3~6%.
[image]

Here is my training config JSON:
    {
        "train_txt_db": "official_train.db",
        "val_txt_db": "official_dev.db",
        "test_txt_db": "official_test.db",
        "out_txt_db": "official_out.db",
        "sim_txt_db": "official_sim.db",
        "ran_txt_db": "official_ran.db",
        "pretrained_model_name_or_path": "hfl/chinese-bert-wwm-ext",
        "model": "chengyubert-dual",
        "dataset_cls": "chengyu-masked",
        "eval_dataset_cls": "chengyu-masked-eval",
        "output_dir": "storage",
        "candidates": "combined",
        "len_idiom_vocab": 3848,
        "max_txt_len": 128,
        "train_batch_size": 6000,
        "val_batch_size": 20000,
        "gradient_accumulation_steps": 1,
        "learning_rate": 0.00005,
        "valid_steps": 100,
        "num_train_steps": 80800,
        "optim": "adamw",
        "betas": [0.9, 0.98],
        "adam_epsilon": 1e-08,
        "dropout": 0.1,
        "weight_decay": 0.01,
        "grad_norm": 1.0,
        "warmup_steps": 1000,
        "seed": 77,
        "fp16": true,
        "n_workers": 0,
        "pin_mem": true,
        "location_only": false
    }

What's wrong with the parameters?

I am not sure if this is due to gradient gathering across multiple GPUs.

My suggestion for a single card is to fill the GPU memory and set gradient_accumulation_steps = 5.

Let's see if that works.
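The suggestion above can be sketched as follows; this is a minimal toy example of gradient accumulation using numpy (the repo itself uses PyTorch, and all names and numbers here are illustrative, not taken from the code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy linear regression standing in for the real model: find w
# minimizing the mean of (w * x - y)^2.
X = rng.normal(size=1000)
Y = 3.0 * X + rng.normal(scale=0.1, size=1000)

w = 0.0
lr = 0.1
accumulation_steps = 5  # one parameter update per 5 micro-batches
micro_batch = 8         # in practice: as large as fits in GPU memory

grad_accum = 0.0
for step in range(accumulation_steps * 50):
    idx = rng.integers(0, len(X), size=micro_batch)
    x, y = X[idx], Y[idx]
    grad = np.mean(2.0 * (w * x - y) * x)  # gradient of the micro-batch loss
    # Divide by accumulation_steps so the accumulated value is the average
    # gradient over the effective batch of 40 examples, not a sum.
    grad_accum += grad / accumulation_steps
    if (step + 1) % accumulation_steps == 0:
        w -= lr * grad_accum  # update once per 5 micro-batches
        grad_accum = 0.0
```

The effect matches training with a batch of accumulation_steps × micro_batch examples, at the memory cost of a single micro-batch.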

Feed the GPU with full memory?
Does that mean 'train_batch_size' can be set to a larger value than the one currently used?

Could you please provide the JSON parameter file you used for training? I'd be grateful.

I have been updating this repo for a while, but the original JSON config has not changed much.

But I do recommend the following settings for a single GPU:

    "train_batch_size": 11000,
    "gradient_accumulation_steps": 5,
    "num_train_steps": 18000,

I'm sorry about the reproduction issues you encountered; I hope we can find the cause through trials.

I appreciate your kind help. I'll try the experiment again. ♥

With num_train_steps = 18000 and train_batch_size = 11000,
that is only about 2 epochs, not 5. Does it matter?

If we use gradient_accumulation_steps = 5, each optimizer step uses five times as many examples.
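Back-of-the-envelope, assuming num_train_steps counts optimizer updates (my reading of this thread, not verified against the code):

```python
# Each optimizer step now consumes 5 micro-batches of 11000 tokens, so the
# data seen per step, and in total, scales with the accumulation factor.
tokens_per_micro_batch = 11000
accumulation_steps = 5
optimizer_steps = 18000

tokens_per_step = tokens_per_micro_batch * accumulation_steps  # 55000
total_tokens = tokens_per_step * optimizer_steps               # 990,000,000
```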

OK! I understand. Thanks. I will try again, and hopefully it will achieve the desired score.

Yes, I also feel nervous about reproduction, although I have run my code several times.

I hope we can reproduce without difficulty and I will update the parameters for the benefit of all.

Congratulations! The new experiment achieves the desired score (approaching it, though not exceeding it).
[image]

It needs gradient_accumulation_steps = 5.
But why? Could you explain the principle behind this?

Glad that works!

I think a larger batch size converges better, mainly because the accumulated stochastic gradients are closer to the full-batch gradient when fitting the dataset.
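That intuition can be checked with a small hypothetical numpy experiment (not from the repo): the mean gradient of a larger sample lies closer to the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are per-example gradients of one scalar parameter.
grads = rng.normal(loc=1.0, scale=5.0, size=100_000)
full_batch = grads.mean()

def batch_error(batch_size, trials=200):
    """Mean |mini-batch gradient - full-batch gradient| over many draws."""
    errs = [abs(rng.choice(grads, size=batch_size).mean() - full_batch)
            for _ in range(trials)]
    return float(np.mean(errs))

small, large = batch_error(8), batch_error(200)
# The error shrinks roughly like 1/sqrt(batch_size), so the larger batch
# tracks the full-batch direction much more closely.
```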

Well, that's amazing! I've learned something new.
What's more, the two-stage training that appears in your code isn't described in detail in your paper.
What is the two-stage training about? Will it get a higher score?

For two-stage, you can directly try Stage-Two. If you are interested in the paper, here is the link Two-Stage.

Glad !

I have another question.
You used 'valid_steps' to pick the best-scoring checkpoint, but in some cases that can be coincidental or fortuitous.
From my observations, the dev-set accuracy during training was mostly stable at 79.
Would you consider K-fold cross-validation to get a more convincing score?
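For context, the 'valid_steps' pattern under discussion is roughly the following (a schematic sketch; evaluate and save_checkpoint are hypothetical stand-ins, not the repo's actual functions):

```python
def train_loop(num_train_steps, valid_steps, evaluate, save_checkpoint):
    """Keep only the checkpoint with the best dev-set accuracy."""
    best_acc = float("-inf")
    for step in range(1, num_train_steps + 1):
        # ... one training step would happen here ...
        if step % valid_steps == 0:
            acc = evaluate()          # accuracy on the dev set
            if acc > best_acc:        # new best checkpoint found
                best_acc = acc
                save_checkpoint(step)
    return best_acc
```

The reported test score then comes from the single checkpoint that scored best on the dev set, which is why one lucky validation point can look fortuitous.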

This dataset is large, so cross-validation needs much more computation. If the goal is to get better performance, it is one way.

In most cases, if the result can support that the method works, we follow the train-dev-test split used in most large-scale QA tasks.

Biases of the dataset can be a separate research topic.

Oh, so that's it.

Because of my limited experience, this is the first time I have encountered this kind of training setup, so it seemed surprising to me.
My teachers always used to ask me to use cross validation.
Thank you very much. I am learning a new training approach from your code.

I appreciate your work and kind help; it has taught me a lot.

Thank you for saying so!

I also learned a lot from the acknowledged repos. I recommend you try their code as well.

I'm sorry to bother you again.

I want to know whether the code for the paper ('A BERT-based two-stage model for Chinese Chengyu recommendation', about the two-stage approach) uses only 'train_pretrain.py' and 'train_official.py'.
What's the difference between stage-1-pretrain and using 'train_pretrain.py'?

What's more, what's the difference among w/o Pre-Training, w/o Fine-Tuning, w/o 𝐿V, and w/o 𝐿A? (I don't quite understand what you're showing in your paper.)

Could you describe more details? Thanks very much.

How about opening a new issue for each question, so that I can answer them one by one?

I am suggesting this because this may help others who might have similar questions.

OK!