Comparison version 0.2.6 and 0.3.0 with scibert
lfoppiano opened this issue · 10 comments
I've run several tests with SciBERT, trying to keep the same conditions between the two versions of DeLFT.
# | delft version | run | architecture | batch size | max seq length | max epoch | F1 |
---|---|---|---|---|---|---|---|
1 | 0.2.6 | 24142 | scibert | 6 | 512 | 50 | 0.8332 |
2 | 0.2.6 | 24063 | scibert | 6 | 512 | 50 | 0.8327 |
3 | 0.3.0 | 24141 | BERT | 20 | 512 | 60+early stop | 0.8134 |
4 | 0.3.0 | 24138 | BERT_CRF | 20 | 512 | 60+early stop | 0.8092 |
7 | 0.3.0 | 24136 | BERT_CRF | 20 | 512 | 60+early stop | 0.8173 |
5 | 0.3.0 | 24146 | BERT_CRF | 20 | 512 | 15 | 0.8137 |
6 | 0.3.0 | 24145 | BERT | 20 | 512 | 15 | 0.8178 |
8 | 0.3.0 | 24147 | BERT_CRF | 6 | 512 | 50 | 0.8327 |
9 | 0.3.0 | 24148 | BERT | 6 | 512 | 50 | 0.8325 |
Runs 2 and 4 are repetitions, to make sure the results are consistent.
The dataset is the same, details here:
- 8167 train sequences
- 908 validation sequences
- 1009 evaluation sequences
I could try to reduce the batch size for DeLFT 0.3.0, but I doubt that would make any difference.
`early_stop` was not supported by the BERT architecture in the previous DeLFT version, only by the RNN architectures. So you were likely doing 50 epochs?
I would change the max epoch as a first try. The number of epochs for BERT-based models can normally be very low. With TF1 I got my best results for NER with 3–5 epochs; after that, accuracy was unchanged or decreasing. With TF2, I keep it at 5–10.
It might depend on the training size, I guess. What is the size of this training set?
With a higher number of epochs, you could also try to decrease the learning rate.
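For illustration, here is a generic linear warmup/decay schedule, a common pattern when fine-tuning BERT-style models. This is a sketch under assumed defaults (`base_lr=2e-5` is just an example value), not DeLFT's actual scheduler:

```python
def linear_decay_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup followed by linear decay to zero.

    Illustrative only: when max_epoch grows, total_steps grows with it,
    so each step sees a proportionally smaller rate late in training.
    """
    if warmup_steps and step < warmup_steps:
        # ramp up from 0 to base_lr during warmup
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    denom = max(total_steps - warmup_steps, 1)
    # decay linearly from base_lr down to 0 at the last step
    return base_lr * remaining / denom
```

With this kind of schedule, raising the number of epochs effectively lowers the average learning rate seen at any given fraction of training, which is one way to combine "more epochs" with "lower learning rate".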
On my side, for reference, for the CoNLL and GROBID models (using SciBERT), all the models using BERT give slightly better results with the new version. The model for my largest training set ("software mention recognition", 8M tokens) with SciBERT also shows a small improvement.
> `early_stop` was not supported by the BERT architecture in the previous DeLFT version, only by the RNN architectures. So you were likely doing 50 epochs?

Ah ok, I updated the two rows then.
> I would change the max epoch as a first try. The number of epochs for BERT-based models can normally be very low. With TF1 I got my best results for NER with 3–5 epochs; after that, accuracy was unchanged or decreasing. With TF2, I keep it at 5–10. It might depend on the training size, I guess. What is the size of this training set?
The training set is:
- 8167 train sequences
- 908 validation sequences
- 1009 evaluation sequences
I changed the epochs to 15 (see rows 5 and 6) and the scores improve a bit, but not quite as much as rows 1 and 2. Should I reduce it even more, e.g. to 5 or 10?
> With a higher number of epochs, you could also try to decrease the learning rate.
>
> On my side, for reference, for the CoNLL and GROBID models (using SciBERT), all the models using BERT give slightly better results with the new version. The model for my largest training set ("software mention recognition", 8M tokens) with SciBERT also shows a small improvement.
> I changed the epochs to 15 (see rows 5 and 6) and the scores improve a bit, but not quite as much as rows 1 and 2. Should I reduce it even more, e.g. to 5 or 10?
I would try 5 to see, but I had good results with 15. Maybe also decrease the batch size to 6 to check. It's unexpected to see a lower score with CRF; normally it improves the results a bit. It's possible to use `BERT_ChainCRF` as an implementation variant to double-check.
Not sure it's useful, but to be sure to use all available training data with BERT, note that the `early_stop` parameter is `true` by default, so you have to set `early_stop` to `false` explicitly before training to be sure it's not used. Then both the train and validation sets will be used for training until `max_epoch` is reached.
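As a generic illustration of this behaviour, here is a minimal sketch of the usual early-stopping pattern (not DeLFT's actual implementation; `val_scores` is a hypothetical list of per-epoch validation F1 scores):

```python
def train(max_epoch, early_stop, val_scores, patience=5):
    """Illustrative training loop.

    With early_stop=True, a held-out validation score gates the stop:
    training ends once the score has not improved for `patience` epochs.
    With early_stop=False, all max_epoch epochs run, and the validation
    data could instead be merged into the training set.
    Returns the number of epochs actually run.
    """
    best, stale = 0.0, 0
    for epoch in range(1, max_epoch + 1):
        score = val_scores[epoch - 1]  # validation pass on held-out data
        if not early_stop:
            continue                   # ignore the stop criterion entirely
        if score > best:
            best, stale = score, 0     # new best model, reset counter
        else:
            stale += 1
            if stale >= patience:
                return epoch           # stopped early
    return max_epoch
```

This shows why the two settings are not directly comparable: `early_stop=True` both shortens training and requires keeping the validation sequences out of the training data.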
OK. It seems that the results are comparable 🎉, see runs 8 and 9.
After a lot of tries with the new version, I obtain the best results using `early_stop=True` for architectures using BERT, even though part of the training data is then used to check the stopping criterion.
With `early_stop=True` that was not the case for me. Did you use any special parameter?
I tested again with the latest changes. With `early_stop=True` I unfortunately get worse results.
Here is the comparison between `early_stop=True` and `early_stop=False` for DeLFT 0.3.0.
# | run | architecture | transformer | batch size | max seq length | max epoch | early_stop | F1 |
---|---|---|---|---|---|---|---|---|
1 | 24304 | BERT_CRF | scibert_cased | 20 | 512 | 60 | False | 82.99 |
2 | 24311 | BERT_CRF | scibert_cased | 20 | 512 | 60 | True | 81.44 |
3 | 24305 | BERT | scibert_cased | 20 | 512 | 60 | False | 82.73 |
4 | 24312 | BERT | scibert_cased | 20 | 512 | 60 | True | 81.70 |
5 | 24307 | BERT_CRF_FEATURES | scibert_cased | 20 | 512 | 60 | False | 83.31 |
6 | | BERT_CRF_FEATURES | scibert_cased | 20 | 512 | 60 | True | |
7 | 24418 | BERT_CRF_FEATURES | scibert_cased | 20 | 512 | 10 | False | 81.29 |
8 | 24303 | BERT_CRF | matscibert | 20 | 512 | 60 | False | 82.88 |
9 | 24313 | BERT_CRF | matscibert | 20 | 512 | 60 | True | 81.52 |
Could it be that the ~1000 validation examples are making such a difference?
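For scale, simple arithmetic on the dataset counts reported earlier in the thread shows how much labelled data is kept out of training when a validation set is held out:

```python
# counts reported earlier in this thread
train_seqs, val_seqs = 8167, 908

# fraction of the labelled (train + validation) data held out
held_out = val_seqs / (train_seqs + val_seqs)
print(f"{held_out:.1%} of the labelled data is held out for early stopping")
```

Roughly 10% of the labelled sequences are unavailable for weight updates when the validation set only serves the stop criterion, which is plausibly enough to shift the F1 by a fraction of a point.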
I'm closing this for the moment, as I've managed to obtain the same results with DeLFT 0.3.0 as I had before 🎉