Comparison version 0.2.6 and 0.3.0 with scibert
lfoppiano opened this issue · 10 comments
I've run several tests with SciBERT, trying to keep the same conditions between the two versions of DeLFT.
# | delft version | run | architecture | batch size | max seq length | max epoch | F1 |
---|---|---|---|---|---|---|---|
1 | 0.2.6 | 24142 | scibert | 6 | 512 | 50 | 0.8332 |
2 | 0.2.6 | 24063 | scibert | 6 | 512 | 50 | 0.8327 |
3 | 0.3.0 | 24141 | BERT | 20 | 512 | 60+early stop | 0.8134 |
4 | 0.3.0 | 24138 | BERT_CRF | 20 | 512 | 60+early stop | 0.8092 |
7 | 0.3.0 | 24136 | BERT_CRF | 20 | 512 | 60+early stop | 0.8173 |
5 | 0.3.0 | 24146 | BERT_CRF | 20 | 512 | 15 | 0.8137 |
6 | 0.3.0 | 24145 | BERT | 20 | 512 | 15 | 0.8178 |
8 | 0.3.0 | 24147 | BERT_CRF | 6 | 512 | 50 | 0.8327 |
9 | 0.3.0 | 24148 | BERT | 6 | 512 | 50 | 0.8325 |
Runs 2 and 4 are repetitions, to make sure the results are consistent.
The dataset is the same, details here:
- 8167 train sequences
- 908 validation sequences
- 1009 evaluation sequences
I could try to reduce the batch size for DeLFT 0.3.0, but I doubt that would make any difference.
`early_stop` was not supported by the BERT architecture in the previous DeLFT version, only by the RNN architectures. So you were likely doing 50 epochs?
I would change the max epoch as a first try. The number of epochs for BERT-based models can normally be very low. With TF1 I got my best results for NER with 3–5 epochs; after that, accuracy was unchanged or decreasing. With TF2, I keep it at 5–10.
It might depend on the training size, I guess. What is the size of this training set?
With a higher number of epochs, you could also try to decrease the learning rate.
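For illustration, here is a generic linear warmup/decay schedule, a common pattern when fine-tuning BERT-style models. This is a sketch under assumed defaults (`base_lr=2e-5` is just an example value), not DeLFT's actual scheduler:

```python
def linear_decay_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup followed by linear decay to zero.

    Illustrative only: when max_epoch grows, total_steps grows with it,
    so each step sees a proportionally smaller rate late in training.
    """
    if warmup_steps and step < warmup_steps:
        # ramp up from 0 to base_lr during warmup
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    denom = max(total_steps - warmup_steps, 1)
    # decay linearly from base_lr down to 0 at the last step
    return base_lr * remaining / denom
```

With this kind of schedule, raising the number of epochs effectively lowers the average learning rate seen at any given fraction of training, which is one way to combine "more epochs" with "lower learning rate".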
On my side, for reference, for the CoNLL and GROBID models (using SciBERT), all the models using BERT give slightly better results with the new version. The model for my largest training set ("software mention recognition", 8M tokens) with SciBERT also shows a small improvement.
> `early_stop` was not supported by the BERT architecture in the previous DeLFT version, only by the RNN architectures. So you were likely doing 50 epochs?

Ah ok, I updated the two rows then.
> I would change the max epoch as a first try. The number of epochs for BERT-based models can normally be very low. With TF1 I got my best results for NER with 3–5 epochs; after that, accuracy was unchanged or decreasing. With TF2, I keep it at 5–10. It might depend on the training size, I guess. What is the size of this training set?
The training set is:
- 8167 train sequences
- 908 validation sequences
- 1009 evaluation sequences
I changed the epochs to 15 (see rows 5 and 6) and the scores improve a bit, but not quite as much as rows 1 and 2. Should I reduce it even more, e.g. to 5 or 10?
> With a higher number of epochs, you could also try to decrease the learning rate.
>
> On my side, for reference, for the CoNLL and GROBID models (using SciBERT), all the models using BERT give slightly better results with the new version. The model for my largest training set ("software mention recognition", 8M tokens) with SciBERT also shows a small improvement.
> I changed the epochs to 15 (see rows 5 and 6) and the scores improve a bit, but not quite as much as rows 1 and 2. Should I reduce it even more, e.g. to 5 or 10?
I would try 5 to see, but I had good results with 15. Maybe also decrease the batch size to 6 to check. It's unexpected to see a lower score with CRF; normally it improves the results a bit. It's possible to use `BERT_ChainCRF` as an implementation variant to double-check.
Not sure it's useful, but to be sure to use all available training data with BERT, note that the `early_stop` parameter is `true` by default, so you have to set `early_stop` to `false` explicitly before training to be sure it's not used. Then both the train and validation sets will be used for training until `max_epoch` is reached.
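As a generic illustration of this behaviour, here is a minimal sketch of the usual early-stopping pattern (not DeLFT's actual implementation; `val_scores` is a hypothetical list of per-epoch validation F1 scores):

```python
def train(max_epoch, early_stop, val_scores, patience=5):
    """Illustrative training loop.

    With early_stop=True, a held-out validation score gates the stop:
    training ends once the score has not improved for `patience` epochs.
    With early_stop=False, all max_epoch epochs run, and the validation
    data could instead be merged into the training set.
    Returns the number of epochs actually run.
    """
    best, stale = 0.0, 0
    for epoch in range(1, max_epoch + 1):
        score = val_scores[epoch - 1]  # validation pass on held-out data
        if not early_stop:
            continue                   # ignore the stop criterion entirely
        if score > best:
            best, stale = score, 0     # new best model, reset counter
        else:
            stale += 1
            if stale >= patience:
                return epoch           # stopped early
    return max_epoch
```

This shows why the two settings are not directly comparable: `early_stop=True` both shortens training and requires keeping the validation sequences out of the training data.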
OK. It seems that the results are comparable 🎉, see runs 8 and 9.
After a lot of tries with the new version, I obtain the best results using `early_stop=True` for architectures using BERT, even though part of the training data is then used to check the stopping criterion.
With `early_stop=True` that was not the case for me. Did you use any special parameter?
I tested again with the latest changes. With `early_stop=True` I unfortunately get worse results.
Here is the comparison between `early_stop=True` and `early_stop=False` for DeLFT 0.3.0.
# | run | architecture | transformer | batch size | max seq length | max epoch | early_stop | F1 |
---|---|---|---|---|---|---|---|---|
1 | 24304 | BERT_CRF | scibert_cased | 20 | 512 | 60 | False | 82.99 |
2 | 24311 | BERT_CRF | scibert_cased | 20 | 512 | 60 | True | 81.44 |
3 | 24305 | BERT | scibert_cased | 20 | 512 | 60 | False | 82.73 |
4 | 24312 | BERT | scibert_cased | 20 | 512 | 60 | True | 81.70 |
5 | 24307 | BERT_CRF_FEATURES | scibert_cased | 20 | 512 | 60 | False | 83.31 |
6 | | BERT_CRF_FEATURES | scibert_cased | 20 | 512 | 60 | True | |
7 | 24418 | BERT_CRF_FEATURES | scibert_cased | 20 | 512 | 10 | False | 81.29 |
8 | 24303 | BERT_CRF | matscibert | 20 | 512 | 60 | False | 82.88 |
9 | 24313 | BERT_CRF | matscibert | 20 | 512 | 60 | True | 81.52 |
Could it be that the ~1000 validation examples are making such a difference?
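For scale, simple arithmetic on the dataset counts reported earlier in the thread shows how much labelled data is kept out of training when a validation set is held out:

```python
# counts reported earlier in this thread
train_seqs, val_seqs = 8167, 908

# fraction of the labelled (train + validation) data held out
held_out = val_seqs / (train_seqs + val_seqs)
print(f"{held_out:.1%} of the labelled data is held out for early stopping")
```

Roughly 10% of the labelled sequences are unavailable for weight updates when the validation set only serves the stop criterion, which is plausibly enough to shift the F1 by a fraction of a point.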
I'm closing this for the moment, as I've managed to obtain the same results with DeLFT 0.3.0 as I had before 🎉