Attempt removing Early Stopping
Context
- When training a NER model with SpaCy, we are using a `valid` split (on top of the `train` and `test` splits) to determine the optimal epoch for Early Stopping.
- But this approach has downsides:
  - If performance on the `valid` split oscillates (due to noise in the data) from epoch to epoch, early stopping on the `valid` split gives no guarantee that the epoch chosen is optimal also to maximize performance on the `test` split; we may just be overfitting the `valid` split as well.
  - If we use a `valid` split, it means that our NER model is in fact trained using fewer samples than what we could do.
  - Using Early Stopping and a `valid` split introduces further complexity in the experiment setup.
Actions
- Plot performance (recall, precision, f1-score) vs. number of iterations¹. Do we see overfitting? (A plotting sketch follows this list.)
- Do the same with test_loss vs. number of iterations. Do we see overfitting? How do these results compare with the ones obtained for the previous point?
- Based on these results, decide whether we can remove Early Stopping, and if yes then determine a reasonable value for `n_epochs` (or `n_iterations`).
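For the plotting step, something along these lines could be used (a minimal sketch; the `history` values are placeholders, and the score keys assume spaCy's `Scorer` output as returned by `nlp.evaluate` on the `test` split):

```python
# Sketch: plot NER metrics vs. training epoch to look for overfitting.
# `history` is assumed to be collected during training, e.g. by calling
# nlp.evaluate(test_examples) after each epoch (keys follow spaCy's Scorer).
import matplotlib.pyplot as plt

history = [
    {"epoch": 1, "ents_p": 0.61, "ents_r": 0.55, "ents_f": 0.58},
    {"epoch": 2, "ents_p": 0.70, "ents_r": 0.66, "ents_f": 0.68},
    # ... one entry per epoch (placeholder values, not real results)
]

epochs = [h["epoch"] for h in history]
for key, label in [("ents_p", "precision"), ("ents_r", "recall"), ("ents_f", "f1-score")]:
    plt.plot(epochs, [h[key] for h in history], marker="o", label=label)

plt.xlabel("epoch")
plt.ylabel("score on test split")
plt.title("NER performance vs. number of epochs")
plt.legend()
plt.show()
```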
Dependencies
- Requires #601.
Footnotes
1. Or number of epochs, to be determined what makes more sense.
IMO we should also talk about the benefits of using a validation set:
> If performance on `valid` split oscillates (due to noise in the data) from epoch to epoch, early stopping on `valid` split gives no guarantee that the epoch chosen is optimal also to maximize performance on the `test` split, we may just be overfitting the `valid` split as well.

By always taking the model corresponding to the last iteration we are subject to noise too.

> If we use a `valid` split, it means that our NER model is in fact trained using fewer samples than what we could do.

One can say the same thing about the `test` set.

> Using Early Stopping and a `valid` split introduces further complexity in the experiment setup.

Both `spacy` and `transformers` support it out of the box. Assuming we don't use it, then we will just have to hardcode `n_iterations=MAGICAL_NUMBER`. Related to this, this `MAGICAL_NUMBER` of iterations it takes to "fully" train a model will be a function of the training set size, which means that we would have to rerun the manual analysis of finding this number each time we expand our training set (i.e. when we get new annotations).
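For reference, a minimal sketch of the "out of the box" support on the `transformers` side (the model, datasets and `compute_metrics` function are placeholders, not code from this repo):

```python
# Sketch: early stopping with transformers' built-in Trainer callbacks.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

def train_with_early_stopping(model, train_dataset, valid_dataset, compute_metrics):
    """Train with early stopping on the valid split (placeholder arguments)."""
    args = TrainingArguments(
        output_dir="outputs",
        evaluation_strategy="epoch",   # evaluate on the valid split every epoch
        save_strategy="epoch",
        num_train_epochs=100,          # upper bound; early stopping usually cuts it short
        load_best_model_at_end=True,   # restore the best checkpoint, not the last one
        metric_for_best_model="f1",    # compute_metrics must return a dict with "f1"
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()
    return trainer
```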
> By always taking the model corresponding to the last iteration we are subject to noise too
Absolutely agree. I guess my point here was more like "early stopping may not help us" rather than "w/o early stopping we do better".
> If we use a `valid` split, it means that our NER model is in fact trained using fewer samples than what we could do.
>
> One can say the same thing about the `test` set.

Sure, but the `test` set is necessary to estimate the generalization performance of the model (which is what we then base our assessment of how well the model is doing on), while the `valid` set is not strictly needed per se (unless one needs to do some sort of model selection or hyperparameter tuning like early stopping).
> Related to this, this `MAGICAL_NUMBER` of iterations it takes to "fully" train a model will be a function of the training set size.

This is a more serious problem; I also briefly thought about it and I confess I am not sure I have an answer. What do you think of the following alternatives?
- Setting `n_epochs` instead of `n_iterations`.
  This will in a way already take into account the `train` set size. But the relation is just linear (`n_iterations = k * n_epochs * train_set_size`, see the sketch after this list), so maybe a bit simplistic and won't really generalize well?
- Setting `n_epochs` to an arbitrarily large number.
  This seems to be a common practice, assuming that `valid_acc` does not show signs of overfitting. Of course there's a trade-off between runtime vs. accuracy, e.g. setting `n_epochs = 1e9` may take too long to train. Would you have a reasonable estimate for a `LARGE_NUMBER` to set for `n_epochs`?
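A quick sketch of that linear relation, assuming `k` is essentially `1 / batch_size` (the numbers are illustrative only):

```python
# Sketch: the "magical number" of iterations grows with the training set,
# while a fixed n_epochs at least scales with it automatically.
import math

def n_iterations(n_epochs: int, train_set_size: int, batch_size: int) -> int:
    steps_per_epoch = math.ceil(train_set_size / batch_size)
    return n_epochs * steps_per_epoch

print(n_iterations(n_epochs=30, train_set_size=2_000, batch_size=32))  # 1890
print(n_iterations(n_epochs=30, train_set_size=5_000, batch_size=32))  # 4710
```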
Here are some resources I found discussing why Early Stopping isn't always a great idea.
- Reddit (2018): "[D] The use of early stopping (or not!) in neural nets (Keras)"
- Reddit (2018): "[Discussion] Early Stopping - Why not always?"
- Andrew Ng, "Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization > Week 1 > Other regularization methods"
But to be honest, in all these cases the argument seems to be: "Don't use Early Stopping to prevent overfitting on the `valid` set, because this rarely happens, and even if it happens then Early Stopping is not the best way to address the problem".
And this does not answer the other question we have, i.e. "OK, but then how many iterations is enough?"...
Update 2022-08-16
- Currently, our NER training script assumes that we provide a validation split, which is used for Early Stopping
- Even assuming that our model does not overfit as the number of epochs increases, we still need to decide when to stop the training
- The minimum `n_epochs` needed to train the model to good accuracy, however, depends on the nature and on the size of the training set, so it is impossible/really hard to predict what would be a "good guess" for `n_epochs`
In conclusion:
- For the time being, we will stick to having a `valid_split` and Early Stopping.
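For reference, a rough sketch of what "a `valid_split` plus Early Stopping" amounts to in a manual spaCy v3 training loop (illustrative only; the batch size, patience and function name are not taken from our actual training script):

```python
# Sketch: early stopping on the valid split, keeping the best checkpoint.
from spacy.util import minibatch

def train_ner_with_early_stopping(nlp, train_examples, valid_examples,
                                  max_epochs=100, patience=5):
    # nlp is a spaCy pipeline with an "ner" component; the example lists
    # hold spacy.training.Example objects for the train and valid splits.
    optimizer = nlp.initialize(lambda: train_examples)
    best_f, best_bytes, epochs_without_improvement = -1.0, None, 0

    for epoch in range(max_epochs):
        losses = {}
        for batch in minibatch(train_examples, size=32):
            nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses)

        ents_f = nlp.evaluate(valid_examples)["ents_f"]
        if ents_f > best_f:
            best_f, best_bytes, epochs_without_improvement = ents_f, nlp.to_bytes(), 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement on the valid split for `patience` epochs

    if best_bytes is not None:
        nlp.from_bytes(best_bytes)  # restore the best-scoring model
    return nlp, best_f
```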