Attempt removing Early Stopping
Context
- When training a NER model with SpaCy, we are using a `valid` split (on top of the `train` and `test` splits) to determine the optimal epoch for Early Stopping.
- But this approach has downsides:
  - If performance on the `valid` split oscillates (due to noise in the data) from epoch to epoch, early stopping on the `valid` split gives no guarantee that the epoch chosen is optimal also to maximize performance on the `test` split; we may just be overfitting the `valid` split as well.
  - If we use a `valid` split, it means that our NER model is in fact trained using fewer samples than what we could do.
  - Using Early Stopping and a `valid` split introduces further complexity in the experiment setup.
Actions
- Plot performance (recall, precision, f1-score) vs. number of iterations¹. Do we see overfitting? (A plotting sketch follows this list.)
- Do the same with test_loss vs. number of iterations. Do we see overfitting? How do these results compare with the ones obtained for the previous point?
- Based on these results, decide whether we can remove Early Stopping, and if yes then determine a reasonable value for `n_epochs` (or `n_iterations`).
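For the plotting step, something along these lines could be used (a minimal sketch; the `history` values are placeholders, and the score keys assume spaCy's `Scorer` output as returned by `nlp.evaluate` on the `test` split):

```python
# Sketch: plot NER metrics vs. training epoch to look for overfitting.
# `history` is assumed to be collected during training, e.g. by calling
# nlp.evaluate(test_examples) after each epoch (keys follow spaCy's Scorer).
import matplotlib.pyplot as plt

history = [
    {"epoch": 1, "ents_p": 0.61, "ents_r": 0.55, "ents_f": 0.58},
    {"epoch": 2, "ents_p": 0.70, "ents_r": 0.66, "ents_f": 0.68},
    # ... one entry per epoch (placeholder values, not real results)
]

epochs = [h["epoch"] for h in history]
for key, label in [("ents_p", "precision"), ("ents_r", "recall"), ("ents_f", "f1-score")]:
    plt.plot(epochs, [h[key] for h in history], marker="o", label=label)

plt.xlabel("epoch")
plt.ylabel("score on test split")
plt.title("NER performance vs. number of epochs")
plt.legend()
plt.show()
```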
Dependencies
- Requires #601.
Footnotes
1. Or number of epochs, to be determined what makes more sense.
IMO we should also talk about the benefits of using a validation set:
> If performance on `valid` split oscillates (due to noise in the data) from epoch to epoch, early stopping on `valid` split gives no guarantee that the epoch chosen is optimal also to maximize performance on the `test` split, we may just be overfitting the `valid` split as well.

By always taking the model corresponding to the last iteration we are subject to noise too.

> If we use a `valid` split, it means that our NER model is in fact trained using fewer samples than what we could do.

One can say the same thing about the `test` set.

> Using Early Stopping and a `valid` split introduces further complexity in the experiment setup.

Both `spacy` and `transformers` support it out of the box. Assuming we don't use it, then we will just have to hardcode `n_iterations=MAGICAL_NUMBER`. Related to this, this `MAGICAL_NUMBER` of iterations it takes to "fully" train a model will be a function of the training set size, which means that we would have to rerun the manual analysis of finding this number each time we expand our training set (i.e. when we get new annotations).
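For reference, a minimal sketch of the "out of the box" support on the `transformers` side (the model, datasets and `compute_metrics` function are placeholders, not code from this repo):

```python
# Sketch: early stopping with transformers' built-in Trainer callbacks.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

def train_with_early_stopping(model, train_dataset, valid_dataset, compute_metrics):
    """Train with early stopping on the valid split (placeholder arguments)."""
    args = TrainingArguments(
        output_dir="outputs",
        evaluation_strategy="epoch",   # evaluate on the valid split every epoch
        save_strategy="epoch",
        num_train_epochs=100,          # upper bound; early stopping usually cuts it short
        load_best_model_at_end=True,   # restore the best checkpoint, not the last one
        metric_for_best_model="f1",    # compute_metrics must return a dict with "f1"
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()
    return trainer
```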
> By always taking the model corresponding to the last iteration we are subject to noise too
Absolutely agree. I guess my point here was more like "early stopping may not help us" rather than "w/o early stopping we do better".
> If we use a `valid` split, it means that our NER model is in fact trained using fewer samples than what we could do.
>
> One can say the same thing about the `test` set.

Sure, but the `test` set is necessary to estimate the generalization performance of the model (which is what we then base our assessment of how well the model is doing on), while the `valid` set is not strictly needed per se (unless one needs to do some sort of model selection or hyperparameter tuning like early stopping).
> Related to this, this `MAGICAL_NUMBER` of iterations it takes to "fully" train a model will be a function of the training set size.

This is a more serious problem; I also briefly thought about it and I confess I am not sure I have an answer. What do you think of the following alternatives?
- Setting `n_epochs` instead of `n_iterations`.
  This will in a way already take into account the `train` set size. But the relation is just linear (`n_iterations = k * n_epochs * train_set_size`, see the sketch after this list), so maybe a bit simplistic and won't really generalize well?
- Setting `n_epochs` to an arbitrarily large number.
  This seems to be a common practice, assuming that `valid_acc` does not show signs of overfitting. Of course there's a trade-off between runtime vs. accuracy, e.g. setting `n_epochs = 1e9` may take too long to train. Would you have a reasonable estimate for a `LARGE_NUMBER` to set for `n_epochs`?
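A quick sketch of that linear relation, assuming `k` is essentially `1 / batch_size` (the numbers are illustrative only):

```python
# Sketch: the "magical number" of iterations grows with the training set,
# while a fixed n_epochs at least scales with it automatically.
import math

def n_iterations(n_epochs: int, train_set_size: int, batch_size: int) -> int:
    steps_per_epoch = math.ceil(train_set_size / batch_size)
    return n_epochs * steps_per_epoch

print(n_iterations(n_epochs=30, train_set_size=2_000, batch_size=32))  # 1890
print(n_iterations(n_epochs=30, train_set_size=5_000, batch_size=32))  # 4710
```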
Here are some resources I found discussing why Early Stopping isn't always a great idea.
- Reddit (2018): "[D] The use of early stopping (or not!) in neural nets (Keras)"
- Reddit (2018): "[Discussion] Early Stopping - Why not always?"
- Andrew Ng, "Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization > Week 1 > Other regularization methods"
But to be honest, in all these cases the argument seems to be: "Don't use Early Stopping to prevent overfitting on the `valid` set, because this rarely happens, and even if it happens then Early Stopping is not the best way to address the problem".
And this does not answer the other question we have, i.e. "OK, but then how many iterations is enough?"...
Update 2022-08-16
- Currently, our NER training script assumes that we provide a validation split, which is used for Early Stopping
- Even assuming that our model does not overfit as the number of epochs increases, we still need to decide when to stop the training
- The minimum `n_epochs` needed to train the model to good accuracy, however, depends on the nature and on the size of the training set, so it is impossible/really hard to predict what would be a "good guess" for `n_epochs`
In conclusion:
- For the time being, we will stick to having a `valid_split` and Early Stopping.
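For reference, a rough sketch of what "a `valid_split` plus Early Stopping" amounts to in a manual spaCy v3 training loop (illustrative only; the batch size, patience and function name are not taken from our actual training script):

```python
# Sketch: early stopping on the valid split, keeping the best checkpoint.
from spacy.util import minibatch

def train_ner_with_early_stopping(nlp, train_examples, valid_examples,
                                  max_epochs=100, patience=5):
    # nlp is a spaCy pipeline with an "ner" component; the example lists
    # hold spacy.training.Example objects for the train and valid splits.
    optimizer = nlp.initialize(lambda: train_examples)
    best_f, best_bytes, epochs_without_improvement = -1.0, None, 0

    for epoch in range(max_epochs):
        losses = {}
        for batch in minibatch(train_examples, size=32):
            nlp.update(batch, sgd=optimizer, drop=0.2, losses=losses)

        ents_f = nlp.evaluate(valid_examples)["ents_f"]
        if ents_f > best_f:
            best_f, best_bytes, epochs_without_improvement = ents_f, nlp.to_bytes(), 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement on the valid split for `patience` epochs

    if best_bytes is not None:
        nlp.from_bytes(best_bytes)  # restore the best-scoring model
    return nlp, best_f
```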