MaxHalford/maxhalford.github.io

blog/online-learning-evaluation/

utterances-bot opened this issue · 19 comments

The correct way to evaluate online machine learning models - Max Halford

Motivation Most supervised machine learning algorithms work in the batch setting, whereby they are fitted on a training set offline, and are used to predict the outcomes of new samples. The only way for batch machine learning algorithms to learn from new samples is to train them from scratch with both the old samples and the new ones. Meanwhile, some learning algorithms are online, and can predict as well as update themselves when new samples are available.

https://maxhalford.github.io/blog/online-learning-evaluation/
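In code, the progressive validation scheme described in the post boils down to a predict-then-learn loop: each sample is scored before the model learns from it. Here is a minimal sketch using river, the successor of the creme library discussed in the comments below; it assumes river's `predict_one`/`learn_one` API and uses one of its bundled datasets purely for illustration.

```python
from river import datasets, linear_model, metrics, preprocessing

# Scale the features, then feed them to a linear regression.
model = preprocessing.StandardScaler() | linear_model.LinearRegression()
metric = metrics.MAE()

# Progressive validation: every sample is scored *before* the model
# learns from it, so the model never gets to peek at a sample in advance.
for x, y in datasets.TrumpApproval():
    y_pred = model.predict_one(x)  # test on the incoming sample
    metric.update(y, y_pred)       # update the running metric
    model.learn_one(x, y)          # only then learn from it

print(metric)
```

river also ships an `evaluate.progressive_val_score` helper that wraps this loop.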

Hi Max, very interesting blog post that gives an introduction to online learning. I am particularly interested in the oscillating MSE loss for the taxi dataset. You speculated that it might be due to an underlying seasonality that the model is not capturing, which I also agree with. But it seems that neither progressive validation nor delayed progressive validation addresses this problem. What do you think might solve it in an online learning setting?

Hey @q138ben. Well, the point of progressive validation isn't to solve the oscillation issue. The point of this blog post was mostly to discuss the correct way of validating an online model.

I believe that the oscillation issue is a typical example of "drift". Therefore, a model with a more aggressive learning rate could and should work. A nearest neighbours approach where the neighbours are selected from recent observations should also work well.
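Here's a rough sketch of the first suggestion, not taken from the post: with river (the successor of creme), the learning rate is set on the optimizer, and the values below are arbitrary.

```python
from river import linear_model, optim, preprocessing

# A higher SGD learning rate makes recent samples weigh more heavily,
# which helps the model track drift at the cost of extra variance.
aggressive_model = (
    preprocessing.StandardScaler()
    | linear_model.LinearRegression(optimizer=optim.SGD(lr=0.1))
)

# A more conservative learner for comparison.
conservative_model = (
    preprocessing.StandardScaler()
    | linear_model.LinearRegression(optimizer=optim.SGD(lr=0.005))
)
```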

I hope this helps :). If I get some time I'll add an example in another blog post.

My bad, I misunderstood you. I was referring to the cyclic aspect of the error line.

As you said, the oscillation should be fixable by using mini-batches when computing the error gradient. Since writing this blog post, we've implemented some mini-batch methods in creme. It would be worthwhile to try these out on this particular dataset!
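As a rough idea of what that could look like, assuming river's `learn_many`/`predict_many` mini-batch methods (available for some estimators such as `LinearRegression`) and a toy pandas mini-batch standing in for the actual taxi data:

```python
import pandas as pd
from river import linear_model

model = linear_model.LinearRegression()

# Toy mini-batch; in practice this would be a chunk of the taxi dataset.
X = pd.DataFrame({"distance": [1.2, 3.4, 0.8], "hour": [8, 17, 23]})
y = pd.Series([7.5, 18.0, 6.0], name="duration")

model.learn_many(X, y)        # update on a whole mini-batch at once
print(model.predict_many(X))  # vectorised predictions for the batch
```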

Kind regards :)

Regarding the progressive validation technique, where (f(x), y) is evaluated before an update: isn't that essentially deploying an untested model? In other words, you update the model w/ the new (x, y) pairs and assume there's no catastrophic performance regression?

@eggie5 I'm not 100% sure what you mean. f(x) is necessarily evaluated before y is made available due to the arrival times of x and y. Catastrophic performance regression isn't a thing here because we're constantly staying up to date with the latest data.

@MaxHalford (difficult b/c github doesn't have latex but...) Under progressive validation, at time t we have a model f with y ≈ f(x). Then at t+1, new data (x, y) comes in; we predict y with f(x) and run the evaluation. Then we update f based on the gradient computed from f(x) and y.

This means we have two models: f at time t and f at time t+1. The latter is untested and is what gets promoted to production.

I subscribe to the school of thought that every model should be tested before deployment. Is this not the case w/ progressive validation?

@eggie5 Just to clarify the discussion: f is the model. x are the features. y is the ground truth. p = f(x) is the prediction.

When x first arrives, we make a prediction p = f(x). When y arrives (which is necessarily after x), we look at the difference between p and y in order to adjust the model. We don't need to repredict p because it's already been done.

The model is always up to date with the latest y data. It can't be "tested" on future data because, well, that data isn't available. We do however have an idea of the model's performance on the latest data so that gives us a pretty good idea of how we're doing.

It's nice to be able to test a model before deploying it, but the situation here is different. We're constantly updating the model with new data, so we can worry less about our model's performance drifting.
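To make the "predict when x arrives, update when y arrives" flow concrete, here is a minimal sketch (using river for the model, but any online learner with a predict/learn API would do). Pending predictions are kept in a dictionary keyed by a made-up sample identifier until the corresponding ground truth shows up; this is essentially what delayed progressive validation does.

```python
from river import linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LinearRegression()
metric = metrics.MAE()
pending = {}  # predictions waiting for their ground truth

def on_features(sample_id, x):
    """Called when the features x of a sample arrive."""
    pending[sample_id] = (x, model.predict_one(x))

def on_ground_truth(sample_id, y):
    """Called later, once the ground truth y becomes available."""
    x, y_pred = pending.pop(sample_id)
    metric.update(y, y_pred)  # the prediction was made before any update
    model.learn_one(x, y)     # only now does the model see the label
```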

@MaxHalford

Just to clarify the discussion: f is the model. x are the features. y is the ground truth. p = f(x) is the prediction.

When x first arrives, we make a prediction p = f(x). When y arrives (which is necessarily after x), we look at the difference between p and y in order to adjust the model. We don't need to repredict p because it's already been done.

👍

Thanks, we're on the same page.

I just wanted to confirm that w/ this approach we are technically deploying an untested model. However, the implications of that may be small b/c the new data may only be one (x, y) sample or a small batch. I guess the question is: how sensitive is your model to new data? Could a new batch drastically shift performance? I think for most of us the answer is not much.

Compared to the alternative of a 1-step lag (i.e. 1 day), training on (t-1) and evaluating on (t), this allows a faster response to new data in the online setting.

@eggie5 happy we're aligned!

There are two reasons why performance may plummet:

  • The model is not capable of adapting itself fast enough.
  • The ground truth values are not being fed to the model fast enough.

As always, it depends on your specific situation. Note that in my post I don't discuss batches: it's pure online learning with one sample at a time.

Hope this helps.

@MaxHalford follow-up question: any comments or insights into possibly bootstrapping the model to start online (stochastic) learning?

@eggie5 what exactly do you mean by bootstrapping? Like warming it up?

@MaxHalford yes, instead of starting learning from a random state, pre-train the model on some historical data in a batch-offline setting. Then start online-learning from the bootstrapped steady-state.

@eggie5 yes that's a very good thing to do when possible. The benefit of online learning is that you don't have to store data. But if you are storing data, then you may as well make the most of all the downtime that you have.
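A sketch of that warm-up, with a couple of made-up historical samples standing in for the archive; one simple option is to just replay the stored data through the online model before it goes live:

```python
from river import linear_model, preprocessing

model = preprocessing.StandardScaler() | linear_model.LinearRegression()

# Hypothetical historical samples, e.g. loaded from a database or a CSV dump.
history = [
    ({"distance": 1.2, "hour": 8}, 7.5),
    ({"distance": 3.4, "hour": 17}, 18.0),
]

# Warm-up: replay the archive, oldest first, so the model starts its
# online life from a sensible state instead of from scratch.
for x, y in history:
    model.learn_one(x, y)

# From here on, switch to the usual predict-then-learn online loop.
```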

The benefit of online learning is that you don't have to store data.

🤯
Interesting perspectives; it's difficult to switch to the streaming mindset. Thank you.

Thank you so much for the clear guidance. Really helpful!!!

I've spent a couple of hours looking into streaming data validation in ML, and this is what made the most sense to me. Thanks!

@MaxHalford Thank you for this in-depth post. What I haven't quite understood yet: online learning approaches basically suffer from the problem of catastrophic forgetting. In your post from Jan 11, 2021 you say that this problem is not present in your approach. Could you please explain this further?

Thanks in advance.

Hey @bayramf. Catastrophic forgetting is the issue of forgetting information from samples seen in the past. The thing is, when you're running your model online, it's actually ok to forget the past. All that matters is that the model is good on the current data. So having a model which remembers recent data and forgets older data (i.e. what some would call catastrophic forgetting) is actually desirable, in an online setting at least.
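One practical consequence: since only performance on recent data matters, it makes sense to monitor the metric over a sliding window rather than since the beginning of the stream. A minimal sketch in plain Python (the window size is arbitrary; river also provides rolling wrappers around its metrics):

```python
from collections import deque

class RollingMAE:
    """Mean absolute error over the last `window_size` samples only."""

    def __init__(self, window_size=500):
        self.errors = deque(maxlen=window_size)

    def update(self, y_true, y_pred):
        self.errors.append(abs(y_true - y_pred))

    def get(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0
```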