Mismatch between loss and predictive power for the multistep sinusoid predictor
I reproduced the plot in the readme for the multistep predictor of a sinusoid, but after changing some hyperparams I'm seeing a mismatch between the loss and the predictive power. Below are the losses for a run with default params and a run with another set of hyperparams labelled "best-1lyr" (lr=0.00843, decay factor=0.9673, num features=110):
Both losses converge to a stable value after ~60 epochs; however, the predictions are not stable, nor do the losses line up with predictive power. Below are GIFs of the predictor output for the default params and my other set of params, respectively:
The run with default parameters appears to jump out of a locally convex region and into another around the 50th epoch. It actually does this twice, and the 100th-epoch prediction is the one with higher-magnitude noise at the start and end of the prediction. The run with the new parameters seems to remain in a fixed region of the cost surface; however, it has consistently much lower predictive power than the run with default parameters, while at the same time achieving a lower loss. Any ideas what issue(s) I might be running into?
One thing to note is that there appears to be some randomness in training even though the code sets random seeds for torch and numpy. I get different loss curves for multiple runs of the default params, but, oddly, they only diverge after exactly 15 epochs. Also, despite this, the training curves look pretty much the same.
Thank you for the detailed feedback. I wrote the code almost a year ago, but as far as I remember I just used the parameters that worked well for the single-step prediction. The model is sensitive to the number of features and heads - some combinations lead to models that do not learn anything useful.
However, your question was:
Why does the loss not correlate with the predictive power of the model?
This might be related to the way I calculated the loss. Back then I explored two options:
- Calculate the loss only over the predicted values, e.g. with a window size of 5, over the last 5 values.
- Calculate the loss over all values that the model outputs. Sometimes this is called teacher forcing.
The basic idea of option 2 is that the model outputs a value for every timestep. If you calculate the loss over the whole sequence, you force the model to produce correct values for every timestep, not only for your output window. I hoped that this would also force the model to pick up the overall pattern.
Option 2 is currently active in the code. However, it looks like this has side effects. For example, if the model gets all 5 future timesteps wrong but reproduces the other 100 values from the input window reasonably closely, then the model will have a low loss but no predictive power. It might also be easier for the model to optimize for the 100 input values than for the 5 prediction values, as they have a much greater impact on the loss.
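To make that concrete, here is a toy illustration (not the code from the repo; the shapes and names are just for the example): a "model output" that reconstructs the 100 input values almost perfectly but predicts nothing useful for the last 5 steps still gets a tiny loss under option 2, while the loss over the prediction window alone is large.

```python
import torch

criterion = torch.nn.MSELoss()
output_window = 5                      # the last 5 timesteps are the actual forecast

# 105 timesteps of a sinusoid: 100 input steps + 5 steps to predict.
target = torch.sin(torch.linspace(0, 10, 105))

# Hypothetical model output: near-perfect on the input window,
# completely wrong on the forecast window.
output = target.clone()
output[:-output_window] += 0.01 * torch.randn(100)   # tiny reconstruction error
output[-output_window:] = 0.0                        # useless forecast

loss_all = criterion(output, target)                                     # option 2
loss_pred = criterion(output[-output_window:], target[-output_window:])  # option 1

print(f"loss over all values:        {loss_all.item():.5f}")   # small
print(f"loss over prediction window: {loss_pred.item():.5f}")  # much larger
```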
Regarding the randomness: I only recently figured out that one has to set a bunch of flags to make it really reproducible. I will have to look into it.
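For reference, the set of flags I mean looks roughly like this on recent PyTorch versions (a sketch, not what the repo currently does):

```python
import os
import random

import numpy as np
import torch


def set_deterministic(seed: int = 42) -> None:
    """Seed all RNGs and force deterministic kernels in PyTorch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # cuDNN: use deterministic algorithms and disable the auto-tuner,
    # which otherwise picks kernels non-deterministically.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Raise an error for ops that have no deterministic implementation
    # (PyTorch >= 1.8); some CUDA ops additionally need this env var.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
```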
I introduced a flag called "calculate_loss_over_all_values", which lets you choose between the two options.
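In the training step the flag just selects which slice of the model output goes into the loss. A minimal sketch (variable names are assumptions and may differ from the repo):

```python
import torch

criterion = torch.nn.MSELoss()
output_window = 5
calculate_loss_over_all_values = False   # the new flag

# stand-ins for one training batch: full model output and matching targets
output = torch.randn(105, 1)
targets = torch.randn(105, 1)

if calculate_loss_over_all_values:
    # option 2: loss over every value the model outputs
    loss = criterion(output, targets)
else:
    # option 1: loss only over the last output_window values (the forecast)
    loss = criterion(output[-output_window:], targets[-output_window:])
```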
Results after 100 epochs, 110 features, lr = 0.005, gamma = 0.98:

- calculate_loss_over_all_values = True: loss = 0.00057
- calculate_loss_over_all_values = False: loss = 0.00886

Note that the two numbers are not directly comparable, since they are averaged over different parts of the sequence.
Hope that helps.
Feel free to reopen if that does not solve the problem.