Question concerning training RNNs

Question

PelFritz opened this issue 2 years ago · 2 comments

Hi, I have some questions about generating the train, val, and test sets for the temperature forecasting problem.

When using the timeseries_dataset_from_array tool, we passed the argument shuffle=True. I thought we had to preserve the order of the data, so I expected shuffling to hurt the model. It turns out, however, that if I do not shuffle I get a worse model instead.
When building our model, from my understanding of the code, the input data still contains the temperature column, even though we are using the temperature as our targets. Is that not some kind of data leak?

Answer 1 · 2021-12-27T13:36:27.000Z

Shuffling here does not reorder the timesteps within each sequence; it only changes the order in which the sequences are drawn. Assume your data is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and sequence_length = 2. Then shuffle=False draws the sequences in chronological order, [0, 1], [1, 2], [2, 3], [3, 4], ..., while shuffle=True could yield [3, 4], [0, 1], [5, 6], [2, 3], ...

You see that the order within sequences is still intact.
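
To make this concrete, here is a minimal pure-Python sketch of the idea. It only mimics the behaviour described above and is not how Keras implements timeseries_dataset_from_array; make_windows and shuffled_windows are hypothetical helper names, and the seed is arbitrary.

```python
import random


def make_windows(series, sequence_length):
    """Return every contiguous window of the series, in chronological order."""
    last_start = len(series) - sequence_length
    return [series[i:i + sequence_length] for i in range(last_start + 1)]


def shuffled_windows(series, sequence_length, seed=0):
    """Return the same windows in a random order (hypothetical helper)."""
    windows = make_windows(series, sequence_length)
    # Only the list of windows is shuffled; each window is left untouched.
    random.Random(seed).shuffle(windows)
    return windows


series = list(range(10))
ordered = make_windows(series, 2)        # [0, 1], [1, 2], [2, 3], ...
shuffled = shuffled_windows(series, 2)   # same windows, different order

# Every shuffled window is still internally chronological.
still_ordered = all(w[1] == w[0] + 1 for w in shuffled)
```

The set of training samples is identical in both cases; shuffling only changes which samples end up next to each other in a batch.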

I don't own the book yet, so I'm answering without further context. I would assume that keeping the temperature column in the input data is not a problem (quite the opposite: it is essential), because you are trying to predict the temperature at a point in the future, not at any of the timesteps the model sees. That's why you have the delay variable for data=raw_data[:-delay] and targets=temperature[delay:].
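
A small sketch of that alignment, in pure Python with made-up numbers. The column layout (pressure, temperature), the helper name windows_with_targets, and the formula delay = sequence_length + horizon - 1 are assumptions for illustration, not taken from the book; the pairing of each window with targets[start] follows the convention that the target at index i belongs to the window starting at index i.

```python
def windows_with_targets(raw_data, temperature, sequence_length, delay):
    """Pair each input window with the temperature `delay` steps after its start."""
    inputs = raw_data[:-delay]      # drop the tail that has no future target
    targets = temperature[delay:]   # shift the targets forward by `delay` steps
    pairs = []
    for start in range(len(inputs) - sequence_length + 1):
        window = inputs[start:start + sequence_length]
        pairs.append((window, targets[start]))
    return pairs


# Toy data: ten timesteps of (pressure, temperature); values are invented.
temperature = [10.0 + i for i in range(10)]
raw_data = [(1000 + i, temperature[i]) for i in range(10)]

sequence_length = 3
horizon = 2                              # predict 2 steps past the window's end
delay = sequence_length + horizon - 1    # assumed relationship, see note above

pairs = windows_with_targets(raw_data, temperature, sequence_length, delay)
first_window, first_target = pairs[0]
# first_window holds temperatures 10.0, 11.0, 12.0; first_target is 14.0.
```

The window contains past temperatures, but the target value itself never appears in it, so there is no leak as long as delay is at least sequence_length.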

Answer 2 · 2021-12-27T14:28:22.000Z

Hi @pkienle, thank you for your insights.