equinor/gordo

TimeSeriesDataset not feeding data in consecutive order

Opened this issue · 1 comments

Currently, TimeSeriesDataset is not feeding data to the model in consecutive time order when there is an interval gap in the data.
This causes poor results in LSTM-based models.

Do you actually mean non-consecutive (i.e. non-sorted, e.g. 1, 5, 3, 2), or do you mean that it has gaps (i.e. 1, 2, 5, 6)? If it is the former, then that is reasonably easy to fix: each data provider should deliver its timeseries sorted, and then the timeseries dataset will give a sorted output.
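For the non-sorted case, a minimal sketch of what a data provider could do before handing the series on (the data here is made up for illustration):

```python
import pandas as pd

# Hypothetical unsorted sensor readings, as a provider might receive them
ts = pd.Series(
    [1.0, 5.0, 3.0, 2.0],
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-05", "2020-01-03", "2020-01-02"]
    ),
)

# Deliver the series sorted by its time index
sorted_ts = ts.sort_index()
```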

If the problem is that it has holes, then it is a bit harder to change, as that is simply what happens when rows are filtered out. One could of course change the signature to return a list/stream of (X, y) pairs instead of a single (X, y) pair. But then what do you do when it is time to train the model? Most scikit-learn models do not support iterative training on several datasets: you can fit them several times, but only the data from the last fit is used. Estimators which support iterative fitting implement partial_fit, but I don't think scikit-learn pipelines or TransformedTargetRegressor play nicely with it (that is something to investigate). If they do work nicely with partial_fit, then one way forward could be to

  1. Change the signature of the dataset so it returns a list of (X, y) pairs.
  2. Implement partial_fit in the gordo Keras wrapper classes.
  3. Change the builder so it uses partial_fit instead of fit (maybe dynamically, depending on whether it gets a single (X, y) pair or a list of them).
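The three steps above might look roughly like this. This is a hypothetical sketch, not gordo's actual API: SGDRegressor merely stands in for any estimator that implements partial_fit, and fit_segments is an invented builder helper.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor  # supports partial_fit


def fit_segments(model, data):
    """Fit `model` on one (X, y) pair or a list of them.

    Hypothetical sketch of step 3: if the dataset hands back a list of
    contiguous segments, call partial_fit once per segment; otherwise
    fall back to a single fit().
    """
    if isinstance(data, list):
        for X, y in data:
            model.partial_fit(X, y)
    else:
        X, y = data
        model.fit(X, y)
    return model


# Two made-up contiguous segments, as if separated by a gap in the data
rng = np.random.default_rng(0)
segments = [
    (rng.normal(size=(50, 3)), rng.normal(size=50)),
    (rng.normal(size=(30, 3)), rng.normal(size=30)),
]
model = fit_segments(SGDRegressor(), segments)
```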

If pipelines do not support partial_fit nicely, then there is another alternative (though it is not pretty): try to identify the different segments inside the fit function of the gordo Keras LSTM class, and call the TensorFlow fit function iteratively on the identified sequences. The issue is, how does one identify the sequences? They can be identified from the index of the timeseries dataframe, but the problem is that the dataframe is long gone by the time we get down to the fit function, since the scalers return numpy arrays, not dataframes! There is a project for pandas-sklearn integration, https://github.com/scikit-learn-contrib/sklearn-pandas, which has a class DataFrameMapper that can make the scikit-learn transformers return dataframes instead, so that could be used. But how? Should it dynamically ("magically") wrap all the scikit-learn steps in a provided model definition?
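The segment-identification part is straightforward as long as the datetime index is still available (i.e. before the scalers turn everything into numpy arrays). A sketch, assuming a fixed expected resolution; the helper name split_on_gaps and the 10-minute frequency are made up for illustration:

```python
import pandas as pd


def split_on_gaps(df: pd.DataFrame, freq: str = "10min") -> list:
    """Split a time-indexed frame into contiguous segments.

    A new segment starts wherever the step between consecutive
    timestamps exceeds the expected resolution `freq`.
    """
    expected = pd.Timedelta(freq)
    # True at every row that opens a new segment; cumsum gives segment ids
    gaps = df.index.to_series().diff() > expected
    return [seg for _, seg in df.groupby(gaps.cumsum())]


# Five rows at 10-minute resolution with one gap between 00:20 and 02:00
idx = pd.to_datetime(
    ["2020-01-01 00:00", "2020-01-01 00:10", "2020-01-01 00:20",
     "2020-01-01 02:00", "2020-01-01 02:10"]
)
df = pd.DataFrame({"value": range(5)}, index=idx)
segments = split_on_gaps(df, "10min")
```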
Alternatively, one can copy the index into the dataframe as a column (or create a new boolean column which simply says whether the current row directly follows the previous one, or whether there is a gap between them), but then one must be careful not to start doing predictions on it.
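A minimal sketch of that boolean-column idea, again assuming a fixed 10-minute resolution (the column name "contiguous" is invented); note the column must be dropped before the data reaches the model, so the estimator never trains or predicts on it:

```python
import pandas as pd

# Made-up readings with a gap between 00:10 and 01:00
idx = pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:10",
                      "2020-01-01 01:00", "2020-01-01 01:10"])
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# True where the row directly follows the previous one at the
# expected 10-minute resolution (the first row compares as False)
df["contiguous"] = df.index.to_series().diff() <= pd.Timedelta("10min")

# Drop the marker again before handing features to the model
X = df.drop(columns=["contiguous"])
```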

Before starting this, I would try to collect some good data on how big a problem this is on some real machines.