StatsForecast.predict expects wrong dataframe shape on X_df
Closed this issue · 10 comments
What happened + What you expected to happen
I am trying to supply a subset of the training dataset to X_df in sf.predict, but I am getting the error ValueError: Expected X to have shape (12, 2), but got (12, 3).
Changing expected_shape = (h * len(self.ga), self.ga.data.shape[1] + 1)
to
expected_shape = (h * len(self.ga), self.ga.data.shape[1] + 2)
seems to fix the issue
Versions / Dependencies
pytorch 2.2.0 cpu_py311hd080823_0
pytorch-lightning 2.2.1 pyhd8ed1ab_0 conda-forge
statsforecast 1.7.3 pyhd8ed1ab_0 conda-forge
statsmodels 0.14.1 py311h59ca53f_0 conda-forge
Reproduction script
import os
import pandas as pd
# this makes it so that the outputs of the predict methods have the id as a column
# instead of as the index
os.environ['NIXTLA_ID_AS_COL'] = '1'
df = pd.read_csv('datasets/air-passengers.csv', parse_dates=['ds'])
df.head()
print(df)
#See https://nixtlaverse.nixtla.io/statsforecast/docs/getting-started/getting_started_short.html
# In[2]:
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA
sf = StatsForecast(
    models=[AutoARIMA(season_length=12)],
    freq='M',
)
sf.fit(df)
# In[5]:
forecast_df = sf.predict(X_df=df[-12:], h=12, level=[90])
forecast_df.tail()
# In[4]:
sf.plot(df, forecast_df, level=[90])
Issue Severity
High: It blocks me from completing my task.
Actually the quick fix expected_shape = (h * len(self.ga), self.ga.data.shape[1] + 2)
gets rid of the error but the predictions won't update if passing a different slice of the dataframe in X_df
Hey. The X_df argument is for future values of exogenous features and you don't have any, so you don't need to provide it.
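To make the shape arithmetic concrete, here is a minimal sketch in plain pandas. It assumes, as the check quoted above suggests, that the expected X_df has one row per series per forecast step and consists of the two identifier columns (unique_id, ds) plus one column per exogenous feature. The helper expected_x_df_shape and the toy dataframe are hypothetical and not part of statsforecast.

```python
import pandas as pd


def expected_x_df_shape(train_df, h, id_col="unique_id", time_col="ds", target_col="y"):
    """Hypothetical helper: shape an X_df would need under the stated assumption."""
    n_series = train_df[id_col].nunique()
    # Every column that is not an identifier or the target counts as exogenous.
    exog_cols = [c for c in train_df.columns if c not in (id_col, time_col, target_col)]
    return (h * n_series, len(exog_cols) + 2)


# Toy stand-in for the air-passengers data: one series, no exogenous features.
train = pd.DataFrame({
    "unique_id": ["AirPassengers"] * 24,
    "ds": pd.date_range("1949-01-01", periods=24, freq="MS"),
    "y": range(24),
})

print(expected_x_df_shape(train, h=12))  # ids and timestamps only
print(train.tail(12).shape)              # a slice of the training data also carries y
```

With no exogenous features, the slice of the training data has one column too many (the target y), which matches the (12, 2) versus (12, 3) mismatch in the error message.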
@jmoralez so what is the correct way to predict future values for some input other than the training dataframe?
@Vitorbnc You can just run predict(h=12, level=[90]). That will give you the predictions for the time period following the training period.
@elephaint but I would like to do the following:
- Fit the model on training data
- Predict h steps after a given input sequence other than the training period (which I was assuming was the role of X_df)
How can I supply an unseen input sequence to the model and get it to predict the next sequence?
In an ARIMA model, we train on [x_t, x_{t+1}, ..., x_{t+T}] for a particular time series. We can then make predictions for that series only, for an arbitrarily long period following x_{t+T} (by setting the horizon in our predict function). We can optionally add exogenous variables during training and during the prediction period (the latter by including them in X_df for all dates in your forecasting horizon).
You can't 'supply an unseen input sequence' to an ARIMA model. If you have an unseen input sequence, you would normally train an ARIMA model on that unseen input sequence, and subsequently create forecasts for a horizon using that newly trained model. Each new time series requires a new ARIMA model.
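As a rough illustration of why each series needs its own fit, here is a toy sketch using a least-squares AR(1) model in NumPy as a stand-in for ARIMA. It is not how statsforecast fits AutoARIMA; the functions simulate, fit_ar1 and forecast_ar1 are hypothetical names used only for this example.

```python
import numpy as np


def simulate(c, phi, start, n):
    """Generate n points of the noiseless recursion y[t] = c + phi * y[t-1]."""
    y = [start]
    for _ in range(n - 1):
        y.append(c + phi * y[-1])
    return np.array(y)


def fit_ar1(y):
    """Least-squares fit of y[t] = c + phi * y[t-1]; also keeps the last observation."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    c, phi = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    return c, phi, y[-1]


def forecast_ar1(model, h):
    """Roll the fitted recursion forward h steps from the end of the training series."""
    c, phi, last = model
    out = []
    for _ in range(h):
        last = c + phi * last
        out.append(last)
    return np.array(out)


series_a = simulate(2.0, 0.5, 10.0, 20)
series_b = simulate(1.0, 0.8, 0.0, 20)

model_a = fit_ar1(series_a)
forecast_a = forecast_ar1(model_a, h=12)  # continues series_a only

# A different series has different dynamics, so it gets its own fit.
model_b = fit_ar1(series_b)
forecast_b = forecast_ar1(model_b, h=12)
```

The fitted coefficients and the last observation both belong to the training series, so the forecasts can only continue that series. Forecasting after series_b means fitting again on series_b.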
@elephaint OK then, that makes sense. Thanks for the explanation. I am trying to do the same with mlforecast, but I am getting a different error:
I am sure there are no NaN values in either the train or the test dataframe, though.
Thanks - it's very hard to debug based on this picture only. Can you share a minimal working example of your code?
Based on the picture I can only suggest to double check the existence of NaN in train_df.
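A quick way to do that check is to count missing values and list dtypes per column. This is a minimal sketch in plain pandas; summarize_frame is a hypothetical helper and the toy dataframe only stands in for train_df.

```python
import numpy as np
import pandas as pd


def summarize_frame(df):
    """Hypothetical helper: NaN count and dtype for every column."""
    return pd.DataFrame({
        "n_nan": df.isna().sum(),
        "dtype": df.dtypes.astype(str),
    })


# Toy stand-in for train_df with one missing target value.
toy = pd.DataFrame({
    "unique_id": ["A"] * 4,
    "ds": ["1949-01-31", "1949-02-28", "1949-03-31", "1949-04-30"],
    "y": [112.0, 118.0, np.nan, 129.0],
})

report = summarize_frame(toy)
print(report)
```

Looking at the dtypes alongside the NaN counts can also reveal a ds column that was read as text rather than as datetimes.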
Note that you are training and testing on the same timestamps. I assume that is on purpose, as it's something you'd normally want to avoid in forecasting? You're supplying the full train_df as the training set and using a subset of train_df as the test set, so any test results you get will not be representative of the actual forecasting performance. Normally one would do something like this:
train_df = df[:-12]
test_df = df[-12:]
in order to properly separate the train and test sets.
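The slicing above works for a single series. For a long-format dataframe with several series, a per-series version of the same idea could look like the sketch below. It assumes the usual unique_id and ds columns; split_last_h is a hypothetical helper, not a library function.

```python
import pandas as pd


def split_last_h(df, h, id_col="unique_id", time_col="ds"):
    """Hypothetical helper: hold out the last h timestamps of every series."""
    df = df.sort_values([id_col, time_col]).reset_index(drop=True)
    test = df.groupby(id_col).tail(h)
    train = df.drop(test.index)
    return train, test


# Toy data: two monthly series of 24 observations each.
dates = pd.date_range("1949-01-01", periods=24, freq="MS")
toy = pd.concat([
    pd.DataFrame({"unique_id": "A", "ds": dates, "y": range(24)}),
    pd.DataFrame({"unique_id": "B", "ds": dates, "y": range(100, 124)}),
])

train_part, test_part = split_last_h(toy, h=12)
print(train_part.shape, test_part.shape)
```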
@elephaint Yes, I will separate them in the real use case, for now I am just trying to write a single function that can take ML, Stats and Neural models and predict future data for comparison.
Here is part of the code:
import pandas as pd
from utilsforecast.plotting import plot_series
import os
# Change output of cross-validation so it has the ids of the series as a column rather than as the index
os.environ['NIXTLA_ID_AS_COL'] = '1'
# See https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started.html
train_df = pd.read_csv('https://datasets-nixtla.s3.amazonaws.com/air-passengers.csv')
train_df.head()
# Forecasting horizon (number of time steps to forecast)
horizon = 12
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA
sf = StatsForecast(
    models=[AutoARIMA(season_length=12)],
    freq='M',
)
sf.fit(train_df)
from mlforecast import MLForecast
from neuralforecast import NeuralForecast  # referenced by isinstance() in predict_multiple
from mlforecast.target_transforms import Differences
from sklearn.linear_model import LinearRegression
ml = MLForecast(
    models=LinearRegression(),
    freq='MS',  # our series has a monthly frequency
    lags=[horizon],
    target_transforms=[Differences([1])],
)
ml.fit(train_df)
def predict_multiple(forecast_objs: list, horizon: int, test_df: pd.DataFrame, train_df: pd.DataFrame):
    if not len(forecast_objs):
        return
    preds = []
    pred_cols_ref = {}
    standard_cols = ['unique_id', 'ds']  # NeuralForecast standard columns
    standard_cols_idx = standard_cols + ['index']
    for obj in forecast_objs:
        if isinstance(obj, StatsForecast):
            if not train_df[-horizon:].equals(test_df[-horizon:]):
                # pd.concat expects a list of dataframes
                new_df = pd.concat([train_df, test_df])
                preds.append(obj.forecast(h=horizon, df=new_df).reset_index())
            else:
                preds.append(obj.predict(h=horizon).reset_index())
        elif isinstance(obj, NeuralForecast):
            preds.append(obj.predict(df=test_df).reset_index())  # Same horizon must have been passed when model was built
        elif isinstance(obj, MLForecast):
            print(test_df.reset_index())
            preds.append(obj.predict(new_df=test_df, h=horizon))
        else:
            raise TypeError('forecast_objs must be a list of <Neural,Stats,ML>Forecast instances', forecast_objs)
        # Store reference to current prediction dataframe in column dict
        for col in preds[-1].columns:
            if col not in standard_cols_idx:
                pred_cols_ref[col] = preds[-1]
    #print(f'Prediction Shapes: Stats={pred_sf.shape}, ML={pred_ml.shape}, Neural={pred_nf.shape}')
    # Merge predictions
    pred_wide = preds[0]
    for i in range(len(preds)):
        if i > 0:
            pred_wide = pred_wide.merge(preds[i], how='left', on=standard_cols)
    # Fill NaN in case values are lost during merge
    filtered_cols = [x for x in pred_wide.columns if x not in standard_cols_idx]
    for col in filtered_cols:
        print(col)
        # fillna returns a new Series, so assign it back
        pred_wide[col] = pred_wide[col].fillna(pred_cols_ref[col][col])
    print(f'Merged Predictions Shape: {pred_wide.shape}')
    return pred_wide
pred_wide = predict_multiple([sf, ml], test_df=train_df[-12:], train_df=train_df, horizon=horizon)
pred_wide.head()
Thanks for the code. I ran into three issues when running it; fixing all three produces correct forecasts.
- You need to set the ds column to have datetime format, so insert the following line after train_df.head():
  train_df["ds"] = pd.to_datetime(train_df["ds"])
- You need to set frequency to M:
  ml = MLForecast(
      models=LinearRegression(),
      freq='M',  # our series has a monthly frequency
      lags=[horizon],
      target_transforms=[Differences([1])],
  )
- You don't have to supply the test_df to the predict function, so this is what your predict function should look like:
preds.append(obj.predict(h=horizon))
I'd advise you to read the end-to-end walkthrough of ML Forecast, which may help you avoid these and potential further issues.
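The first two fixes can be checked with plain pandas before any model is fitted. The sketch below uses a small inline CSV as a stand-in for the air-passengers file and assumes its timestamps fall on month ends, which is consistent with the freq='M' suggestion above. The sample rows are illustrative only.

```python
import io

import pandas as pd

# Inline stand-in for the CSV; the rows are illustrative, not the real file.
csv_text = (
    "unique_id,ds,y\n"
    "AirPassengers,1949-01-31,112\n"
    "AirPassengers,1949-02-28,118\n"
    "AirPassengers,1949-03-31,132\n"
    "AirPassengers,1949-04-30,129\n"
)

# Without parse_dates, read_csv leaves ds as text.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["ds"].dtype)

# Convert to datetimes, then let pandas infer the frequency alias.
parsed = raw.copy()
parsed["ds"] = pd.to_datetime(parsed["ds"])
inferred = pd.infer_freq(pd.DatetimeIndex(parsed["ds"]))
print(parsed["ds"].dtype, inferred)
```

For month-end stamps the inferred alias is the month-end one (spelled 'M' or 'ME' depending on the pandas version), not the month-start alias 'MS'.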