StatsForecast.predict expects wrong dataframe shape on X_df

Question

StatsForecast.predict expects wrong dataframe shape on X_df

Closed this issue 7 months ago · 10 comments

What happened + What you expected to happen

I am trying to supply a subset of the training dataset to X_df in sf.predict but I am getting the error ValueError: Expected X to have shape (12, 2), but got (12, 3).

Changing expected_shape = (h * len(self.ga), self.ga.data.shape[1] + 1) to
expected_shape = (h * len(self.ga), self.ga.data.shape[1] + 2) seems to fix the issue

Versions / Dependencies

pytorch 2.2.0 cpu_py311hd080823_0
pytorch-lightning 2.2.1 pyhd8ed1ab_0 conda-forge
statsforecast 1.7.3 pyhd8ed1ab_0 conda-forge
statsmodels 0.14.1 py311h59ca53f_0 conda-forge

Reproduction script

import os
import pandas as pd

# this makes it so that the outputs of the predict methods have the id as a column 
# instead of as the index
os.environ['NIXTLA_ID_AS_COL'] = '1'

df = pd.read_csv('datasets/air-passengers.csv', parse_dates=['ds'])
df.head()
print(df)
#See https://nixtlaverse.nixtla.io/statsforecast/docs/getting-started/getting_started_short.html


# In[2]:


from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

sf = StatsForecast(
    models=[AutoARIMA(season_length = 12)],
    freq='M',
)

sf.fit(df)


# In[5]:


forecast_df = sf.predict(X_df=df[-12:], h=12, level=[90])
forecast_df.tail()


# In[4]:


sf.plot(df, forecast_df, level=[90])

Issue Severity

High: It blocks me from completing my task.

Answer 1 · 2024-03-10T22:43:43.000Z

Actually, the quick fix expected_shape = (h * len(self.ga), self.ga.data.shape[1] + 2) gets rid of the error, but the predictions don't change when a different slice of the dataframe is passed in X_df.

Answer 2 · 2024-03-10T23:21:39.000Z

Hey. The X_df argument is for future values of exogenous features and you don't have any, so you don't need to provide it.
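To see where (12, 2) versus (12, 3) comes from, here is a minimal sketch of the shape check quoted in the question. It assumes (this is not confirmed in the thread) that self.ga.data holds the target plus any exogenous columns, so X_df is expected to carry unique_id, ds, and one column per exogenous feature, but not y. The helper name expected_x_shape is hypothetical.

```python
def expected_x_shape(h: int, n_series: int, n_data_cols: int) -> tuple[int, int]:
    """Mirror of the quoted check: (h * n_series, n_data_cols + 1).

    Assumption: n_data_cols counts the target column plus exogenous columns,
    so "+ 1" nets out to unique_id + ds + exogenous columns (y is dropped).
    """
    return (h * n_series, n_data_cols + 1)


# One series, no exogenous features: the training data has a single column (y).
expected = expected_x_shape(h=12, n_series=1, n_data_cols=1)

# df[-12:] still contains y, so it has unique_id, ds, y: three columns.
passed = (12, 3)

print(expected, passed, expected == passed)
```

Under that assumption the mismatch is the extra y column, which is why bumping the constant to + 2 silences the error without making the model use the slice.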

Answer 3 · 2024-03-10T23:59:59.000Z

@jmoralez so what is the correct way to predict future values for some input other than the training dataframe?

Answer 4 · 2024-03-11T08:04:14.000Z

@Vitorbnc You can just run predict(h=12, level=[90]). That will give you the predictions for the time period following the training period.

Answer 5 · 2024-03-11T10:10:52.000Z

@elephaint but I would like to do the following:

1. Fit the model on training data
2. Predict h steps after a given input sequence other than the training period (which I was assuming was the role of X_df)

How can I supply an unseen input sequence to the model and get it to predict the next sequence?

Answer 6 · 2024-03-11T12:00:41.000Z

In an ARIMA model, we train on [x_t, x_{t+1}, ..., x_{t+T}] for a particular time series. We can then make predictions for that series only, for an arbitrarily long period following x_{t+T} (by setting the horizon in the predict function). We can optionally add exogenous variables during training and during the prediction period (the latter by including them in X_df for all dates in the forecasting horizon).

You can't 'supply an unseen input sequence' to an ARIMA model. If you have an unseen input sequence, you would normally train an ARIMA model on that unseen input sequence, and subsequently create forecasts for a horizon using that newly trained model. Each new time series requires a new ARIMA model.
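As a rough illustration of why a fitted model is tied to its training series, here is a sketch using a plain AR(1) model without intercept. This is a deliberately simplified stand-in, not what AutoARIMA does internally, and the function names are made up for this example. The coefficient is estimated from one series and forecasts always start from that series' last value, so a second series needs its own fit.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x_t = phi * x_{t-1} + noise (no intercept)."""
    num = sum(cur * prev for cur, prev in zip(series[1:], series[:-1]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den


def forecast_ar1(series, phi, h):
    """Roll the recursion forward h steps from the last observed value."""
    preds, last = [], series[-1]
    for _ in range(h):
        last = phi * last
        preds.append(last)
    return preds


series_a = [1.0, 0.5, 0.25, 0.125]   # halves at every step
series_b = [8.0, 7.2, 6.48, 5.832]   # shrinks by 10% at every step

# Each series gets its own coefficient and its own forecasts.
phi_a = fit_ar1(series_a)
phi_b = fit_ar1(series_b)
preds_a = forecast_ar1(series_a, phi_a, h=2)
preds_b = forecast_ar1(series_b, phi_b, h=2)

print(phi_a, preds_a)
print(phi_b, preds_b)
```

Reusing phi_a on series_b would apply the wrong dynamics, which is the small-scale version of "each new time series requires a new ARIMA model".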

Answer 7 · 2024-03-11T12:16:56.000Z

@elephaint ok then, that makes sense. Thanks for the explanation. I am trying to do the same with mlforecast, but I am getting a different error:

I am sure there are no NaN values in either the train or the test dataframe, though.

Answer 8 · 2024-03-11T13:00:19.000Z

Thanks - it's very hard to debug based on this picture only. Can you share a minimal working example of your code?

Based on the picture, I can only suggest double-checking for NaN values in train_df.
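A quick way to do that check, plus one possible (unconfirmed) way NaN can show up even when the inputs have none. The screenshot's error text is not reproduced in this thread, so the second half is only a guess. It assumes the feature is roughly "lag 12 of the first-differenced series", computed here with plain pandas rather than mlforecast; lag_feature_for_next_step is a hypothetical helper.

```python
import pandas as pd

# Stand-in for a 12-row test slice with no missing values.
y = pd.Series(range(100, 112), dtype="float64")
frame = pd.DataFrame({"unique_id": "series_1", "y": y})

# 1) Count NaN per column; all zeros means the raw input is clean.
nan_counts = frame.isna().sum()
print(nan_counts)


def lag_feature_for_next_step(values: pd.Series, lag: int, diff: int) -> float:
    """Value of the differenced series `lag` steps before the next time step."""
    differenced = values.diff(diff)  # the first `diff` entries become NaN
    if len(differenced) < lag:
        return float("nan")
    return float(differenced.iloc[-lag])


# 2) With only 12 rows, lag 12 of the differenced series lands on the leading NaN.
short = lag_feature_for_next_step(y, lag=12, diff=1)

# With 13 rows there is just enough history for one valid feature value.
y_longer = pd.Series(range(100, 113), dtype="float64")
longer = lag_feature_for_next_step(y_longer, lag=12, diff=1)

print(short, longer)
```

If something like this is what happens, the raw dataframes can be free of NaN while the derived features are not, simply because the slice passed at prediction time is too short for the configured lags and differences.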

Note that you are training and testing on the same timestamps. I assume that is on purpose, as it's something you'd normally want to avoid in forecasting? That is, you're supplying the full train_df as the training set and using a subset of train_df as the test set, so any test results you get will not be representative of actual forecasting performance. Normally one would do something like this:

train_df = df[:-12]
test_df = df[-12:]

in order to properly separate the train and test sets.
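For a single series the slicing above is enough. With several series in one long-format frame, a sketch along these lines splits off the last h rows of each series instead. It assumes the frame is sorted by ds within each unique_id, and the data here is synthetic.

```python
import pandas as pd

h = 12  # forecasting horizon

# Synthetic long-format data: two series, 36 monthly observations each.
dates = pd.date_range("2020-01-31", periods=36, freq=pd.offsets.MonthEnd())
df = pd.concat(
    [pd.DataFrame({"unique_id": uid, "ds": dates, "y": range(36)}) for uid in ("a", "b")],
    ignore_index=True,
)

# Last h rows of every series form the test set; everything else is training data.
test_df = df.groupby("unique_id").tail(h)
train_df = df.drop(test_df.index)

# Within each series, the training data ends before the test data starts.
print(train_df.groupby("unique_id")["ds"].max())
print(test_df.groupby("unique_id")["ds"].min())
```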

Answer 9 · 2024-03-11T14:15:48.000Z

@elephaint Yes, I will separate them in the real use case. For now I am just trying to write a single function that can take ML, Stats and Neural models and predict future data for comparison.
Here is part of the code:

import pandas as pd
from utilsforecast.plotting import plot_series

import os
# Change output of cross-validation so it has the ids of the series as a column rather than as the index
os.environ['NIXTLA_ID_AS_COL'] = '1'

# See https://nixtlaverse.nixtla.io/neuralforecast/examples/getting_started.html
train_df = pd.read_csv('https://datasets-nixtla.s3.amazonaws.com/air-passengers.csv')
train_df.head()

# Forecasting horizon (number of time steps to forecast)
horizon = 12

from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

sf = StatsForecast(
    models=[AutoARIMA(season_length = 12)],
    freq='M',
)

sf.fit(train_df)

from mlforecast import MLForecast
from mlforecast.target_transforms import Differences
from neuralforecast import NeuralForecast  # needed by the isinstance check in predict_multiple
from sklearn.linear_model import LinearRegression

ml = MLForecast(
    models=LinearRegression(),
    freq='MS',  # our series has a monthly frequency
    lags=[horizon],
    target_transforms=[Differences([1])],
)
ml.fit(train_df)

def predict_multiple(forecast_objs:list, horizon:int, test_df:pd.DataFrame, train_df:pd.DataFrame):
    if not(len(forecast_objs)): return
    preds = []
    pred_cols_ref = {}
    standard_cols = ['unique_id', 'ds'] # NeuralForecast standard columns
    standard_cols_idx = standard_cols+['index'] 
    for obj in forecast_objs:
        if isinstance(obj, StatsForecast):
            if not train_df[-horizon:].equals(test_df[-horizon:]):
                new_df = pd.concat([train_df, test_df])  # concat takes a list of frames
                preds.append(obj.forecast(h=horizon,df=new_df).reset_index())
            else:
                preds.append(obj.predict(h=horizon).reset_index())
        elif isinstance(obj, NeuralForecast):
            preds.append(obj.predict(df=test_df).reset_index()) # Same horizon must have been passed when model was built
        elif isinstance(obj,MLForecast):
            print(test_df.reset_index())
            preds.append(obj.predict(new_df=test_df, h=horizon))
        else:
            raise TypeError('forecast_objs must be a list of <Neural,Stats,ML>Forecast instances', forecast_objs)
        # Store reference to current prediction dataframe in column dict
        for col in preds[-1].columns: 
            if col not in standard_cols_idx: pred_cols_ref[col] = preds[-1]
        
    #print(f'Prediction Shapes: Stats={pred_sf.shape}, ML={pred_ml.shape}, Neural={pred_nf.shape}')
        
    # Merge predictions 
    pred_wide = preds[0]
    for pred in preds[1:]:
        pred_wide = pred_wide.merge(pred, how='left', on=standard_cols)

    # Fill NaN in case values are lost during merge
    filtered_cols = [x for x in pred_wide.columns if x not in standard_cols_idx]
    for col in filtered_cols:
        print(col)
        # fillna returns a new Series, so assign the result back
        pred_wide[col] = pred_wide[col].fillna(pred_cols_ref[col][col])

    print(f'Merged Predictions Shape: {pred_wide.shape}')
    return pred_wide

pred_wide = predict_multiple([sf, ml], test_df=train_df[-12:], train_df=train_df, horizon=horizon)

pred_wide.head()

Answer 10 · 2024-03-12T08:43:32.000Z

Thanks for the code. I ran into three issues when running it; fixing all three produces correct forecasts.

You need to set the ds column to have datetime format, so insert the following line after train_df.head():
train_df["ds"] = pd.to_datetime(train_df["ds"])

You need to set frequency to M:

ml = MLForecast(
    models=LinearRegression(),
    freq='M',  # our series has a monthly frequency
    lags=[horizon],
    target_transforms=[Differences([1])],
)

You don't have to supply test_df to the predict function, so this is what your predict call should look like:
preds.append(obj.predict(h=horizon))
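To make the first two fixes concrete, here is a small pandas-only sketch. It assumes (not stated in the thread) that the dataset's timestamps fall on month ends, which would explain why a month-end frequency fits and month-start 'MS' does not. The inline CSV is made up, and offset objects are used instead of string aliases because the alias spelling for month end differs across pandas versions.

```python
import io

import pandas as pd

# Made-up sample in the same long format as the dataset.
csv_text = "unique_id,ds,y\n1,1960-10-31,10\n1,1960-11-30,20\n1,1960-12-31,30\n"
train_df = pd.read_csv(io.StringIO(csv_text))

# Without parse_dates, read_csv leaves ds as text rather than datetimes.
was_datetime = pd.api.types.is_datetime64_any_dtype(train_df["ds"])
train_df["ds"] = pd.to_datetime(train_df["ds"])
is_datetime = pd.api.types.is_datetime64_any_dtype(train_df["ds"])

# The next timestamp depends on the frequency convention.
last = train_df["ds"].max()
next_month_end = last + pd.offsets.MonthEnd(1)      # continues the month-end pattern
next_month_start = last + pd.offsets.MonthBegin(1)  # jumps to the first of the month

print(was_datetime, is_datetime, next_month_end.date(), next_month_start.date())
```

Under that assumption, a month-start frequency would generate future dates that do not line up with the month-end dates in the training data.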

I'd advise you to read the end-to-end walkthrough of MLForecast, which may help you avoid these and other potential issues.