`aggregate` adds leading zeros to series with different dates
NudnikShpilkis commented
The `aggregate` function adds leading zeros to datasets in which the time series start on different dates. Here's a minimal example:
import pandas as pd
import statsforecast.models as sfm
import hierarchicalforecast.methods as hfm
from statsforecast.utils import generate_series
from statsforecast import StatsForecast
from hierarchicalforecast.utils import aggregate
from hierarchicalforecast.core import HierarchicalReconciliation
max_tenure = 24
dates = pd.date_range(start='2019-01-31', freq='M', periods=max_tenure)
cohort_tenure = [24, 23, 22, 21]
ts_list = []
# Create ts for each cohort
for i in range(len(cohort_tenure)):
    ts_list.append(
        generate_series(n_series=1, freq='M', min_length=cohort_tenure[i], max_length=cohort_tenure[i]).reset_index() \
            .assign(ult=i) \
            .assign(ds=dates[-cohort_tenure[i]:]) \
            .drop(columns=['unique_id'])
    )
df = pd.concat(ts_list, ignore_index=True)
# Create categories
df.loc[df['ult'] < 2, 'pen'] = 'a'
df.loc[df['ult'] >= 2, 'pen'] = 'b'
# Note that unique id requires strings
df['ult'] = df['ult'].astype(str)
hier_levels = [
['pen'],
['pen', 'ult'],
]
hier_df, S_df, tags = aggregate(df=df, spec=hier_levels)
hier_df = hier_df.reset_index()
# .query("unique_id.str.split('/').str[0] <= ds.dt.strftime('%Y-%m')")
print('S_df.shape', S_df.shape)
print('hier_df.shape', hier_df.shape)
If you query the 3rd cohort in the input `df`, you see dates starting with 2019-03-31, as expected:
df.query("ult == '2'")
But if you query `hier_df`, the output of `aggregate`, you'll see dates starting from 2019-01-31, the earliest date in the dataset:
hier_df.query("unique_id.str.split('/').str[-1] == '2'")
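To list exactly which rows were padded, one option is to compare each bottom-level series in the aggregated frame against its first observed date in the input. The snippet below is only a sketch on a tiny hand-built stand-in for `df` and `hier_df` (two cohorts instead of four, and the padded zero row is typed in by hand rather than produced by `aggregate`). It assumes bottom-level ids are formed as `pen/ult`, as the `split('/')` query above suggests; the names `first_seen`, `checked`, and `padded_rows` are made up for illustration.

```python
import pandas as pd

# Three month-end dates; cohort '1' starts one month after cohort '0'.
dates = pd.to_datetime(['2019-01-31', '2019-02-28', '2019-03-31'])

# Stand-in for the input df from the issue.
df = pd.DataFrame({
    'pen': ['a'] * 5,
    'ult': ['0'] * 3 + ['1'] * 2,
    'ds': list(dates) + list(dates[1:]),
    'y': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Hand-built imitation of the bottom-level rows of hier_df,
# including a padded zero row for cohort '1' on 2019-01-31.
hier_df = pd.DataFrame({
    'unique_id': ['a/0'] * 3 + ['a/1'] * 3,
    'ds': list(dates) * 2,
    'y': [1.0, 2.0, 3.0, 0.0, 4.0, 5.0],
})

# First observed date per bottom-level series, keyed like the aggregated ids.
first_seen = (
    df.assign(unique_id=df['pen'] + '/' + df['ult'])
      .groupby('unique_id')['ds'].min()
      .rename('first_ds')
      .reset_index()
)

# Rows dated before a series' first observation were not in the input.
checked = hier_df.merge(first_seen, on='unique_id', how='left')
padded_rows = checked[checked['ds'] < checked['first_ds']]
print(padded_rows[['unique_id', 'ds', 'y']])
```

Upper-level ids such as `a` have no match in `first_seen`, so this check only covers the bottom level; the aggregated levels would need their own start dates.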
If you remove the leading zeros, `reconcile` fails because `forecast_fitted_values` cannot be reshaped to the length of `S_df`.
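The arithmetic behind that failure can be illustrated with plain NumPy. This is not the library's code, only a guess at the shape constraint: it assumes the fitted values arrive as one flat vector that is reshaped into one row per series in `S_df` (6 rows here: 2 `pen` aggregates plus 4 `pen/ult` cohorts). With padding every series has 24 rows, so 144 values reshape cleanly; once the leading zeros are trimmed, the series have 24, 22, 24, 23, 22, and 21 rows, and 136 values do not divide into 6 equal rows.

```python
import numpy as np

n_series = 6       # 2 'pen' aggregates + 4 'pen/ult' bottom series
full_length = 24   # months in the full date range

# Series lengths after trimming leading zeros:
# pen a spans 24 months, pen b spans 22, the cohorts span 24, 23, 22, 21.
trimmed_lengths = [24, 22, 24, 23, 22, 21]

padded = np.zeros(n_series * full_length)   # 144 values
trimmed = np.zeros(sum(trimmed_lengths))    # 136 values

# Balanced panel: one row per series, one column per date.
padded_matrix = padded.reshape(n_series, -1)

# Unbalanced panel: 136 is not a multiple of 6, so NumPy refuses.
try:
    trimmed.reshape(n_series, -1)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print('reshape failed:', err)
```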