heidelbergcement/hcrystalball

[OTHER] Feedback on reduction to regression `get_sklearn_wrapper`

mloning opened this issue · 5 comments

Hi, following our conversation yesterday, I have two questions about the get_sklearn_wrapper:

1. API: Why do you dynamically create the init and pass the class rather than simply using composition and passing the object?

forecaster = get_sklearn_wrapper(DummyRegressor, lags=3)  # you do this
forecaster = get_sklearn_wrapper(DummyRegressor(), lags=3)  # why not this?
  • This would avoid the issues you seem to have with local class definitions and pickling
  • This would allow you to pass composite models like pipelines or GridSearchCV into the wrapper (see the sketch after this list)
  • I really like the dynamic init creator, but my intuition tells me it's misapplied here.
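For illustration, here is a hypothetical sketch of what the composition-based variant could look like; the instance-passing get_sklearn_wrapper call is an assumption for the sake of the example, not the current API:

from hcrystalball.wrappers import get_sklearn_wrapper
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# hypothetical: the wrapper stores the estimator instance instead of
# re-creating its class, so composites work without special handling
pipeline = make_pipeline(StandardScaler(), Ridge())
forecaster = get_sklearn_wrapper(pipeline, lags=3)  # hypothetical instance-passing API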

2. Algorithm: Why not use the standard recursive strategy?

from hcrystalball.wrappers import get_sklearn_wrapper
from sklearn.dummy import DummyRegressor
import pandas as pd
import numpy as np

index = pd.date_range("2000", periods=13, freq="Y")
y_train = pd.Series(np.arange(10), index=index[:-3])
X_train = pd.DataFrame(index=y_train.index)
X_test = pd.DataFrame(index=index[-3:])

model = get_sklearn_wrapper(DummyRegressor, lags=3)
model.fit(X_train, y_train)
model.predict(X_test)
# >>> 2010-12-31	7.0
# >>> 2011-12-31	7.0
# >>> 2012-12-31	7.0

# you use the first 3 values as lagged variables; the DummyRegressor simply computes the mean of the rest
# so shouldn't the result be the following?
y_train.iloc[3:].mean()  # >>> 6.0

# the problem seems to be in the way you generate the target series:
# you increase the gap between the lagged variables and the target to match
# the length of the forecasting horizon
X, y = model._transform_data_to_tsmodel_input_format(X_train, y_train, len(X_test))
pd.concat([X, pd.Series(y, index=X.index, name="y")], axis=1).head()
#    lag_0  lag_1  lag_2  y
# 5    2.0    1.0    0.0  5
# 6    3.0    2.0    1.0  6
# 7    4.0    3.0    2.0  7
# 8    5.0    4.0    3.0  8
# 9    6.0    5.0    4.0  9
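As a sanity check on the gap hypothesis, the mean of that shifted target column matches the 7.0 predicted above:

y_train.iloc[5:].mean()  # >>> 7.0, the mean of the constructed target [5, 6, 7, 8, 9]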

Hope this helps!

@mloning regarding the DummyRegressor result: the default strategy of DummyRegressor is "mean", but it is the mean of the whole training dataset.

Excerpt from the DummyRegressor docstring:

Parameters
    ----------
    strategy : str
        Strategy to use to generate predictions.

        * "mean": always predicts the mean of the training set

Examples
    --------
    >>> import numpy as np
    >>> from sklearn.dummy import DummyRegressor
    >>> X = np.array([1.0, 2.0, 3.0, 4.0])
    >>> y = np.array([2.0, 3.0, 5.0, 10.0])
    >>> dummy_regr = DummyRegressor(strategy="mean")
    >>> dummy_regr.fit(X, y)
    DummyRegressor()
    >>> dummy_regr.predict(X)
    array([5., 5., 5., 5.])
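
A quick check that ties this back to the example above: fitting a plain DummyRegressor on the (X, y) constructed by the wrapper reproduces its prediction, since the mean of that target is 7.0 (reusing X and y from _transform_data_to_tsmodel_input_format):

reg = DummyRegressor().fit(X, y)
reg.predict(X[:1])  # >>> array([7.]), the mean of the target [5, 6, 7, 8, 9]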

@therhaag any thoughts on the class-versus-object question raised by Markus?

Hi @mloning - sorry for the delayed response. The reason for passing the class rather than using composition is to avoid the need for a custom get_params/set_params implementation that distinguishes between parameters of the wrapper and parameters of the wrapped class. The sklearn implementation of these functions in base.py discovers all possible parameters of an estimator from the signature of its constructor, so we need to expose those through the constructor of the wrapper. There may be a simpler way to do this, though, closer to how composite models behave - in fact, I remember looking into that option, but I can't remember what the difficulty was, so I'd be happy to consider alternatives.
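
For context, this discovery mechanism is easy to see directly: BaseEstimator._get_param_names inspects the __init__ signature, so get_params can only report what the constructor declares (a minimal illustration using DummyRegressor, not hcrystalball code):

import inspect
from sklearn.dummy import DummyRegressor

# only explicit constructor arguments become visible to get_params()
sig = inspect.signature(DummyRegressor.__init__)
print([name for name in sig.parameters if name != "self"])
# >>> ['strategy', 'constant', 'quantile']
print(DummyRegressor(strategy="median").get_params())
# >>> {'constant': None, 'quantile': None, 'strategy': 'median'}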

@therhaag but sklearn supports composition with nested parameters using their double-underscore convention; I think that would be the cleaner solution in the end.
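
For reference, the double-underscore convention already reaches nested parameters out of the box, e.g. with a Pipeline, so a composed wrapper would not need a custom get_params/set_params:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyRegressor

# nested parameters are addressed as <component>__<param>
pipe = Pipeline([("scale", StandardScaler()), ("model", DummyRegressor())])
pipe.set_params(model__strategy="median")
print(pipe.get_params()["model__strategy"])  # >>> median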

Hi @mloning - I think you're right. We added the sklearn wrapper after the other wrappers, which are designed for models that don't themselves follow the sklearn convention, and simply applied the same mechanism. For sklearn models, one could indeed use composition.