[OTHER] Feedback on reduction to regression `get_sklearn_wrapper`
mloning opened this issue · 5 comments
Hi, following our conversation yesterday, I have two questions about `get_sklearn_wrapper`:
1. API: Why do you dynamically create the init and pass the class rather than simply using composition and passing the object?
forecaster = get_sklearn_wrapper(DummyRegressor, lags=3) # you do this
forecaster = get_sklearn_wrapper(DummyRegressor(), lags=3) # why not this?
- This would avoid the issues you seem to have with local class definitions and pickling
- This would allow you to pass composite models like pipelines or GridSearchCV into the wrapper
- I really like the dynamic init creator, but my intuition tells me it's misapplied here.
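To illustrate the second bullet, here is a hypothetical sketch of what the composition-style API would enable: any estimator object, including composites like pipelines, could be handed to the wrapper directly. The `get_sklearn_wrapper(pipe, lags=3)` call is illustrative only and is not supported by the current class-based API:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# A composite estimator object; under the current class-based API this
# cannot be passed to get_sklearn_wrapper, since only a class is accepted.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Hypothetical composition-style call (not the current API):
# forecaster = get_sklearn_wrapper(pipe, lags=3)
print(type(pipe).__name__)  # Pipeline
```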
2. Algorithm: Why not use the standard recursive strategy?
from hcrystalball.wrappers import get_sklearn_wrapper
from sklearn.dummy import DummyRegressor
import pandas as pd
import numpy as np
index = pd.date_range("2000", periods=13, freq="Y")
y_train = pd.Series(np.arange(10), index=index[:-3])
X_train = pd.DataFrame(index=y_train.index)
X_test = pd.DataFrame(index=index[-3:])
model = get_sklearn_wrapper(DummyRegressor, lags=3)
model.fit(X_train, y_train)
model.predict(X_test)
# >>> 2010-12-31 7.0
# >>> 2011-12-31 7.0
# >>> 2012-12-31 7.0
# you use the first 3 values as lagged variables, the DummyRegressor simply computes the mean of the rest
# so shouldn't the result be the following?
y_train.iloc[3:].mean() # >>> 6.0
# the problem seems to be in the way you generate the target series
# you increase the gap between the lagged variables and the target to match the length of
# the forecasting horizon
X, y = model._transform_data_to_tsmodel_input_format(X_train, y_train, len(X_test))
pd.concat([X, pd.Series(y, index=X.index, name="y")], axis=1).head()
# lag_0 lag_1 lag_2 y
#5 2.0 1.0 0.0 5
#6 3.0 2.0 1.0 6
#7 4.0 3.0 2.0 7
#8 5.0 4.0 3.0 8
#9 6.0 5.0 4.0 9
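For comparison, here is a minimal sketch of the standard recursive strategy in plain sklearn (no hcrystalball; variable names are illustrative): train on one-step-ahead targets only, then feed each prediction back into the lag window. Because DummyRegressor ignores the features and predicts the mean of its training targets, every step yields y_train.iloc[3:].mean() == 6.0:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor

y_train = pd.Series(np.arange(10, dtype=float))
lags = 3

# Lag matrix with a one-step gap only (no horizon-sized gap):
# row t holds y[t-1], y[t-2], y[t-3] and the target is y[t].
X = np.column_stack([y_train.shift(i + 1) for i in range(lags)])[lags:]
y = y_train.values[lags:]

model = DummyRegressor(strategy="mean").fit(X, y)

# Recursive multi-step prediction: append each forecast to the history
# so it can serve as a lagged input for the next step.
history = list(y_train.values)
preds = []
for _ in range(3):
    x_next = np.array(history[-lags:][::-1]).reshape(1, -1)
    preds.append(float(model.predict(x_next)[0]))
    history.append(preds[-1])
print(preds)  # [6.0, 6.0, 6.0] == y_train.iloc[3:].mean() at every step
```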
Hope this helps!
@mloning regarding the DummyRegressor output: the default strategy of DummyRegressor is "mean", but it is the mean of the whole training dataset.
Parts from the DummyRegressor docstring:
Parameters
----------
strategy : str
Strategy to use to generate predictions.
* "mean": always predicts the mean of the training set
Examples
--------
>>> import numpy as np
>>> from sklearn.dummy import DummyRegressor
>>> X = np.array([1.0, 2.0, 3.0, 4.0])
>>> y = np.array([2.0, 3.0, 5.0, 10.0])
>>> dummy_regr = DummyRegressor(strategy="mean")
>>> dummy_regr.fit(X, y)
DummyRegressor()
>>> dummy_regr.predict(X)
array([5., 5., 5., 5.])
@therhaag any thoughts on the class versus object mentioned by Markus?
Hi @mloning - sorry for the delayed response. The reason for passing the class rather than using composition is to avoid the need for a custom get_params/set_params implementation that distinguishes between parameters of the wrapper and parameters of the class being wrapped. The sklearn implementation of these functions in base.py discovers all possible parameters of the estimator from the signature of the constructor, so we need to expose those through the constructor of the wrapper. There may be a simpler way to do this, though, closer to how composite models behave - in fact, I remember looking into this option, but I can't remember what the difficulty was, so I'd be happy to consider alternatives.
@therhaag but sklearn already supports composition with nested parameters using its double-underscore convention; I think that would be the cleaner solution in the end.
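For reference, a minimal sketch of the double-underscore convention, assuming a hypothetical composition-based wrapper (the `SklearnWrapper` class here is illustrative, not hcrystalball's implementation). sklearn's `BaseEstimator.get_params(deep=True)` and `set_params` already traverse estimator-valued parameters, so no custom implementation would be needed:

```python
from sklearn.base import BaseEstimator
from sklearn.dummy import DummyRegressor

class SklearnWrapper(BaseEstimator):
    """Hypothetical composition-based wrapper (name is illustrative)."""
    def __init__(self, estimator=None, lags=3):
        self.estimator = estimator
        self.lags = lags

w = SklearnWrapper(DummyRegressor(), lags=3)
# Nested parameters of the wrapped object are reachable via the
# double-underscore convention, with no custom get/set_params code:
w.set_params(estimator__strategy="median")
print(w.get_params()["estimator__strategy"])  # median
```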