Next development steps and backwards-incompatible changes
alegonz opened this issue
I don't know how many people are using this library, but from now on I'll make an effort to post in advance, in this thread, any new features and changes that I plan to make to the API.
Please be aware that baikal is still a young project and it might be subject to backwards-incompatible changes. The major version (following semver) is still zero, meaning that any change might happen at any time. Currently there is no deprecation policy. I don't think there is a significant user base yet, so development will be rather liberal in introducing backwards-incompatible changes if they are required to make the API easier to use, handle important use cases, be less error-prone, etc. That said, I'll make an effort to keep backwards-incompatible changes to a minimum.
If you are using baikal (thank you!) I'd suggest doing the following:
- Pin your version in your `requirements.txt`, `setup.py`/`setup.cfg`'s `install_requires`, etc. You might want to pin using the `~=` operator to allow updates of patches (only bugfixes); see the example after this list.
- Give feedback! :)
- Subscribe to this issue (button is on the right), so you can be warned about any future changes.
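For example, a pinned requirement using `~=` might look like this (the version shown is only illustrative; use whichever release you are actually on):

```
# requirements.txt
baikal~=0.2.0  # accepts 0.2.x bugfix releases, but not 0.3.0
```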
Comments and discussions are of course welcome in this thread :)
(This thread was inspired by the one used by the trio project)
New features and changes planned for 0.3.0
1) Specify `function` and `trainable` arguments when calling the step on inputs, and rename `function` to `compute_func`.
This will be a backwards-incompatible change, necessary for the other two changes described below.
The idea is that instead of doing this:
```python
step = LogisticRegression(function="predict_proba", trainable=True)(x, y_t)
```
you would do
```python
step = LogisticRegression()(x, y_t, compute_func="predict_proba", trainable=True)
```
so that it is possible to call the same step (a shared step) with different behaviors on different inputs. So, for example, learned target transformations would be expressed as:
```python
x = Input()
y_t = Input()
scaler = StandardScaler()
y_t_transformed = scaler(y_t, compute_func="transform", trainable=True)
y_p_transformed = LinearRegression()(x, y_t_transformed)
y_p = scaler(y_p_transformed, compute_func="inverse_transform", trainable=False)  # reuse parameters fitted above
```
Both `compute_func` and `trainable` would be keyword-only arguments. This is to make client code more readable and to allow baikal to change the argument order in the future without breaking existing code.
The renaming of `function` to `compute_func` is for consistency with the future `fit_compute_func` argument described below.
2) Make steps shareable.
(See Issue #11 for the original discussion.)
The idea is that steps could be called an arbitrary number of times on different inputs with different behaviors at each call (e.g. trainable + transform function in the first call, non-trainable + inverse transform function in the second call).
The motivation is to allow reusing steps and their learned parameters on different inputs (similar to what Keras does with shared layers). Having shared steps is particularly important for reusing learned transformations on targets, as in the example above. It would also allow reusing steps like `Lambda` to apply the same computation (e.g. casting data types, dropping dimensions) to several inputs. Currently, calling a step with new inputs overrides the connectivity of the first call, so this is not possible yet. One could perhaps work around this limitation by having a step hold pointers to the parameters of an earlier step, but that might end up being unwieldy.
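For illustration, a shared `Lambda` step might look like the sketch below once steps are shareable. This is hypothetical code against the planned behavior, not the current API (it assumes the `Lambda` step accepts the function as its first argument; today the second call would override the first):

```python
import numpy as np
from baikal import Input
from baikal.steps import Lambda

x1 = Input()
x2 = Input()
# One Lambda step that drops a trailing singleton dimension.
squeeze = Lambda(lambda x: np.squeeze(x, axis=-1))
y1 = squeeze(x1)  # first call
y2 = squeeze(x2)  # second call: with shareable steps, both calls would coexist
```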
3) Add API support for `fit_transform` and `fit_predict`.
(See Issue #13 for the original discussion.)
The motivation is three-fold:
- Make custom fitting protocols possible, such as the common stacking protocol that uses out-of-fold predictions in the first level. (The current stacked classifier example is naive: it does not use OOF predictions, so the second-level classifier is prone to prefer an overfitted classifier from the first level.)
- Allow the use of transductive estimators (e.g. `sklearn.manifold.TSNE`, `sklearn.cluster.AgglomerativeClustering`).
- Leverage estimators that implement a `fit_transform` that is more efficient than calling `fit` and `transform` separately.
Currently the above is not possible because `Model.fit` runs each step's `fit` and `predict`/`transform` methods separately, making it impossible to control them jointly. To make this kind of training protocol possible, I plan to add a `fit_compute` API that gives you more control over the computation at fit time (*1). The idea is that, for example, in the case of a stacked classifier, you would define the method in the first-level steps like this:
```python
# Defined on a first-level step class (a baikal Step mixed with a
# scikit-learn estimator); assumes cross_val_predict is imported from
# sklearn.model_selection and that the step has a `cv` attribute.
def fit_compute(self, X, y, **fit_params):
    # 1) Train the step as usual, using the full data.
    #    This fits the parameters that will be used at inference time.
    super().fit(X, y, **fit_params)
    # 2) Compute cross-validated predictions. These will be passed
    #    to the classifier in the next level to be used as features.
    y_p_cv = cross_val_predict(self, X, y, cv=self.cv)
    return y_p_cv
```
`Model.fit` will then give precedence to this method when fitting the step. This should allow defining the stacked model once and fitting it with a single call to `model.fit`, without having to build and train the first and second stages separately.
Analogously to `compute_func`, a `fit_compute_func` argument will also be added to `Step.__call__` so client code can specify arbitrary methods.
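A call-time usage might then look like the following sketch (hypothetical: it reflects the proposed 0.3.0 signature, and `fit_compute_oof` is a made-up method name used only for illustration):

```python
x = Input()
y_t = Input()
# Select a custom fit-time method by name, analogously to compute_func:
y_p = LogisticRegression()(x, y_t, fit_compute_func="fit_compute_oof")
```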
`fit_transform` (transformers) and `fit_predict` (classifiers/regressors) are special cases of `fit_compute` and will be detected and used by `Model.fit` if the step implements either.
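In pseudocode, the precedence at fit time could look something like this (a sketch of the planned behavior, not actual baikal internals; `fit_predict` would be handled analogously to `fit_transform`):

```python
# Hypothetical sketch of the planned per-step precedence inside Model.fit:
if hasattr(step, "fit_compute"):
    output = step.fit_compute(X, y)    # custom fit-time protocol wins
elif hasattr(step, "fit_transform"):
    output = step.fit_transform(X, y)  # special case of fit_compute
else:
    step.fit(X, y)                     # default: fit, then compute separately
    output = step.transform(X)
```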
New features and changes planned for 0.4.0, 0.5.0 and later
- Make a custom `GridSearchCV` API, based on the original scikit-learn implementation, that can handle baikal models with multiple inputs and outputs natively.
- Add parallelization to `Model.fit` and `Model.predict` (using joblib's `Parallel` API).
- Add caching of intermediate results to `Model.fit` and `Model.predict` (using joblib's `Memory` API).
- Add support for steps that can take extra options in their predict method.
- Grow the merge steps module and add support for data structures other than numpy arrays (e.g. pandas dataframes). Some steps that could be added are:
  - Single array aggregation (sum, average, maximum, minimum, etc.).
  - Element-wise aggregation of multiple arrays.