Using baikal steps for applying transformations without Model
jrderuiter opened this issue · 5 comments
In some of our projects, we have ETL/preprocessing pipelines that take multiple inputs and produce a single output dataset. In some current implementations we've been using the scikit-learn transformer/pipeline API to transform individual datasets before then combining them with a join/merge and applying some (optional) postprocessing on the merged dataset using another sklearn pipeline.
A drawback of this approach is that we have to intersperse our transformer steps with merges, which don't fit in the sklearn pipeline API. Baikal would seem like a nice approach for defining (non-linear) transformer pipelines that take multiple inputs, but it doesn't seem as if you can use baikal for only performing transformations (e.g. .transform(..) in the sklearn API).
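Roughly, the current pattern looks like this (a simplified, made-up example):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df_a = pd.DataFrame({"id": [1, 2, 3], "x": [1.0, 2.0, 3.0]}).set_index("id")
df_b = pd.DataFrame({"id": [1, 2, 3], "y": [4.0, 5.0, 6.0]}).set_index("id")

# Per-dataset preprocessing with plain sklearn pipelines
prep_a = Pipeline([("scale", StandardScaler())])
prep_b = Pipeline([("scale", StandardScaler())])

a_t = pd.DataFrame(prep_a.fit_transform(df_a), index=df_a.index, columns=df_a.columns)
b_t = pd.DataFrame(prep_b.fit_transform(df_b), index=df_b.index, columns=df_b.columns)

# The join/merge has to happen outside the pipeline API
merged = a_t.join(b_t)

# Optional postprocessing on the merged dataset with another pipeline
post = Pipeline([("scale", StandardScaler())])
result = post.fit_transform(merged)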
Am I missing something in the API, or would this be something that might be interesting to include for future development?
Hi there.
baikal can handle not only transformations but also predictions, so you can make non-linear pipelines combining both (the example in the README shows a pipeline that does that). By default, baikal will detect and use either predict or transform (if the class implements either), but you can specify any function you like via the function argument when instantiating the step. For example:
from baikal import Input, Model, make_step

# Assume you have a class _MyClass that implements
# some_method that does some interesting computation
class _MyClass:
    def __init__(self):
        ...

    def some_method(self, X):
        y = X  # placeholder: calculate y from X
        return y

# Make the step from _MyClass
MyClass = make_step(_MyClass)

x = Input()
y = MyClass(function="some_method", name="myclass")(x)
model = Model(x, y)
# When doing model.predict, the myclass step will apply some_method on x
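And a quick usage sketch on top of that (the input data is made up; some_method above is just a placeholder):

import numpy as np

X_data = np.arange(6).reshape(3, 2)

# The myclass step applies some_method to the data fed to x,
# so this returns some_method(X_data)
y_data = model.predict(X_data)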
I wrote the example above based on the API of 0.2.0. The upcoming 0.3.0 version, which I'm planning to release soon, will introduce a backwards-incompatible API, but it will let you reuse steps on different inputs and specify a different function in each case. This is useful, for example, for applying transformations further down the pipeline that were learned earlier in it (see the transformed_target example in the master branch). I give more details about 0.3.0 in Issue #16.
Thanks for the example! That probably does do what I want then, but the call to model.predict seems a bit contrived if I'm only using baikal to do transformations. Maybe a pipeline.transform method would seem a bit more natural?
Yes, that's a valid point. It is weird to call predict on a model that is composed entirely of transformer steps. But if transform were implemented, you would have the opposite problem: how should that method behave for models that have both transformers and predictors? When I defined the API I picked predict because 1) it seemed the least weird, and 2) it is similar to sklearn's Pipeline (which does not have a pipeline.transform either) and to Keras' Model.predict, so people would be more familiar with it.
I guess that you want to compose several transformers into models that are further composed into bigger transformer models, so having Model.transform would be convenient and more readable. In that case you could subclass Model to add the behavior specific to your application:
import baikal

# Written on 0.2.0. In 0.3.0 this would be written slightly differently.
class TransformerModel(baikal.Model):
    def transform(self, X, outputs=None):
        # Or you could also override `Model._build` and add this check there
        if not all(step.function == step.transform for step in self.graph):
            raise RuntimeError("All steps must be transformers")
        return self.predict(X, outputs=outputs)
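Usage might then look something like this (a rough sketch: I'm wrapping an sklearn transformer with make_step as above, and assuming the all-transformers check passes for the steps in self.graph):

import numpy as np
from sklearn.preprocessing import StandardScaler
from baikal import Input, make_step

X_data = np.arange(6, dtype=float).reshape(3, 2)

Scaler = make_step(StandardScaler)

x = Input()
y = Scaler(name="scaler")(x)

model = TransformerModel(x, y)
model.fit(X_data)             # learn the scaling parameters
Xt = model.transform(X_data)  # reads more naturally than model.predict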
Hmm, I didn't realise that the sklearn Pipeline also doesn't have a transform, good point. It does have a fit_transform, though.
Closing due to inactivity. If you have any other questions feel free to re-open.