Using baikal steps for applying transformations without Model
jrderuiter opened this issue · 5 comments
In some of our projects, we have ETL/preprocessing pipelines that take multiple inputs and produce a single output dataset. In some current implementations we've been using the scikit-learn transformer/pipeline API to transform individual datasets before then combining them with a join/merge and applying some (optional) postprocessing on the merged dataset using another sklearn pipeline.
A drawback of this approach is that we have to intersperse our transformer steps with merges, which don't fit in the sklearn pipeline API. Baikal would seem like a nice approach for defining (non-linear) transformer pipelines that take multiple inputs, but it doesn't seem as if you can use baikal for only performing transformations (e.g. .transform(..) in the sklearn API).
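Roughly, the current pattern looks like this (a simplified, made-up example):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df_a = pd.DataFrame({"id": [1, 2, 3], "x": [1.0, 2.0, 3.0]}).set_index("id")
df_b = pd.DataFrame({"id": [1, 2, 3], "y": [4.0, 5.0, 6.0]}).set_index("id")

# Per-dataset preprocessing with plain sklearn pipelines
prep_a = Pipeline([("scale", StandardScaler())])
prep_b = Pipeline([("scale", StandardScaler())])

a_t = pd.DataFrame(prep_a.fit_transform(df_a), index=df_a.index, columns=df_a.columns)
b_t = pd.DataFrame(prep_b.fit_transform(df_b), index=df_b.index, columns=df_b.columns)

# The join/merge has to happen outside the pipeline API
merged = a_t.join(b_t)

# Optional postprocessing on the merged dataset with another pipeline
post = Pipeline([("scale", StandardScaler())])
result = post.fit_transform(merged)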
Am I missing something in the API, or would this be something that might be interesting to include for future development?
Hi there.
baikal can handle not only transformations but also predictions, so you can make non-linear pipelines combining both (the example in the README shows a pipeline that does that). By default, baikal will detect and use either predict or transform (if the class implements either), but you can specify any function you like via the function argument when instantiating the step. For example:
from baikal import Input, Model, make_step

# Assume you have a class _MyClass that implements
# some_method that does some interesting computation
class _MyClass:
    def __init__(self):
        ...

    def some_method(self, X):
        y = X  # placeholder: calculate y from X
        return y

# Make the step from _MyClass
MyClass = make_step(_MyClass)

x = Input()
y = MyClass(function="some_method", name="myclass")(x)
model = Model(x, y)
# When doing model.predict, the myclass step will apply some_method on x
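And a quick usage sketch on top of that (the input data is made up; some_method above is just a placeholder):

import numpy as np

X_data = np.arange(6).reshape(3, 2)

# The myclass step applies some_method to the data fed to x,
# so this returns some_method(X_data)
y_data = model.predict(X_data)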
I wrote the example above based on the API of 0.2.0. The upcoming 0.3.0 version, which I'm planning to release soon, will introduce a backwards-incompatible API, but it will let you reuse steps on different inputs and specify a different function in each case. This is useful, for example, for applying transformations further down the pipeline that were learned earlier in it (see the transformed_target example in the master branch). I give more details about 0.3.0 in Issue #16.
Thanks for the example! That probably does do what I want then, but the call to model.predict seems a bit contrived if I'm only using baikal to do transformations. Maybe a pipeline.transform method would seem a bit more natural?
Yes, that's a valid point. It is weird to call predict on a model that is composed entirely of transformer steps. But if transform were implemented, you would have the opposite problem: how should that method behave for models that have both transformers and predictors? When I defined the API I picked predict because 1) it seemed the least weird, and 2) it is similar to sklearn's Pipeline (which does not have a pipeline.transform either) and to Keras' Model.predict, so people would be more familiar with it.
I guess that you want to compose several transformers into models that are further composed into bigger transformer models, so having Model.transform would be convenient and more readable. In that case you could subclass Model to add the behavior specific to your application:
import baikal

# Written on 0.2.0. In 0.3.0 this would be written slightly differently.
class TransformerModel(baikal.Model):
    def transform(self, X, outputs=None):
        # Or you could also override `Model._build` and add this check there
        if not all(step.function == step.transform for step in self.graph):
            raise RuntimeError("All steps must be transformers")
        return self.predict(X, outputs=outputs)
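Usage might then look something like this (a rough sketch: I'm wrapping an sklearn transformer with make_step as above, and assuming the all-transformers check passes for the steps in self.graph):

import numpy as np
from sklearn.preprocessing import StandardScaler
from baikal import Input, make_step

X_data = np.arange(6, dtype=float).reshape(3, 2)

Scaler = make_step(StandardScaler)

x = Input()
y = Scaler(name="scaler")(x)

model = TransformerModel(x, y)
model.fit(X_data)             # learn the scaling parameters
Xt = model.transform(X_data)  # reads more naturally than model.predict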
Hmm, I didn't realise that the sklearn Pipeline also doesn't have a transform, good point. It does have a fit_transform, though.
Closing due to inactivity. If you have any other questions feel free to re-open.