
Using baikal steps for applying transformations without Model #23

Closed

jrderuiter opened this issue Jan 15, 2020 · 5 comments

Comments

@jrderuiter

In some of our projects we have ETL/preprocessing pipelines that take multiple inputs and produce a single output dataset. In our current implementations we use the scikit-learn transformer/pipeline API to transform the individual datasets, then combine them with a join/merge and apply some (optional) postprocessing to the merged dataset using another sklearn pipeline.

A drawback of this approach is that we have to intersperse our transformer steps with merges, which don't fit into the sklearn pipeline API. Baikal seems like a nice way to define (non-linear) transformer pipelines that take multiple inputs, but it doesn't look like you can use baikal for only performing transformations (e.g. .transform(..) in the sklearn API).
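
To make that concrete, here is a simplified sketch of what we currently do (the dataframes, transformers and merge are made up for illustration):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy input datasets standing in for our real inputs
df_a = pd.DataFrame({"a1": [1.0, 2.0, 3.0]})
df_b = pd.DataFrame({"b1": [4.0, 5.0, 6.0]})

# Per-dataset preprocessing with plain sklearn pipelines
pipeline_a = Pipeline([("scale", StandardScaler())])
pipeline_b = Pipeline([("scale", StandardScaler())])

df_a_t = pd.DataFrame(pipeline_a.fit_transform(df_a), index=df_a.index, columns=df_a.columns)
df_b_t = pd.DataFrame(pipeline_b.fit_transform(df_b), index=df_b.index, columns=df_b.columns)

# The merge has to happen outside the sklearn pipeline API
merged = df_a_t.join(df_b_t, how="inner")

# Optional postprocessing on the merged dataset with yet another pipeline
post_pipeline = Pipeline([("scale", StandardScaler())])
result = post_pipeline.fit_transform(merged)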

Am I missing something in the API, or would this be something that might be interesting to include for future development?

@alegonz
Owner

alegonz commented Jan 16, 2020

Hi there.

baikal can handle not only transformations but also predictions, so you can build non-linear pipelines that combine both (the example in the README shows a pipeline that does exactly that). By default, baikal will detect and use either predict or transform (whichever the class implements), but you can specify any function you like via the function argument when instantiating the step. For example:

from baikal import Input, Model, make_step

# Assume you have a class _MyClass that implements
# some_method, which does some interesting computation

class _MyClass:
    def __init__(self, some_param=1):
        self.some_param = some_param

    def fit(self, X, y=None):
        # no-op fit so the step can participate in model.fit
        return self

    def some_method(self, X):
        # calculate y from X
        return X * self.some_param

# Make the step class from _MyClass
MyClass = make_step(_MyClass)

x = Input()
y = MyClass(function="some_method", name="myclass")(x)
model = Model(x, y)

# When calling model.predict, the myclass step will apply some_method to x
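
Used on some data, that looks like this (X_data below is just a placeholder array):

import numpy as np

X_data = np.array([[1.0], [2.0], [3.0]])
model.fit(X_data)               # fit is a no-op for this step
y_pred = model.predict(X_data)  # applies some_method to X_data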

I wrote the example above against the 0.2.0 API. The upcoming 0.3.0 version, which I'm planning to release soon, introduces a backwards-incompatible API, but it will let you reuse steps on different inputs and specify a different function in each case. This is useful, for example, for applying further down the pipeline a transformation that was learned earlier in the pipeline (see the transformed_target example in the master branch); a rough sketch of that pattern is below. I give more details about 0.3.0 in Issue #16.
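
Roughly, the pattern looks like this (a sketch only; the call-time compute_func and trainable arguments are how the 0.3.0 API is shaping up, so the exact names may still change):

from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import QuantileTransformer
from baikal import Input, Model, make_step

QuantileTransformerStep = make_step(QuantileTransformer)
RidgeCVStep = make_step(RidgeCV)

x = Input()
y_t = Input()

transformer = QuantileTransformerStep(n_quantiles=10, output_distribution="normal")

# Learn the target transformation and train the regressor on the transformed target
y_t_trans = transformer(y_t)
y_p_trans = RidgeCVStep()(x, y_t_trans)

# Reuse the same transformer step further down the pipeline,
# this time computing the inverse transform and without refitting it
y_p = transformer(y_p_trans, compute_func="inverse_transform", trainable=False)

model = Model(x, y_p, y_t)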

@jrderuiter
Author

Thanks for the example! That probably does what I want, but calling model.predict feels a bit contrived if I'm only using baikal for transformations. Maybe a pipeline.transform method would be more natural?

@alegonz
Owner

alegonz commented Jan 19, 2020

Yes, that's a valid point. It is weird to call predict on a model that is composed entirely of transformer steps. But if transform were implemented, you would have the opposite problem: how should that method behave for models that contain both transformers and predictors? When I defined the API I picked predict because 1) it seemed the least weird, and 2) it is similar to sklearn's Pipeline (which does not have a pipeline.transform either) and to Keras' Model.predict, so people would be more familiar with it.

I guess that you want to compose several transformers into models that are further composed into bigger transformer models, so having Model.transform would be convenient and more readable. In that case you could subclass Model to add the behavior specific to your application:

# Written against 0.2.0. In 0.3.0 this would be written slightly differently.
import baikal


class TransformerModel(baikal.Model):
    def transform(self, X, outputs=None):
        # Alternatively, override `Model._build` and add this check there
        if not all(step.function == step.transform for step in self.graph):
            raise RuntimeError("All steps must be transformers")

        return self.predict(X, outputs=outputs)
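
Usage would then look roughly like this. The scaler steps and the Concatenate merge are just stand-ins for whatever you build with make_step, and whether the transformer check above passes as written for the Input and merge steps is something you would have to verify, so treat this purely as a sketch:

import numpy as np
from sklearn.preprocessing import StandardScaler
from baikal import Input, make_step
from baikal.steps import Concatenate

Scaler = make_step(StandardScaler)

x1 = Input()
x2 = Input()

x1_t = Scaler(name="scale1")(x1)
x2_t = Scaler(name="scale2")(x2)
merged = Concatenate(name="merge")([x1_t, x2_t])  # stand-in for however you merge the datasets

model = TransformerModel([x1, x2], merged)

X1_data = np.array([[1.0], [2.0], [3.0]])
X2_data = np.array([[4.0], [5.0], [6.0]])

model.fit([X1_data, X2_data])
out = model.transform([X1_data, X2_data])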

@jrderuiter
Author

Hmm, I didn't realise that the Sklearn pipeline also doesn't have a transform, good point. It does have a fit_transform though.

@alegonz
Owner

alegonz commented Nov 15, 2020

Closing due to inactivity. If you have any other questions feel free to re-open.

@alegonz alegonz closed this as completed Nov 15, 2020