alteryx/predict-remaining-useful-life

How can I integrate Featuretools into my ML process without data leakage?


I am exploring the possibility of integrating Featuretools into my pipeline so that I can create new features from my DataFrame.

Currently I am using GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, such as STD(column), I feel it is susceptible to data leakage. In your FAQ, you give an example approach for tackling this, but it is not suitable for the Pipeline structure I am using.

Idea 0: I would love to integrate it directly into my Pipeline, but it does not seem to be compatible with Pipelines. Ideally, it would construct features from each fold's training data and then use them to transform that fold's test data, K times. At the end, during the refit=True stage of GridSearchCV, it would construct the features from the whole data set. If you have an example showing that this is possible, I would very much welcome it.

Idea 1: I can switch to a manual CV loop instead of one embedded in a Pipeline. Inside it, I can use the training data to construct the new features and then transform the test data with them. This runs K times. At the end, all of the data can be used to construct the final model.

This is the safest option, at the cost of extra time and complexity.
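
To make Idea 1 concrete, here is a minimal sketch of what I have in mind, written in plain Python rather than with Featuretools. `GroupStdFeature`, `k_fold_indices`, `cross_validated_features`, and the toy `rows` are hypothetical names I made up for illustration. The class only mimics a single STD(column)-style aggregation and is not Featuretools' API.

```python
import statistics
from collections import defaultdict


class GroupStdFeature:
    """Hypothetical stand-in for one aggregation feature, e.g. STD(value) per group."""

    def fit(self, rows):
        values = defaultdict(list)
        for row in rows:
            values[row["group"]].append(row["value"])
        # Population standard deviation per group, learned from the given rows only.
        self.std_by_group_ = {g: statistics.pstdev(v) for g, v in values.items()}
        # Fallback for groups that never appeared during fit.
        self.default_ = statistics.pstdev([r["value"] for r in rows])
        return self

    def transform(self, rows):
        # Attach the fitted statistic; nothing is recomputed from these rows.
        return [
            {**row, "group_std": self.std_by_group_.get(row["group"], self.default_)}
            for row in rows
        ]


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_validated_features(rows, k):
    """Fit the feature on each fold's training rows, then transform both splits."""
    folds = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        train = [rows[i] for i in train_idx]
        test = [rows[i] for i in test_idx]
        feature = GroupStdFeature().fit(train)  # test rows are never seen here
        folds.append((feature.transform(train), feature.transform(test)))
    return folds


# Toy data, invented for this sketch.
rows = [
    {"group": "a", "value": 1.0},
    {"group": "b", "value": 10.0},
    {"group": "a", "value": 3.0},
    {"group": "b", "value": 14.0},
    {"group": "a", "value": 100.0},
    {"group": "b", "value": 12.0},
]

folds = cross_validated_features(rows, k=3)

# After model selection, refit the feature on all rows for the final model.
final_feature = GroupStdFeature().fit(rows)
```

In each fold the test rows receive statistics computed from the training rows alone, so their own values never feed into the feature. A model would be trained and scored inside the same loop. That part is omitted here.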

Idea 2: Use it on the whole data set and ignore the possibility of leakage. I am not in favor of this, of course. But when I look at your GitHub page, all the examples combine the train and test data, create the features from the whole data set, and only then split into train and test for modeling.

For example:
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb

If you, as the developers of the project, consider that acceptable, I could give the whole-data approach a chance. But don't you think there is a leakage risk with the approach used in these Taxi Trip Duration examples?

What do you think? I would love to hear your intuition on Featuretools.