SoftwareAG/nyoka

Losing input features

1samwatkins3005 opened this issue · 3 comments

When using a Pipeline with DataFrameMapper transformations, input features are lost. In the example below, 'car name' is vectorized, but all other variables are omitted from the model thereafter. How should one apply a transformation such as TfidfVectorizer and still keep all the other input features?

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv('auto-mpg.csv')
X = df.drop(['mpg'],axis=1)
y = df['mpg']

features = [name for name in df.columns if name != 'mpg']
target = 'mpg'

pipeline_obj = Pipeline([
    ('mapper', DataFrameMapper([
        ('car name', TfidfVectorizer())
    ])),
    ('model', DecisionTreeRegressor())
])

pipeline_obj.fit(X,y)

Hi @1samwatkins3005, when a DataFrameMapper is used, only the features listed inside it are forwarded to the model. In your example it therefore filters out everything except 'car name'.

To keep all the input features, list the remaining feature names inside the DataFrameMapper as well, with None in place of a transformer. See the code below:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv('auto-mpg.csv')
X = df.drop(['mpg'],axis=1)
y = df['mpg']

features = [name for name in df.columns if name != 'mpg']
# Every feature except the text column is passed through untransformed
normal_features = [name for name in features if name != "car name"] # <----
target = 'mpg'

pipeline_obj = Pipeline([
    ('mapper', DataFrameMapper([
        ('car name', TfidfVectorizer()),
        (normal_features, None)  # <----
    ])),
    ('model', DecisionTreeRegressor())
])

pipeline_obj.fit(X,y)

I hope this answers your question.

This has worked perfectly, thank you.

Separate issue: do you have any plans to include CatBoost in your package? https://catboost.ai/

Cheers!

CatBoost is not in our pipeline at the moment, but we plan to implement it in the near future. Thanks!