Request: skipping without error when there are no variables to transform

Question

david-cortes opened this issue 2 years ago · 4 comments

Transformers in this package have the nice functionality of applying automatically to all numerical or all categorical variables, depending on what the transformer does, when the list of variable names is not supplied.

Sometimes one wants to perform automated feature selection as a step before or after some transformer. If, for example, the pipeline contains a transformer like MatchCategories and the selector drops all categorical variables, there will be an error later on in the pipeline, as there won't be any columns for the transformer.

It would be nice to have an option to turn off the error on empty variable lists.

Answer 1 · 2023-01-20T19:24:52.000Z

@glevv what do you think about this?

Answer 2 · 2023-01-21T08:57:48.000Z

If I understand it correctly, this will only work for estimators that can transform without calling fit first, which is incompatible with sklearn notation.

Answer 3 · 2023-01-21T11:59:23.000Z

> If I understand it correctly, this will only work for estimators that can transform without calling fit first, which is incompatible with sklearn notation.

Not really. When there are no columns, a call to fit just needs to return the same object, and a call to transform just needs to return the same data that was passed as input.
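As a rough illustration of that pass-through behaviour, here is a minimal sketch in plain Python. Everything in it is hypothetical and not part of Feature-engine: `SkipIfNoVariables`, `ToyEncoder` and `find_categorical` are made-up names, and a dict of lists stands in for a DataFrame so the example runs without any dependencies.

```python
def find_categorical(X):
    """Hypothetical helper: names of columns whose values are all strings."""
    return [name for name, values in X.items()
            if all(isinstance(v, str) for v in values)]


class ToyEncoder:
    """Stand-in for an encoder that fails when it finds no categorical columns."""

    def fit(self, X, y=None):
        self.variables_ = find_categorical(X)
        if not self.variables_:
            raise ValueError("No categorical variables found.")
        return self

    def transform(self, X):
        # Upper-case the categorical columns, leave the others untouched.
        return {
            name: ([v.upper() for v in values] if name in self.variables_ else values)
            for name, values in X.items()
        }


class SkipIfNoVariables:
    """Hypothetical wrapper: delegate only when there is something to transform."""

    def __init__(self, transformer, find_variables):
        self.transformer = transformer
        self.find_variables = find_variables

    def fit(self, X, y=None):
        self.variables_ = list(self.find_variables(X))
        if self.variables_:
            self.transformer.fit(X, y)
        return self  # fit always returns the same object

    def transform(self, X):
        if not self.variables_:
            return X  # nothing to do: return the input unchanged
        return self.transformer.transform(X)


numeric_only = {"age": [20, 35], "income": [1.5, 2.0]}
mixed = {"age": [20, 35], "city": ["rome", "oslo"]}

result_numeric = SkipIfNoVariables(ToyEncoder(), find_categorical).fit(numeric_only).transform(numeric_only)
result_mixed = SkipIfNoVariables(ToyEncoder(), find_categorical).fit(mixed).transform(mixed)
```

Note that fit is still called before transform in both cases, so the usual fit-then-transform order is preserved; the wrapper only changes what happens when the variable list turns out to be empty.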

Answer 4 · 2023-01-21T12:42:04.000Z

At the moment, if encoders, for example, find that the dataset has no categorical variables, they raise an error and do not perform the encoding. If you set ignore_format=True, they will also encode numerical variables, but this is not what @david-cortes wants.

Numerical transformers will also raise an error and fail if they find no numerical variable in the dataset.

This was done intentionally. My idea when designing these transformers was to stop users from inadvertently applying encoding methods to numerical variables, or numerical transformations to categorical variables.

As a clear example, if you set the strategy of SimpleImputer() to "most_frequent", the transformer will impute both numerical and categorical variables with the mode. This method is actually suitable for categorical variables, whereas numerical variables should be imputed with the mean or the median. This is the type of behaviour that Feature-engine is designed to prevent.
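To make the SimpleImputer point concrete, here is a small sketch using scikit-learn on a made-up object array with one numerical and one categorical column. The data are invented for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up): column 0 is numerical, column 1 is categorical.
X = np.array(
    [
        [20.0, "Rome"],
        [20.0, np.nan],
        [np.nan, "Rome"],
        [35.0, "Oslo"],
    ],
    dtype=object,
)

imputer = SimpleImputer(strategy="most_frequent")
X_imputed = imputer.fit_transform(X)

# Both columns are filled with their mode, including the numerical one.
filled_number = X_imputed[2, 0]
filled_category = X_imputed[1, 1]
```

The imputer raises no error or warning here: the missing numerical value is silently replaced by the most frequent value rather than by the mean or median.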

Hence, if a categorical encoder encounters no categorical variable in the dataset, it will fail, because it does not have a suitable input for the transformation.

@david-cortes is asking that, instead of failing, they just pass. That is, if no categorical variable is found in the dataset, instead of failing, just carry out fit and transform without modifying the dataset.

My concern is that most users will not go into the source code, and some don't even read the documentation. So, if we allow the transformers to pass and do nothing, users might believe that the transformer worked, whatever that means. If we raise an error instead, we encourage them to think about what might be going on.

@david-cortes is not the first one to request this. Someone else requested the same for selectors. See #566 and, somewhat related but not quite the same, #567.