Request: skipping without error when there are no variables to transform
david-cortes opened this issue · 4 comments
Transformers in this package have the nice functionality of automatically applying to all variables that are either numerical or categorical, depending on what the transformer does, if the list of variable names is not supplied.
Sometimes, one wants to perform automated feature selection as steps before or after some transformer, in which case if, for example, one has a transformer like MatchCategories and the selector drops all categorical variables, there will be an error later on in the pipeline as there won't be any columns for the transformer.
Would be nice if there could be an option to toggle off erroring on empty variable lists.
If I understand it correctly, this will only work for estimators that can transform without calling fit first, which is incompatible with sklearn notation.
Not really, since in a case in which there are no columns, a call to fit just needs to return the same object and a call to transform just needs to return the same data that is passed as input.
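The no-op behaviour described above can be sketched as a wrapper around an existing transformer. This is only an illustration, not part of Feature-engine: SkipIfNoColumns and StrictUpperCaser are hypothetical names, and treating every non-numeric column as categorical is a simplifying assumption.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class StrictUpperCaser(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a categorical transformer that fails on empty input."""

    def fit(self, X, y=None):
        # Assumption: any non-numeric column counts as categorical.
        self.variables_ = list(X.select_dtypes(exclude="number").columns)
        if not self.variables_:
            raise ValueError("No categorical variables found.")
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.variables_:
            X[col] = X[col].str.upper()
        return X


class SkipIfNoColumns(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: becomes a no-op when there is nothing to transform."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Record whether the wrapped transformer has any columns to work on.
        self.skipped_ = X.select_dtypes(exclude="number").shape[1] == 0
        if not self.skipped_:
            self.transformer_ = clone(self.transformer).fit(X, y)
        return self  # fit always returns the estimator itself

    def transform(self, X):
        if self.skipped_:
            return X  # nothing to do: hand the input back unchanged
        return self.transformer_.transform(X)


# Only numerical columns: the wrapped transformer would raise, the wrapper passes.
X_num = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
step = SkipIfNoColumns(StrictUpperCaser())
out = step.fit(X_num).transform(X_num)
```

Because fit is still called before transform, the wrapper stays within the usual sklearn fit/transform contract; the skipped_ attribute also leaves a trace that the step did nothing.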
At the moment, if, for example, encoders find that the dataset has no categorical variables, they will raise an error, fail and not perform the encoding. If you set ignore_format=True, they will also encode numerical variables, but this is not what @david-cortes wants.
Numerical transformers will also raise an error and fail if they find no numerical variable in the dataset.
This was done intentionally. My idea when designing these transformers was to stop users from inadvertently applying encoding methods to numerical variables, and numerical transformations to categorical variables.
As a clear example, if you set the strategy of SimpleImputer() to "most_frequent", the transformer will impute both numerical and categorical variables with the mode. However, this method is really only suitable for categorical variables; numerical variables should be imputed with the mean or the median. This is the type of behaviour that Feature-engine is designed to prevent.
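A small sketch of the SimpleImputer behaviour mentioned above, using scikit-learn and a made-up two-column DataFrame (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data: one numerical and one categorical column, each with a missing value.
df = pd.DataFrame({
    "age": [20.0, 20.0, 50.0, 70.0, np.nan],
    "city": ["Rome", "Rome", "Oslo", "Lima", np.nan],
})

# "most_frequent" is applied to every column, regardless of its type.
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# The missing age is filled with the mode of the column, not its mean or median.
print(filled.tail(1))
```

Here the mode of age (20) differs from both its mean (40) and its median (35), so the numerical column is silently filled with a value the user may not have intended.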
Hence, if a categorical encoder encounters no categorical variable in the dataset, it will fail, because it does not have a suitable input for the transformation.
@david-cortes is asking that, instead of failing, they just pass. That is, if no categorical variable is found in the dataset, instead of failing, just carry out fit and transform without modifying the dataset.
My concern is that most users will not go into the source code, and some don't even read the documentation. So, if we allow the transformers to pass and do nothing, users might believe that the transformer worked, whatever that means. Whereas if we raise an error, we are encouraging them to think about what might be going on.
@david-cortes is not the first one to request this. Someone else requested that for selectors. See #566 and a little related but not quite #567