feature-engine/feature_engine

Correlation-based feature selection is column-order dependent

Closed this issue · 5 comments

Correlation-based feature selection uses a for-loop that iterates over the features, checking whether or not each one is collinear with another feature (see, for example, line 183 of selection.drop_correlated_features.py). In my view this is problematic, since it makes the outcome of the selection depend on the column order.

As an example, consider the following feature set:

X = pd.DataFrame()
X['a'] = [1, -1, 0, 0, 0, 0]
X['b'] = [0, 0, 1, -1, 1, -1]
X['c'] = [0, 0, 1, -1, 0, 0]
X['d'] = [0, 0, 0, 0, 1, -1]

These features are constructed so that, given a correlation threshold of 0.7,
a is uncorrelated with the rest, b is collinear with both c and d, and c and d are each collinear only with b.

If we run

DropCorrelatedFeatures(threshold = 0.7).fit(X)

the features ['a', 'b'] will be selected. On the other hand, if we run

DropCorrelatedFeatures(threshold = 0.7).fit(X[['a', 'c', 'b', 'd']])

the features ['a', 'c', 'd'] will be selected instead.
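For what it's worth, the order dependence can be reproduced without feature-engine by re-implementing the greedy scan in plain pandas (a sketch for illustration, not the library's actual code; the real loop in drop_correlated_features.py differs in detail):

```python
import pandas as pd

def greedy_drop_correlated(X, threshold=0.7):
    # Illustrative greedy scan: walk the columns in order and keep a
    # feature only if it is not correlated above the threshold with
    # any already-kept feature.
    corr = X.corr().abs()
    kept = []
    for col in X.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept

X = pd.DataFrame({
    'a': [1, -1, 0, 0, 0, 0],
    'b': [0, 0, 1, -1, 1, -1],
    'c': [0, 0, 1, -1, 0, 0],
    'd': [0, 0, 0, 0, 1, -1],
})

print(greedy_drop_correlated(X))                        # ['a', 'b']
print(greedy_drop_correlated(X[['a', 'c', 'b', 'd']]))  # ['a', 'c', 'd']
```

Whichever feature the loop reaches first "claims" its correlated group, so permuting the columns changes which features survive.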

Similarly, SmartCorrelatedSelection() is also not invariant under such reordering of the columns of X.

Ideally, the selected feature set would satisfy (at least) the following two constraints, in addition to being independent of the column order:

  1. The set of features selected contains no features collinear to one another (by definition)
  2. Given such a set of features, there is no feature that can be added to the set without introducing collinearity between two features

As a first attempt, I tried to construct the full set of sets that satisfy these two constraints. Unfortunately, this is computationally intensive and quickly becomes impractical for datasets with > 100 features.
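For small feature sets, the two constraints can be checked by brute force. A sketch of the kind of enumeration I have in mind (illustrative only; it is exponential in the number of features, which is exactly why it becomes impractical):

```python
from itertools import combinations
import pandas as pd

def all_valid_subsets(X, threshold=0.7):
    # Enumerate every subset with no collinear pair (constraint 1),
    # then keep only the maximal ones (constraint 2).
    # Exponential in the number of columns: only viable for tiny X.
    corr = X.corr().abs()
    cols = list(X.columns)

    def no_collinear_pair(subset):
        return all(corr.loc[i, j] <= threshold
                   for i, j in combinations(subset, 2))

    valid = [set(s)
             for r in range(1, len(cols) + 1)
             for s in combinations(cols, r)
             if no_collinear_pair(s)]
    # maximal = no valid proper superset exists
    return [s for s in valid if not any(s < t for t in valid)]

X = pd.DataFrame({
    'a': [1, -1, 0, 0, 0, 0],
    'b': [0, 0, 1, -1, 1, -1],
    'c': [0, 0, 1, -1, 0, 0],
    'd': [0, 0, 0, 0, 1, -1],
})

print([sorted(s) for s in all_valid_subsets(X)])  # [['a', 'b'], ['a', 'c', 'd']]
```

On the example above this returns exactly the two sets that the greedy scan can produce, depending on column order.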

I am not aware of a faster algorithm that is guaranteed to return a feature set satisfying both constraints. A minimal improvement, however, would be to remove features based on a rank-ordering of the features rather than on column order. Rankings I have in mind are: the number of features a feature is collinear with, information value, correlation with the target, or (as in SmartCorrelatedSelection()) an estimator-based score. The rank-ordering could be an input parameter with a sensible default.
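A sketch of the rank-ordering idea, using "number of collinear partners" as an illustrative default ranking (the function name and the ranking choice are hypothetical, not an existing feature-engine API):

```python
import pandas as pd

def rank_ordered_selection(X, threshold=0.7):
    # Hypothetical sketch: rank features by how many others they are
    # collinear with (fewest first), then run the greedy scan in that
    # order instead of column order.
    corr = X.corr().abs()
    n_collinear = (corr > threshold).sum() - 1  # subtract self-correlation
    order = n_collinear.sort_values(kind='mergesort').index  # stable sort
    kept = []
    for col in order:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return sorted(kept)  # canonical output order

X = pd.DataFrame({
    'a': [1, -1, 0, 0, 0, 0],
    'b': [0, 0, 1, -1, 1, -1],
    'c': [0, 0, 1, -1, 0, 0],
    'd': [0, 0, 0, 0, 1, -1],
})

# Same selection for any column permutation of this example:
print(rank_ordered_selection(X))                        # ['a', 'c', 'd']
print(rank_ordered_selection(X[['d', 'c', 'b', 'a']]))  # ['a', 'c', 'd']
```

Note that ties in the ranking still fall back to column order, so a deterministic tie-breaker (e.g. alphabetical) would be needed for full order-invariance in general.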

I would be happy to further work on this issue once methodology is clear, if this is of interest to others.

Hi @dlaprins

Thank you for raising the issue. I am aware of the problem. It was also raised in similar issues before by @FedericoMontana and @WhyYouNeedMyUsername

#570
#327

The thing is, I am still unsure about how to resolve this issue, in a way that is computationally possible.

I am open to suggestions / ideas.

glevv commented

To make this estimator consistent, we can sort the columns before any transformation: for example alphabetically, by some measure of variability (standard deviation, number of unique values), or by another metric. It won't fix the underlying combinatorial explosion, but it will make the result consistent.

Sounds good! Thank you!

We can give users the option to choose how to sort:

  • alphabetically
  • std
  • number of NAs
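A hypothetical helper sketching these sort options (the name, parameter, and option strings are illustrative; the actual API would be decided in the PR):

```python
import pandas as pd

def canonical_order(X, by='alphabetical'):
    # Hypothetical helper: put the columns in a caller-independent
    # order before the correlation scan runs.
    if by == 'alphabetical':
        cols = sorted(X.columns)
    elif by == 'std':
        cols = X.std().sort_values(ascending=False, kind='mergesort').index
    elif by == 'n_missing':
        cols = X.isna().sum().sort_values(kind='mergesort').index
    else:
        raise ValueError(f'unknown sort option: {by}')
    return X[cols]

X = pd.DataFrame({
    'a': [1, -1, 0, 0, 0, 0],
    'b': [0, 0, 1, -1, 1, -1],
    'c': [0, 0, 1, -1, 0, 0],
    'd': [0, 0, 0, 0, 1, -1],
})

print(canonical_order(X[['c', 'a', 'd', 'b']]).columns.tolist())  # ['a', 'b', 'c', 'd']
```

Sorting first makes the result reproducible across callers, though (as noted above) it does not remove the greedy scan's bias toward whichever features come first in the canonical order.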

Who would like to take the lead for this issue?

I'd be happy to pick this up.

Awesome @dlaprins ! Thank you!