scikit-learn-contrib/category_encoders

ValueError: `X` and `y` both have indexes, but they do not match.

RNarayan73 opened this issue · 4 comments

Expected Behavior

When running any of the category encoders, e.g. TargetEncoder(), within a pipeline through permutation_test_score(), it errors out with the message above.
The error occurs in the convert_inputs() function, which performs the check

if any(X.index != y.index):

before raising the error.
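For illustration, a minimal stand-in for that check (check_aligned is a hypothetical name for this sketch, not the library's actual API) behaves like this:

```python
import pandas as pd

def check_aligned(X: pd.DataFrame, y: pd.Series) -> None:
    # Simplified stand-in for the index check in convert_inputs()
    if any(X.index != y.index):
        raise ValueError("`X` and `y` both have indexes, but they do not match.")

X = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
y_ok = pd.Series([0, 1, 0], index=[0, 1, 2])
y_bad = pd.Series([0, 1, 0], index=[2, 0, 1])  # same rows, permuted index

check_aligned(X, y_ok)  # passes silently
try:
    check_aligned(X, y_bad)
except ValueError as e:
    print(e)  # the error from this report
```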

Actual Behavior

The error is not correct and shouldn't occur.
When I run the same check on my inputs X (a DataFrame) and y (a Series), it does not trigger.

In fact, when I load the input data, I split it into X and y, label-encode y, explicitly convert it into a pd.Series, and assign it X.index, so the two indexes are in fact identical.

If, in contrast, I leave the label-encoded y as an ndarray instead of converting it to a pd.Series, the error doesn't occur!

Also, note that fitting the same pipeline directly with the same X (DataFrame) and y (Series) works absolutely fine.

Steps to Reproduce the Problem

See an example of my pipeline below:

  1. Create an arbitrary pipeline as follows:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from category_encoders import TargetEncoder

test_pipe = Pipeline([('enc', TargetEncoder()), ('clf', SGDClassifier(loss='log_loss'))])

  2. Run:

from sklearn.model_selection import permutation_test_score

score, perm_scores, pvalue = permutation_test_score(test_pipe, X, y)
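The steps above assume X (a DataFrame) and y (a Series) already exist; a minimal synthetic setup matching the report's description (column names and sizes are made up for illustration) might look like:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "cat": rng.choice(["a", "b", "c"], size=100),  # categorical feature for the encoder
    "num": rng.normal(size=100),
})
# Label-encode the target and, as described in the report, wrap it back
# into a Series that explicitly reuses X's index:
y = pd.Series(LabelEncoder().fit_transform(rng.choice(["no", "yes"], size=100)),
              index=X.index)

assert not any(X.index != y.index)  # the indexes really are identical
```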

Specifications

  • category_encoders version: 2.5.1.post0
  • Python: 3.10.8
  • Pandas: 1.5.1

permutation_test_score permutes the targets, but when y is a pandas object the shuffle reorders the index along with the data, so when the encoder sees (X, y_shuffled) their indexes no longer line up. We hadn't thought of this use case when we decided to error on mismatched indexes; we could downgrade to a warning, or just suggest passing y as a numpy array in this case.
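The effect described above can be sketched without sklearn at all; the explicit permutation below stands in for the shuffle that permutation_test_score performs internally:

```python
import pandas as pd

X = pd.DataFrame({"a": [10, 20, 30, 40]})
y = pd.Series([0, 1, 0, 1], index=X.index)

# A permuted view of y: the pandas index travels with the values.
y_shuffled = y.iloc[[2, 0, 3, 1]]
print(list(y_shuffled.index))  # [2, 0, 3, 1] rather than [0, 1, 2, 3]

# So the encoder's alignment check now trips:
assert any(X.index != y_shuffled.index)

# Whereas the ndarray version carries no index at all:
y_as_array = y.to_numpy()[[2, 0, 3, 1]]  # nothing left to mismatch
```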

I feel that we are justified in throwing an error: rows not matching indicates that something is "wrong," but in this case you're misaligning the rows on purpose to see how performance degrades. Maybe the ideal solution is for sklearn to drop from pandas to numpy, or to restore the original index, inside permutation_test_score (it looks like that's the only place where they use _shuffle, so while maybe confusing semantically, it should work out), but that seems like an odd ask.

I agree with @bmreiniger here, but I also agree that it's not nice that the library raises an error in this case.
As Ben says, the error is raised because we want to prevent people from supplying data frames and series with non-matching indices, since that causes a lot of confusion and leads to undefined behaviour. That's also the reason why it works for np arrays. I like the idea of solving it in sklearn in particular.
Should I raise an issue with them?
In the meantime, I think it would make sense to include an additional hint in the error message, e.g. "if you are shuffling your input data on purpose (for instance via permutation_test_score), use np arrays instead of data frames / series". Would that be ok?
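The hinted workaround can be sketched as follows; LogisticRegression and the synthetic data are placeholders here, since a real TargetEncoder pipeline would require category_encoders to be installed:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(0)
X = pd.DataFrame({"num": rng.normal(size=60)})
y = pd.Series(rng.integers(0, 2, size=60), index=X.index)

# Workaround: hand permutation_test_score a plain ndarray, so the shuffled
# targets carry no pandas index for the encoder's check to compare against.
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(), X, y.to_numpy(), n_permutations=10, random_state=0
)
```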

I've added the suggested warning in #397

Thanks, @bmreiniger and @PaulWestenthanner.
I see your reasoning and agree that, within the scope of category_encoders, it would be incorrect to allow a shuffled y. permutation_test_score would probably be the right place to deal with this.