scikit-learn-contrib/category_encoders

Quadratic time intersection on Pandas categories

willsthompson opened this issue · 2 comments

Expected Behavior

Using pandas CategoricalDtype columns with OrdinalEncoder does not incur a performance penalty.

Actual Behavior

For ordered categorical columns, the categories stored on the pandas dtype are intersected with the encoder's computed categories in quadratic time, here:

# Avoid using pandas category dtype meta-data if possible, see #235, #238.
if X[col].dtype.ordered:
    categories = [c for c in X[col].dtype.categories if c in categories]
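To make the cost concrete, here is a minimal sketch that isolates the same list comprehension in plain Python. It assumes `categories` is a plain list (so each `c in categories` is a linear scan); the helper names `intersect_quadratic` and `time_it` are made up for illustration and are not part of category_encoders.

```python
import time


def intersect_quadratic(dtype_categories, categories):
    # Same shape as the snippet above: `c in categories` scans the list
    # from the start on every iteration, so the total work is O(n * m).
    return [c for c in dtype_categories if c in categories]


def time_it(n):
    # Hypothetical helper: build n category labels and time one intersection.
    cats = [f"Cat{i}" for i in range(n)]
    start = time.perf_counter()
    result = intersect_quadratic(cats, list(reversed(cats)))
    return time.perf_counter() - start, result


for n in (1000, 2000, 4000):
    elapsed, result = time_it(n)
    print(f"n={n}: {elapsed:.4f}s, kept {len(result)} categories")
```

If the intersection is quadratic, doubling `n` should roughly quadruple the elapsed time; exact numbers will vary by machine.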

Steps to Reproduce the Problem

  1. Create a Series with a large number of categories, e.g.
import pandas as pd
categories = [f"Cat{i}" for i in range(10000)]
series = pd.Series(
    categories,
    # Pass the dtype by keyword; the second positional argument is `index`.
    dtype=pd.CategoricalDtype(categories=categories, ordered=True),
)
  2. Apply ordinal encoder to the series

Specifications

  • Version: 2.5.0
  • Platform:
  • Subsystem:

Proposed fix

This would be a simple one-line change:

categories = list(set(categories).intersection(set(X[col].dtype.categories)))
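One caveat with a plain set intersection: Python sets are unordered, so the result may not follow the order of the dtype's categories, which is the reason the ordered branch exists. A possible variant, sketched below, keeps the dtype's order and still runs in linear time by building the set once and using it only for membership checks. It assumes the category values are hashable; `ordered_intersection` is a hypothetical helper name, not an existing function in category_encoders.

```python
def ordered_intersection(dtype_categories, categories):
    # Build the lookup set once: O(len(categories)).
    seen = set(categories)
    # Iterate in dtype order so the ordered categorical's ordering is kept;
    # each membership check is O(1) on average.
    return [c for c in dtype_categories if c in seen]


dtype_categories = ["low", "medium", "high"]
observed = ["high", "low"]
print(ordered_intersection(dtype_categories, observed))  # ['low', 'high']
```

In the encoder this would amount to replacing `if c in categories` with a lookup against a precomputed set, leaving the rest of the comprehension unchanged.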

I'd be happy to put a PR together if this looks okay to you.

Hi @willsthompson,
thanks for pointing that out. Please go ahead and create a PR.