scikit-learn-contrib/category_encoders

BaseNEncoder.inverse_transform fails when column contains regex metacharacters

pimlock opened this issue · 0 comments

Expected Behavior

BaseNEncoder.inverse_transform() should work correctly with column names that contain regex metacharacters. For example, with column names such as my_column (test) or test [123], the characters ()[] are currently interpreted as a regex capturing group and a character class, but they should be treated as literals.

See:

col_list = [col0 for col0 in out_cols if re.match(str(col)+'_\\d+', str(col0))]
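
To illustrate the cause, here is a minimal standalone sketch that mimics the list comprehension above outside the library. The out_cols values are made up for the example (they assume encoded columns are named <col>_<digit>, as the pattern implies), and using re.escape is one possible fix, not necessarily the one the maintainers would choose.

```python
import re

col = "A (test)"
# Hypothetical encoded column names, assumed to follow the "<col>_<n>" scheme.
out_cols = ["A (test)_0", "A (test)_1", "other"]

# Unescaped: "(test)" is parsed as a capturing group, so the pattern would
# match "A test_0" but not the literal name "A (test)_0".
unescaped = [c for c in out_cols if re.match(str(col) + r"_\d+", str(c))]

# Escaped: re.escape turns every metacharacter in the column name into a literal.
escaped = [c for c in out_cols if re.match(re.escape(str(col)) + r"_\d+", str(c))]

print(unescaped)  # []
print(escaped)    # ['A (test)_0', 'A (test)_1']
```

With the unescaped pattern the list comes back empty, so the later col_list[0] lookup is what raises the IndexError.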

Actual Behavior

Calling inverse_transform() when the input column name contains a regex metacharacter (e.g. ()) raises an exception:

Traceback (most recent call last):
  File "site-packages/IPython/core/interactiveshell.py", line 3397, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-92-c30af6a1928b>", line 10, in <cell line: 10>
    inversed = enc.inverse_transform(transformed)
  File "site-packages/category_encoders/basen.py", line 268, in inverse_transform
    X = self.basen_to_integer(X, self.cols, self.base)
  File "site-packages/category_encoders/basen.py", line 358, in basen_to_integer
    insert_at = out_cols.index(col_list[0])
IndexError: list index out of range

Steps to Reproduce the Problem

from category_encoders import BaseNEncoder
import pandas as pd

col_name = "A (test)"
X = pd.DataFrame(data={col_name: ["A", "B", "A", "C"]})

enc = BaseNEncoder(cols=[col_name]).fit(X)

transformed = enc.transform(X)

# fails with `IndexError: list index out of range`
inversed = enc.inverse_transform(transformed)

Specifications

  • Version: 2.5.1
  • Platform: Any
  • Subsystem: Any