scikit-learn-contrib/category_encoders

[new features]: Quantile Encoder

cmougan opened this issue · 3 comments

Implementation of Quantile Encoder from the publication (https://arxiv.org/abs/2105.13783)

Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems
Carlos Mougan, David Masip, Jordi Nin, Oriol Pujol

Hi cmougan. Thanks for sharing the paper. I am trying to build a rental price AVM model, but I have categorical features with high cardinality. I am going through the paper and am having some difficulty grasping the methodology. Is there a coded solution anywhere for this?

Hi @Zainny1234! This encoder is used the same way as the other encoders in the category_encoders package:

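    >>> from category_encoders import QuantileEncoder
    >>> import pandas as pd
    >>> # Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
    >>> from sklearn.datasets import load_boston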
    >>> bunch = load_boston()
    >>> y = bunch.target
    >>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    >>> enc = QuantileEncoder(cols=['CHAS', 'RAD']).fit(X, y)
    >>> numeric_dataset = enc.transform(X)
    >>> print(numeric_dataset.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 506 entries, 0 to 505
    Data columns (total 13 columns):
    CRIM       506 non-null float64
    ZN         506 non-null float64
    INDUS      506 non-null float64
    CHAS       506 non-null float64
    NOX        506 non-null float64
    RM         506 non-null float64
    AGE        506 non-null float64
    DIS        506 non-null float64
    RAD        506 non-null float64
    TAX        506 non-null float64
    PTRATIO    506 non-null float64
    B          506 non-null float64
    LSTAT      506 non-null float64
    dtypes: float64(13)
    memory usage: 51.5 KB
    None
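
For intuition about what the encoder computes, here is a minimal pandas sketch. It assumes the m-regularized form described in the paper, where each category's target quantile is blended with the global target quantile and `m` controls how strongly small categories are pulled toward the global value. The function name `quantile_encode` and its signature are hypothetical and chosen for illustration; this is not the library's implementation.

```python
import pandas as pd


def quantile_encode(categories, target, quantile=0.5, m=1.0):
    """Hypothetical helper: map each category to a smoothed target quantile.

    Assumed formula (a sketch, not the library's code):
        (n_cat * q_cat + m * q_global) / (n_cat + m)
    """
    df = pd.DataFrame({"cat": list(categories), "y": list(target)})

    # Quantile of the target over the whole training set (the "prior").
    prior = df["y"].quantile(quantile)

    # Per-category sample count and per-category target quantile.
    grouped = df.groupby("cat")["y"]
    counts = grouped.count()
    group_q = grouped.quantile(quantile)

    # Blend each category's quantile with the prior, weighted by its count.
    table = (counts * group_q + m * prior) / (counts + m)

    # Return the encoded column and the lookup table used to build it.
    return df["cat"].map(table), table


# Toy data: category "a" has three samples, category "b" has one.
encoded, table = quantile_encode(["a", "a", "a", "b"], [10, 20, 30, 100])
print(table)  # a -> 21.25, b -> 62.5 (global median is 25)
```

In this toy example the single-sample category `b` is pulled halfway toward the global median, while `a`, with three samples, stays close to its own median of 20. Raising `m` increases that pull for every category.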

Until the PR is merged, you can use the encoder from the sktools package:

    from sktools import QuantileEncoder

Added in PR #303.