jeongyoonlee/Kaggler

LabelEncoder Usage

r0f1 opened this issue · 2 comments

r0f1 commented

Hi,
The following piece of code throws an error. Why?

import pandas as pd
from kaggler.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(pd.Series([1,1,1,2,2,2,3,3,3]))

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
c:\Users\semic\Desktop\dsi19-oct\main.py in <module>
      1 le = LabelEncoder()
----> 2 le.fit_transform(pd.Series([1,1,1,2,2,2,3,3,3]))

~\Anaconda3\lib\site-packages\kaggler\preprocessing\categorical.py in fit_transform(self, X, y)
    121         """
    122 
--> 123         self.label_encoders = [None] * X.shape[1]
    124         self.label_maxes = [None] * X.shape[1]
    125 

IndexError: tuple index out of range

Unlike sklearn.preprocessing's LabelEncoder, which encodes the labels of a single array, fit_transform() in this package takes a pandas.DataFrame as input and encodes every column in it. A pandas.Series is one-dimensional, so X.shape[1] does not exist, and that is why you get the "tuple index out of range" error. Here is the method in question:

    def fit_transform(self, X, y=None):
        """Encode categorical columns into label encoded columns
        Args:
            X (pandas.DataFrame): categorical columns to encode
        Returns:
            (pandas.DataFrame): label encoded columns
        """

        self.label_encoders = [None] * X.shape[1]
        self.label_maxes = [None] * X.shape[1]

        for i, col in enumerate(X.columns):
            self.label_encoders[i], self.label_maxes[i] = \
                self._get_label_encoder_and_max(X[col])

            X.loc[:, col] = (X[col].fillna(NAN_INT)
                             .map(self.label_encoders[i])
                             .fillna(0))

        return X
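
To make the shape difference concrete, here is a minimal sketch that uses only pandas and does not call kaggler at all. The column name "x" is an arbitrary choice for illustration, not something the library requires.

```python
import pandas as pd

s = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3])

# A Series is one-dimensional, so its shape tuple has a single entry.
print(s.shape)  # (9,)

# Indexing position 1 of that tuple is what fails inside fit_transform().
try:
    s.shape[1]
except IndexError as exc:
    print(exc)  # tuple index out of range

# Wrapping the Series in a one-column DataFrame gives a 2-D shape.
# The column name "x" is an assumption made for this example.
df = s.to_frame(name="x")
print(df.shape)  # (9, 1)
n_columns = df.shape[1]  # now a valid lookup
```

Passing a one-column DataFrame like df to fit_transform() should therefore get past the X.shape[1] line. Since the method above assigns the encoded values back into X through X.loc, it may be safer to pass df.copy() if you still need the original values.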

r0f1 commented

Ok thanks!