scikit-learn-contrib/category_encoders

Memory increase of WOEEncoder for newer category_encoders version

Piecer-plc opened this issue · 2 comments

Memory increase of WOEEncoder for category_encoders version >=2.0.0

Hi, I noticed another memory issue, this time with WOEEncoder. I reported a similar bug in #335; the two reports differ in the encoder used and in the dataset. I am filing a separate report to keep the two encoder APIs distinct.

Expected Behavior

Similar memory usage across versions

Actual Behavior

According to my experiments, when the category_encoders version is 2.0.0 or higher, the memory usage of weight_enc.fit(train[weight_encode], train['target']) increases from 58MB to more than 200MB.

Memory(MB) Version
209 2.3.0
209 2.2.2
209 2.1.0
209 2.0.0
58 1.3.0

Steps to Reproduce the Problem

Step 1: Download the dataset

train.zip

Step 2: install category_encoders

pip install category_encoders==<version>

Step 3: run the script below under each category_encoders version and record the memory usage

import tracemalloc

import pandas as pd
import category_encoders as ce

train = pd.read_csv('train.csv')

# Column groups; all of them are passed to the WOE encoder below
object_col_label = ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']
one_hot_encode = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4']
target_encode = ['nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']
weight_encode = target_encode + ['ord_4', 'ord_5', 'ord_3'] + one_hot_encode + object_col_label

weight_enc = ce.woe.WOEEncoder(cols=weight_encode)

# Start tracing right before fit() so only its allocations are counted
tracemalloc.start()
weight_enc.fit(train[weight_encode], train['target'])
current3, peak3 = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"WOEEncoder.fit memory usage is {current3 / 1024 / 1024:.1f} MB; peak memory was {peak3 / 1024 / 1024:.1f} MB")
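
To make Step 3 repeatable across versions, the measurement could be wrapped in small helpers that record the installed version next to the peak. This is only a sketch: `measure_peak_mb`, `installed_version`, `append_result`, and the output file name are hypothetical names introduced here, not part of category_encoders or of the original script.

```python
import csv
import tracemalloc
from importlib import metadata
from pathlib import Path


def measure_peak_mb(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return the peak traced memory in MB."""
    tracemalloc.start()
    try:
        fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1024 / 1024


def installed_version(package):
    """Return the installed version of a package, or a placeholder string."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def append_result(path, version, peak_mb):
    """Append one (version, peak_mb) row to a CSV file, adding a header if new."""
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["version", "peak_mb"])
        writer.writerow([version, f"{peak_mb:.1f}"])


# Hypothetical usage with the objects from the script above:
# peak = measure_peak_mb(weight_enc.fit, train[weight_encode], train['target'])
# append_result("woe_memory.csv", installed_version("category_encoders"), peak)
```

Running this once per installed version would build up a table like the one under "Actual Behavior".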

Specifications

Version: 2.3.0, 2.2.2, 2.1.0, 2.0.0, 1.3.0
Platform: Ubuntu 16.04
OS : Ubuntu
CPU : Intel(R) Core(TM) i9-9900K CPU
GPU : TITAN V

glevv commented

This happens because WOEEncoder relies on OrdinalEncoder internally, and OrdinalEncoder makes a deep copy of the input data:

X = X_in.copy(deep=True)
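
The cost of such a copy can be checked in isolation with tracemalloc. The sketch below is an assumption-laden illustration: it uses a synthetic numeric DataFrame rather than the dataset from this issue, and `traced_copy_cost` is a hypothetical helper, so it only shows that a deep copy allocates roughly the size of the frame again, not the exact numbers reported above.

```python
import tracemalloc

import numpy as np
import pandas as pd


def traced_copy_cost(df):
    """Return the bytes newly traced by tracemalloc for a deep copy of df."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    copied = df.copy(deep=True)  # same call as in the snippet above
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del copied
    return after - before


# Synthetic stand-in for the input data (not the dataset from this issue):
# two 8-byte columns of one million rows, about 16 MB of column data.
df = pd.DataFrame({
    "a": np.arange(1_000_000, dtype=np.int64),
    "b": np.zeros(1_000_000, dtype=np.float64),
})

cost = traced_copy_cost(df)
print(f"deep copy allocated about {cost / 1024 / 1024:.1f} MB")
```

If the encoder does not need to mutate its input, avoiding or narrowing this copy is one place where the extra memory might be recovered.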