scikit-learn-contrib/category_encoders

EOFError Raised When Calling HashingEncoder

shi8tou opened this issue · 6 comments

Expected Behavior

HashingEncoder should encode the categorical columns successfully.

Actual Behavior

Got an EOFError while calling HashingEncoder's fit_transform.

Steps to Reproduce the Problem

  1. Packages installed on my laptop: category_encoders==2.6.0 and python==3.10.0

  2. Dataset is here:
    test_1.csv

  3. Run the following code:

```python
import pandas as pd
import category_encoders as ce

dataset = pd.read_csv('test_1.csv')
he = ce.HashingEncoder(cols=['purchase_address'], n_components=2)

dd = he.fit_transform(dataset)

dd.columns
```

  4. The line `dd = he.fit_transform(dataset)` throws an EOFError.

Specifications

  • Version:
  • Platform:
  • Subsystem:

I can't recreate the issue in Colab, which is running python 3.10.12.

I got an email with a traceback, but it's not here; some wires crossed in GitHub, or it was deleted, or...?
It sounded like an issue with the parallelization, and maybe not enough memory/space for it, but I'm not an expert on that.

I also couldn't reproduce it on my local Linux machine using category-encoders 2.6.0 and Python 3.10 in a fresh conda environment.
As Ben pointed out, the hashing encoder behaves differently on Windows when it comes to multiprocessing. Are you using Windows or Linux/Mac?
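The platform difference comes down to the default multiprocessing start method: "fork" on Linux, "spawn" on Windows, and "spawn" on macOS as well since Python 3.8. A quick diagnostic sketch to check what your interpreter uses:

```python
# "fork" on Linux, "spawn" on Windows and on macOS (Python 3.8+).
# With "spawn", starting a process re-imports the main module, so any
# unguarded top-level code runs again in the worker.
import multiprocessing

print(multiprocessing.get_start_method())
```

If this prints "spawn", an unguarded script that triggers the hashing encoder's multiprocessing path would hit exactly the bootstrapping error shown below.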

Thanks.
I am using a MacBook Air with an M2 chip.

Here is the error I got:
```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 269, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/test/test_dict.py", line 8, in <module>
    dd = he.fit_transform(dataset)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/base.py", line 848, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/utils.py", line 315, in fit
    X_transformed = self.transform(X, override_return_df=True)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/utils.py", line 488, in transform
    X = self._transform(X)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/hashing.py", line 174, in _transform
    data_lock = multiprocessing.Manager().Lock()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/managers.py", line 562, in start
    self._process.start()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Traceback (most recent call last):
  File "/test/test_dict.py", line 8, in <module>
    dd = he.fit_transform(dataset)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/base.py", line 848, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/utils.py", line 315, in fit
    X_transformed = self.transform(X, override_return_df=True)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/utils.py", line 488, in transform
    X = self._transform(X)
  File "/Users/sss/virtualenvs/functions/lib/python3.10/site-packages/category_encoders/hashing.py", line 174, in _transform
    data_lock = multiprocessing.Manager().Lock()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/managers.py", line 566, in start
    self._address = reader.recv()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
```

Thanks. Notice that the first traceback ends with a RuntimeError from multiprocessing's bootstrapping check; the EOFError comes at the end of a second, nearly identical traceback in the parent process.

You might try a newer version of this package: #428 updated the hashing encoder significantly.

The same error shows up on Stack Overflow, though I'm not sure how much it helps: https://stackoverflow.com/q/61931669/10495893

I got access to an old MacBook (still with an Intel chip) but also could not reproduce the issue on that machine (using a fresh conda installation). Can you try version 2.6.3, as Ben suggests, and see if that solves the issue?