openml-labs/gama

`n_jobs=-1` is not converted to use all cores; instead GAMA silently fails to run any evaluations

iXanthos opened this issue · 5 comments

Greetings,

I am trying to run GAMA on a small dataset (180 samples, 8 features) with the default settings, but I receive a BrokenPipeError.
More specifically, this is the code I am running:

from gama import GamaClassifier

print("Started GAMA demo...")

# train_data, train_target, test_data and test_target are prepared earlier in the script
automl = GamaClassifier(max_total_time=300, n_jobs=-1)
automl.fit(train_data, train_target)
predictions = automl.predict(test_data)

and the error I get:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

(the BrokenPipeError traceback above is printed several more times by the worker processes, interleaved with the main traceback below)

Traceback (most recent call last):
  File "AutoML_init_tester.py", line 113, in <module>
    gama_demo(train_data, train_target, test_data, test_target)
  File "AutoML_init_tester.py", line 83, in gama_demo
    automl.fit(train_data, train_target)
  File "/home/ixanthos/Documents/gama_venv/lib/python3.8/site-packages/gama/GamaClassifier.py", line 134, in fit
    super().fit(x, y, *args, **kwargs)
  File "/home/ixanthos/Documents/gama_venv/lib/python3.8/site-packages/gama/gama.py", line 549, in fit
    self.model = self._post_processing.post_process(
  File "/home/ixanthos/Documents/gama_venv/lib/python3.8/site-packages/gama/postprocessing/best_fit.py", line 26, in post_process
    self._selected_individual = selection[0]
IndexError: list index out of range

Does the error mean that no model could be fitted in the allotted time (5 minutes), or does it mean something else?

Regards,
IX

Hi, thanks for reporting your issue! The BrokenPipeError does not necessarily mean that no model could be fit. It typically occurs when shutting down the subprocesses that evaluate pipelines, and can normally be safely ignored (we're working on making sure it doesn't happen, though).

It's actually this little part which likely prevented predictions from being made:

	super().fit(x, y, *args, **kwargs)
  File "/home/ixanthos/Documents/gama_venv/lib/python3.8/site-packages/gama/gama.py", line 549, in fit
	self.model = self._post_processing.post_process(
  File "/home/ixanthos/Documents/gama_venv/lib/python3.8/site-packages/gama/postprocessing/best_fit.py", line 26, in post_process
	self._selected_individual = selection[0]
IndexError: list index out of range

Could you provide us with the data you used to generate this behavior?

Unfortunately I cannot provide you with the data, as it is proprietary. But I have tested the same data (same splits and all) in other AutoML frameworks, and so far only GAMA produces this error.

It's actually this little part which likely prevented predictions from being made:

	super().fit(x, y, *args, **kwargs)
  File "/home/ixanthos/Documents/gama_venv/lib/python3.8/site-packages/gama/gama.py", line 549, in fit
	self.model = self._post_processing.post_process(
  File "/home/ixanthos/Documents/gama_venv/lib/python3.8/site-packages/gama/postprocessing/best_fit.py", line 26, in post_process
	self._selected_individual = selection[0]
IndexError: list index out of range

I also saw this index error. Does it mean that there was no fitted model, or is it something else?
Do you think this error can be fixed if I increase max_total_time?

Regards,
Iordanis

Yes, I suspect that no pipeline has been successfully evaluated. This could mean either that there is something in the input data that GAMA doesn't handle, or that it simply did not have enough time. Given how small the dataset is, I would rather expect the former. The logs (gama.log, evaluations.log) might reveal more specifically what the issue is, if you can share them.
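
In case it helps locating them, here is a minimal sketch of pinning the log output to a fixed directory (assuming the output_directory and store keyword arguments accepted by GamaClassifier in your installed version; "gama_logs" is just an example path):

from gama import GamaClassifier

# Write gama.log and evaluations.log to a known directory instead of an
# auto-generated one, and keep the log files after the run.
automl = GamaClassifier(
    max_total_time=300,
    n_jobs=-1,
    output_directory="gama_logs",
    store="logs",
)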

I am pasting the logs:

evaluations.log

id;pid;t_start;t_wallclock;t_process;score;pipeline;error;parent0;parent1;origin

gama.log

[2021-07-30 02:45:31,327 - gama.gama] Using GAMA version 21.0.0.
[2021-07-30 02:45:31,327 - gama.gama] INIT:GamaClassifier(scoring=neg_log_loss,regularize_length=True,max_pipeline_length=None,random_state=None,max_total_time=300,max_eval_time=None,n_jobs=-1,max_memory_mb=None,verbosity=30,search=AsyncEA(),post_processing=BestFitPostProcessing(),output_directory=gama_1b8b00a8-fcb2-4c7f-9af6-f6de86c35809,store=logs)
[2021-07-30 02:45:31,328 - gama.utilities.generic.timekeeper] START: preprocessing default
[2021-07-30 02:45:31,331 - gama.utilities.generic.timekeeper] STOP: preprocessing default after 0.0026s.
[2021-07-30 02:45:31,331 - gama.utilities.generic.timekeeper] START: search AsyncEA
[2021-07-30 02:45:31,339 - gama.utilities.generic.async_evaluator] Process 19297 starting -1 subprocesses.
[2021-07-30 02:45:31,339 - gama.search_methods.async_ea] Starting EA with new population.
[2021-07-30 02:50:00,379 - gama.utilities.generic.async_evaluator] Signaling 0 subprocesses to stop.
[2021-07-30 02:50:00,383 - gama.gama] Search phase evaluated 0 individuals.
[2021-07-30 02:50:00,384 - gama.utilities.generic.timekeeper] STOP: search AsyncEA after 269.0520s.
[2021-07-30 02:50:00,384 - gama.utilities.generic.timekeeper] START: postprocess BestFitPostProcessing

Regards

It looks like n_jobs=-1 is broken and no longer correctly converted to the CPU count, which results in no evaluation processes being started (and thus no individuals being evaluated). I'll patch it, but because I'm on vacation I might not do an immediate PyPI release. In the meantime, please set n_jobs explicitly to the number of cores you want to use (or leave it as None to use half of your cores). Thanks again for raising the issue and providing me with the details to find the bug.
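
For example, a minimal sketch of the workaround, using the standard library's multiprocessing.cpu_count() to pick the core count explicitly:

import multiprocessing
from gama import GamaClassifier

# Pass an explicit core count instead of n_jobs=-1 until the fix is released.
n_cores = multiprocessing.cpu_count()
automl = GamaClassifier(max_total_time=300, n_jobs=n_cores)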

The bug should be fixed in the 21.0.1 release (not yet published).