Multiprocess backend does not work well with nested multiprocessing
bouthilx opened this issue · 0 comments
Training models with PyTorch generally requires multi-worker data loaders for efficient data loading. The current multi-process executor in Oríon does not support starting processes inside its parallel workers, which are themselves subprocesses spawned with Python's multiprocessing module. Those pool workers are daemonic, and Python does not allow daemonic processes to have children. This is very constraining and should be fixed.
Example of a reported stack trace:
Traceback (most recent call last):
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 25, in _couldpickle_exec
    result = function(*args, **kwargs)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/client/runner.py", line 122, in _optimize
    return fct(**unflatten(kwargs))
  File "main.py", line 112, in run
    signal = task.run()
  File "/home/mila/s/schmidtv/ocp-project/ocp-drlab/ocpmodels/tasks/task.py", line 50, in run
    return self.trainer.train(
  File "/home/mila/s/schmidtv/ocp-project/ocp-drlab/ocpmodels/trainers/single_trainer.py", line 224, in train
    train_loader_iter = iter(self.loaders["train"])
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 444, in __iter__
    return self._get_iterator()
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__
    w.start()
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/process.py", line 118, in start
    assert not _current_process._config.get('daemon'), \
AssertionError: daemonic processes are not allowed to have children
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 227, in async_get
    results.append(AsyncResult(future, future.get()))
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 54, in get
    r = self.future.get(timeout)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AssertionError: daemonic processes are not allowed to have children
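One possible direction, sketched here only as an illustration and not as Oríon's actual fix: the assertion fires because `multiprocessing.Pool` marks its workers as daemonic, so a pool whose workers ignore the `daemon` flag can start children such as DataLoader workers. The names `NoDaemonProcess`, `NoDaemonContext`, and `NestablePool` are hypothetical and do not exist in Oríon or in the standard library; the sketch assumes Python 3.8 or later.

```python
import multiprocessing
import multiprocessing.pool


class NoDaemonProcess(multiprocessing.Process):
    """Hypothetical process class whose daemon flag always reads False."""

    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        # Pool sets `daemon = True` on each worker; ignore that request
        # so the worker is allowed to start its own child processes.
        pass


class NoDaemonContext(type(multiprocessing.get_context())):
    """Hypothetical context: the default context, but with non-daemonic workers."""

    Process = NoDaemonProcess


class NestablePool(multiprocessing.pool.Pool):
    """Hypothetical Pool subclass that creates its workers from NoDaemonContext."""

    def __init__(self, *args, **kwargs):
        kwargs["context"] = NoDaemonContext()
        super().__init__(*args, **kwargs)
```

Used in place of `multiprocessing.Pool`, for example `with NestablePool(2) as pool: pool.map(train, configs)` with a hypothetical `train` function and `configs` list, each worker could then create its own multi-worker data loader. The trade-off is that non-daemonic workers are not terminated automatically when the parent exits, so the pool has to be closed and joined explicitly.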