Epistimio/orion

Multiprocess backend does not work well with nested multiprocessing

bouthilx opened this issue · 0 comments

When training models with PyTorch, multi-worker data loaders are generally necessary for efficient data loading. The current multiprocess executor in Oríon does not support starting processes inside its parallel workers (which are subprocesses spawned using Python's multiprocessing module), because those workers are daemonic and daemonic processes are not allowed to have children. This is very constraining and should be fixed.
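The limitation can be reproduced with the standard library alone. The following is a minimal sketch, assuming a POSIX system where the `fork` start method is available. The names `spawn_child` and `reproduce` are hypothetical and not part of Oríon: the pool worker stands in for an Oríon parallel worker, and the child process stands in for a data-loader worker.

```python
import multiprocessing as mp


def _noop():
    """Trivial target for the nested child process."""


def spawn_child(_):
    """Runs inside a pool worker and tries to start its own child process."""
    try:
        child = mp.get_context("fork").Process(target=_noop)
        child.start()  # pool workers are daemonic, so this assertion fails
        child.join()
        return "ok"
    except AssertionError as exc:
        return f"AssertionError: {exc}"


def reproduce():
    """Run spawn_child in a standard (daemonic) pool and return its report."""
    ctx = mp.get_context("fork")  # assumption: POSIX, fork is available
    with ctx.Pool(processes=1) as pool:
        return pool.map(spawn_child, [0])[0]


if __name__ == "__main__":
    print(reproduce())
```

On the Python versions this was written against, the returned string should carry the same `daemonic processes are not allowed to have children` message that a multi-worker data loader triggers.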

Example of a reported stack trace:

Traceback (most recent call last):
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 25, in _couldpickle_exec
    result = function(*args, **kwargs)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/client/runner.py", line 122, in _optimize
    return fct(**unflatten(kwargs))
  File "main.py", line 112, in run
    signal = task.run()
  File "/home/mila/s/schmidtv/ocp-project/ocp-drlab/ocpmodels/tasks/task.py", line 50, in run
    return self.trainer.train(
  File "/home/mila/s/schmidtv/ocp-project/ocp-drlab/ocpmodels/trainers/single_trainer.py", line 224, in train
    train_loader_iter = iter(self.loaders["train"])
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 444, in __iter__
    return self._get_iterator()
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__
    w.start()
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/process.py", line 118, in start
    assert not _current_process._config.get('daemon'), \
AssertionError: daemonic processes are not allowed to have children
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 227, in async_get
    results.append(AsyncResult(future, future.get()))
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/site-packages/orion/executor/multiprocess_backend.py", line 54, in get
    r = self.future.get(timeout)
  File "/home/mila/s/schmidtv/.conda/envs/ocp-a100/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
AssertionError: daemonic processes are not allowed to have children
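One commonly used workaround is a pool whose workers are never marked daemonic, so that they are allowed to start children. The sketch below illustrates that idea under stated assumptions and is not presented as the fix Oríon should adopt. The names `NonDaemonProcess`, `NonDaemonContext`, `NestablePool`, `train` and `run_trials` are hypothetical, and the code relies on `multiprocessing.context.ForkContext` and `ForkProcess`, which exist only on POSIX systems.

```python
import multiprocessing as mp
import multiprocessing.pool
from multiprocessing.context import ForkContext, ForkProcess


class NonDaemonProcess(ForkProcess):
    """Process that ignores attempts to mark it as daemonic."""

    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        # Pool sets daemon=True on its workers; silently ignore that.
        pass


class NonDaemonContext(ForkContext):
    """Fork context whose Process class is never daemonic."""

    Process = NonDaemonProcess


class NestablePool(mp.pool.Pool):
    """Pool whose workers may start child processes of their own."""

    def __init__(self, *args, **kwargs):
        kwargs["context"] = NonDaemonContext()
        super().__init__(*args, **kwargs)


def _square(x, out):
    out.put(x * x)


def train(x):
    """Stand-in for a training function that starts its own worker process."""
    ctx = mp.get_context("fork")
    out = ctx.SimpleQueue()
    child = ctx.Process(target=_square, args=(x, out))
    child.start()
    result = out.get()
    child.join()
    return result


def run_trials(values):
    """Evaluate train() on each value using non-daemonic pool workers."""
    with NestablePool(processes=2) as pool:
        return pool.map(train, values)


if __name__ == "__main__":
    print(run_trials([1, 2, 3]))
```

The trade-off is that non-daemonic workers are not terminated automatically when the parent exits, so the pool has to be closed or terminated explicitly (here via the `with` block), and any processes the workers started need to be joined by the workers themselves.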