RetroCirce/TONet

The problem of multi-GPU training

Closed this issue · 1 comment

Hi there,

When I train the model on multiple GPUs by setting `gpus=2` in `pl.Trainer()`, it throws an error:
TypeError: cannot pickle 'module' object.
How can I solve this problem? Thanks!

    ...
    trainer = pl.Trainer(
        deterministic = True,
        gpus = 2, # <---------
        checkpoint_callback = False,
        max_epochs = config.max_epoch,
        auto_lr_find = True,
        sync_batchnorm=True,
        # check_val_every_n_epoch = 1,
        val_check_interval = 0.25,
    )
    ...

Python 3.8.3
torch 1.7.1+cu110
Ubuntu 18.04.5 LTS

Global seed set to 19961206
Data List: data/test_adc.txt
Song Size: 12
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:01<00:00,  9.73it/s]
[W Context.cpp:69] Warning: torch.set_deterministic is in beta, and its design and  functionality may change in the future. (function operator())
/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:849: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(strategy="dp"|"ddp"|"ddp2")`. Setting `strategy="ddp_spawn"` for you.
  rank_zero_warn(
/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:147: LightningDeprecationWarning: Setting `Trainer(checkpoint_callback=False)` is deprecated in v1.5 and will be removed in v1.7. Please consider using `Trainer(enable_checkpointing=False)`.
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Traceback (most recent call last):
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data4/chengfang/.vscode-server/extensions/ms-python.python-2022.0.1814523869/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/data4/chengfang/.vscode-server/extensions/ms-python.python-2022.0.1814523869/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/data4/chengfang/.vscode-server/extensions/ms-python.python-2022.0.1814523869/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data4/chengfang/project/melodyExtraction/TONet/main.py", line 163, in <module>
    train()
  File "/data4/chengfang/project/melodyExtraction/TONet/main.py", line 97, in train
    trainer.fit(model, train_dataloader, test_dataloaders)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 173, in start_training
    self.spawn(self.new_process, trainer, self.mp_queue, return_result=False)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 201, in spawn
    mp.spawn(self._wrapped_function, args=(function, args, kwargs, return_queue), nprocs=self.num_processes)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 148, in start_processes
    process.start()
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
    return Popen(process_obj)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/data4/chengfang/.conda/envs/melody/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'module' object

Hi,

Sorry, our code does not support multi-GPU training out of the box, because each TONet model can be trained on a single card.

The error itself comes from the default `ddp_spawn` strategy (see the warning in your log): it pickles the training state to hand it to the spawned worker processes, and something in that state holds a raw module object, which cannot be pickled. To make our code support multiple GPUs, you need to use DDP (distributed data parallel). Roughly, it takes three steps: rewrite the data generator, add a distributed sampler, and pass `strategy="ddp"` to `pl.Trainer`, as sketched below.
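A rough sketch of those steps, under some assumptions: `MySongDataset`, the hook placement, and the batch size are illustrative placeholders, not part of the TONet code, and recent Lightning versions can also inject the `DistributedSampler` for you when `strategy="ddp"` is set.

    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, DistributedSampler

    class TONetModule(pl.LightningModule):
        # ... model definition ...

        # 1. & 2. Build the dataloader inside a Lightning hook and give
        # each GPU process its own shard of the data. Under strategy="ddp",
        # Lightning has already initialized the process group by the time
        # this hook runs.
        def train_dataloader(self):
            dataset = MySongDataset()  # hypothetical dataset class
            sampler = DistributedSampler(dataset, shuffle=True)
            return DataLoader(dataset, batch_size=16, sampler=sampler)

    # 3. Select the DDP strategy in the Trainer.
    model = TONetModule()
    trainer = pl.Trainer(
        deterministic=True,
        gpus=2,
        strategy="ddp",  # re-runs the script per process instead of pickling
    )
    trainer.fit(model)

Unlike `ddp_spawn`, `strategy="ddp"` launches the extra GPU processes by re-executing the script rather than pickling the trainer state, so it should also sidestep the `cannot pickle 'module' object` error.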

You can search for "PyTorch Lightning DDP" for more information.

I will mark this as a feature request and add multi-GPU support in the future.