MultiGPU Support (DataParallel)
Opened this issue · 1 comment
turtleman99 commented
Hi Fangzhou,
Thank you for your excellent work. The codebase is well-organized and easy to follow.
When I try to train on mini-ImageNet using anywhere from 2 to 8 GPUs with the following command,
python train.py --config=configs/convnet4/mini-imagenet/5_way_1_shot/train_reproduce.yaml --gpu=0,1,2,3
Python keeps reporting the error shown below:
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
Traceback (most recent call last):
File "train.py", line 265, in <module>
main(config)
File "train.py", line 130, in main
logits = model(x_shot, x_query, y_shot, inner_args, meta_train=True)
File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "**/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "**/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "**/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "**/PyTorch-MAML/models/maml.py", line 223, in forward
updated_params = self._adapt(
File "**/PyTorch-MAML/models/maml.py", line 185, in _adapt
params, mom_buffer = self._inner_iter(
File "**/PyTorch-MAML/models/maml.py", line 99, in _inner_iter
grads = autograd.grad(loss, params.values(),
File "**/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
return Variable._execution_engine.run_backward(
ValueError: grad requires non-empty inputs.
However, the code only works when using 1 GPU. Since n_episode=4, I would expect it to work on 2 or 4 GPUs as well.
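For what it's worth, a plausible (unconfirmed) explanation: since PyTorch 1.5, the replicas that nn.DataParallel creates hold their weights as plain tensors rather than registered nn.Parameters, so a parameter collection built inside the replica's forward (e.g. via named_parameters(), as MAML-style inner loops typically do) comes back empty, and autograd.grad is then called with an empty inputs list. A minimal CPU-only sketch of that failure mode, with the empty dict standing in for a replica's empty parameter collection:

```python
import torch

# MAML's inner loop differentiates the loss w.r.t. a collected
# parameter dict. If that collection is empty (as it can be inside a
# DataParallel replica, whose weights are no longer registered
# parameters), autograd.grad fails as in the traceback above.
x = torch.randn(5, requires_grad=True)
loss = (x ** 2).sum()

params = {}  # stands in for a replica whose named_parameters() yields nothing

try:
    grads = torch.autograd.grad(loss, list(params.values()))
except (ValueError, RuntimeError) as e:
    print(type(e).__name__, e)  # e.g. "grad requires non-empty inputs."
```

If that is the cause, the fix would be to collect the inner-loop parameters in a way that survives replication (or to use DistributedDataParallel with one process per GPU instead), but I haven't verified either against this repo.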
Framework Versions:
python: 3.8
pytorch: 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0
Our ultimate goal is to port this repo into our project, but we hit the same errors there. Any hints or help would be highly appreciated. Thanks!
woreom commented
@turtleman99 I have the same problem, have you found a solution?