MultiGPU Support (DataParallel)
Opened this issue · 1 comment
turtleman99 commented
Hi Fangzhou,
Thank you for your excellent work. The codebase is well-organized and easy to follow.
When I try to train on mini-ImageNet using anywhere from 2 to 8 GPUs with the following command,
python train.py --config=configs/convnet4/mini-imagenet/5_way_1_shot/train_reproduce.yaml --gpu=0,1,2,3
Python keeps reporting the error shown below:
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
Traceback (most recent call last):
File "train.py", line 265, in <module>
main(config)
File "train.py", line 130, in main
logits = model(x_shot, x_query, y_shot, inner_args, meta_train=True)
File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "**/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "**/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
File "**/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "**/PyTorch-MAML/models/maml.py", line 223, in forward
updated_params = self._adapt(
File "**/PyTorch-MAML/models/maml.py", line 185, in _adapt
params, mom_buffer = self._inner_iter(
File "**/PyTorch-MAML/models/maml.py", line 99, in _inner_iter
grads = autograd.grad(loss, params.values(),
File "**/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
return Variable._execution_engine.run_backward(
ValueError: grad requires non-empty inputs.
However, the code only works when using 1 GPU. Since n_episode=4, I would expect it to work on 2 or 4 GPUs as well.
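For what it's worth, a plausible (unconfirmed) explanation: since PyTorch 1.5, the replicas that nn.DataParallel creates hold their weights as plain tensors rather than registered nn.Parameters, so a parameter collection built inside the replica's forward (e.g. via named_parameters(), as MAML-style inner loops typically do) comes back empty, and autograd.grad is then called with an empty inputs list. A minimal CPU-only sketch of that failure mode, with the empty dict standing in for a replica's empty parameter collection:

```python
import torch

# MAML's inner loop differentiates the loss w.r.t. a collected
# parameter dict. If that collection is empty (as it can be inside a
# DataParallel replica, whose weights are no longer registered
# parameters), autograd.grad fails as in the traceback above.
x = torch.randn(5, requires_grad=True)
loss = (x ** 2).sum()

params = {}  # stands in for a replica whose named_parameters() yields nothing

try:
    grads = torch.autograd.grad(loss, list(params.values()))
except (ValueError, RuntimeError) as e:
    print(type(e).__name__, e)  # e.g. "grad requires non-empty inputs."
```

If that is the cause, the fix would be to collect the inner-loop parameters in a way that survives replication (or to use DistributedDataParallel with one process per GPU instead), but I haven't verified either against this repo.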
Framework Versions:
python: 3.8
pytorch: 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0
Our ultimate goal is to port this repo into our project, but we hit the same errors there. Any hints or help would be highly appreciated. Thanks!
woreom commented
@turtleman99 I have the same problem, have you found a solution?