Multi GPU training
VV1314 opened this issue · 7 comments
PConv seems to work only on a single GPU. When I run it on two GPUs, it doesn't work. Can this be resolved?
Hi, could you be more specific about your question? PConv should work with multiple GPUs, and some works have successfully reproduced our results, e.g., the repo.
"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" this is the error
This is a general issue that is independent of PConv. Please make sure your model and data are on the same device. You may refer to the following links for more details:
https://stackoverflow.com/questions/70884007/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least
https://discuss.pytorch.org/t/nn-conv2d-causing-error-in-multi-gpu-learning/142804/2
https://zhuanlan.zhihu.com/p/560322701
https://discuss.pytorch.org/t/error-with-dataparallel/139384
Same issue. This may be caused by the internal implementation of DataParallel. To solve it, you can simply rename forward_split_cat to forward and comment out the code in __init__ that assigns self.forward.
When using nn.DataParallel for single-machine multi-GPU training, I got "Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)". The fix is as follows:
Rewrite every place where self.forward is assigned, such as self.forward = self.forward_slicing, self.forward = self.forward_split_cat, self.forward = self.forward_layer_scale, and so on:
import torch
import torch.nn as nn
from torch import Tensor

class Partial_conv3(nn.Module):
    def __init__(self, dim, n_div, forward):
        super().__init__()
        self.dim_conv3 = dim // n_div
        self.dim_untouched = dim - self.dim_conv3
        self.partial_conv3 = nn.Conv2d(self.dim_conv3, self.dim_conv3, 3, 1, 1, bias=False)
        # Store only the mode name; do not assign a method to self.forward.
        self.forward_type = forward
        # if forward == 'slicing':
        #     self.forward = self.forward_slicing
        # elif forward == 'split_cat':
        #     self.forward = self.forward_split_cat
        # else:
        #     raise NotImplementedError

    def forward(self, x: Tensor) -> Tensor:
        if self.forward_type == 'slicing':
            # only for inference
            x = x.clone()  # !!! Keep the original input intact for the residual connection later
            x[:, :self.dim_conv3, :, :] = self.partial_conv3(x[:, :self.dim_conv3, :, :])
        elif self.forward_type == 'split_cat':
            # for training/inference
            x1, x2 = torch.split(x, [self.dim_conv3, self.dim_untouched], dim=1)
            x1 = self.partial_conv3(x1)
            x = torch.cat((x1, x2), 1)
        else:
            # Same check the original __init__ performed for unknown modes
            raise NotImplementedError
        return x
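A note on why this rewrite probably helps, as a plain-Python sketch rather than actual PyTorch code. The assumption here is that DataParallel builds each per-GPU replica by shallow-copying the module's instance attributes; if so, a bound method stored in self.forward keeps pointing at the original module (whose weights are on cuda:0) even when the replica is called on cuda:1. The Block class, the make_replica helper, and the name field are hypothetical stand-ins for illustration only.

```python
import copy


class Block:
    """Plain-Python stand-in for a module; no PyTorch needed."""

    def __init__(self, name, bind_in_init):
        # `name` stands in for "which device this object's weights are on".
        self.name = name
        if bind_in_init:
            # Pattern from the issue: store a bound method as an instance attribute.
            self.forward = self.forward_split_cat

    def forward_split_cat(self):
        return self.name

    def forward(self):
        # Ordinary class-level method: resolved on whichever object calls it.
        return self.forward_split_cat()


def make_replica(block, name):
    # Assumption: mimics a replica created by shallow-copying instance attributes.
    replica = copy.copy(block)
    replica.name = name
    return replica


bound = make_replica(Block("cuda:0", bind_in_init=True), "cuda:1")
plain = make_replica(Block("cuda:0", bind_in_init=False), "cuda:1")

print(bound.forward())  # cuda:0 (the copied bound method still targets the original)
print(plain.forward())  # cuda:1 (the class-level method uses the replica itself)
```

In this sketch the replica with the instance-level assignment still answers with the original object's state, which would correspond to the cuda:0 weights meeting cuda:1 inputs. Dispatching on a stored string inside a regular forward method avoids that.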
Thank you so much!