Multi GPU training

Question

VV1314 opened this issue 2 years ago · 7 comments

PConv seems to work on only one GPU. When I run it on two GPUs, it doesn't work. Can this be resolved?

Answer 1 · 2023-03-30T14:29:04.000Z

Hi, could you be more specific about your question? PConv should work with multiple GPUs, and others have successfully reproduced our results, e.g., the repo.

Answer 2 · 2023-03-31T12:16:39.000Z

This is the error: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)"

Answer 3 · 2023-04-04T08:02:15.000Z

This is a general PyTorch issue and is not specific to PConv. Please make sure your model and data are on the same device. You may refer to the following links for more details:
https://stackoverflow.com/questions/70884007/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least
https://discuss.pytorch.org/t/nn-conv2d-causing-error-in-multi-gpu-learning/142804/2
https://zhuanlan.zhihu.com/p/560322701
https://discuss.pytorch.org/t/error-with-dataparallel/139384
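
As a minimal sketch of keeping the model and data on the same device, the snippet below uses a single nn.Conv2d as a hypothetical stand-in for the real network (it is not the code from this repo) and falls back to the CPU when no GPU is visible:

```python
import torch
import torch.nn as nn

# Pick one primary device; fall back to CPU when no GPU is visible.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in model, used only for illustration.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)

# Wrap for multi-GPU use only when more than one GPU is present.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# The input batch is moved to the same primary device as the parameters.
x = torch.randn(4, 3, 32, 32).to(device)
y = model(x)
print(y.shape, y.device)
```

If the error still appears with this pattern, the cause is probably inside the module itself rather than in how the inputs are placed.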

Answer 4 · 2023-05-30T10:44:39.000Z

Same issue. This may be caused by the internal implementation of DataParallel. To solve it, you can simply rename forward_split_cat to forward and comment out the code in __init__ that assigns self.forward.

Answer 5 · 2023-07-03T08:34:53.000Z

When using nn.DataParallel for single-machine multi-GPU training, the following error is raised: "Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)". The fix is as follows:
Rewrite every place where self.forward is assigned, such as self.forward = self.forward_slicing, self.forward = self.forward_split_cat, self.forward = self.forward_layer_scale..., like this:

import torch
import torch.nn as nn
from torch import Tensor


class Partial_conv3(nn.Module):

    def __init__(self, dim, n_div, forward):
        super().__init__()
        self.dim_conv3 = dim // n_div
        self.dim_untouched = dim - self.dim_conv3
        self.partial_conv3 = nn.Conv2d(self.dim_conv3, self.dim_conv3, 3, 1, 1, bias=False)
        # Store the mode as a string and dispatch inside forward(),
        # instead of assigning a bound method to self.forward.
        self.forward_type = forward
        # Original code, now disabled:
        # if forward == 'slicing':
        #     self.forward = self.forward_slicing
        # elif forward == 'split_cat':
        #     self.forward = self.forward_split_cat
        # else:
        #     raise NotImplementedError

    def forward(self, x: Tensor) -> Tensor:
        if self.forward_type == 'slicing':
            # only for inference
            x = x.clone()   # !!! Keep the original input intact for the residual connection later
            x[:, :self.dim_conv3, :, :] = self.partial_conv3(x[:, :self.dim_conv3, :, :])
        elif self.forward_type == 'split_cat':
            # for training/inference
            x1, x2 = torch.split(x, [self.dim_conv3, self.dim_untouched], dim=1)
            x1 = self.partial_conv3(x1)
            x = torch.cat((x1, x2), 1)
        else:
            # Same behavior as the disabled check in __init__
            raise NotImplementedError

        return x
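
Why assigning self.forward in __init__ can trigger this error can be sketched without any GPU. The snippet assumes that nn.DataParallel builds each replica by shallow-copying the module's attributes, so a bound method stored on the instance keeps pointing at the original module. FakeModule and replicate are hypothetical stand-ins written for illustration, not PyTorch APIs:

```python
import copy


class FakeModule:
    """Plain-Python stand-in for an nn.Module (hypothetical)."""

    def __init__(self, device, bind_in_init):
        self.device = device  # pretend the weights live here
        if bind_in_init:
            # Same pattern as `self.forward = self.forward_split_cat`
            self.forward = self.forward_split_cat

    def forward_split_cat(self):
        # Report which object's "weights" are actually used.
        return self.device

    def forward(self):
        return self.forward_split_cat()


def replicate(module, device):
    # Assumption: replication shallow-copies attributes, then moves weights.
    replica = copy.copy(module)
    replica.device = device
    return replica


broken = replicate(FakeModule("cuda:0", bind_in_init=True), "cuda:1")
fixed = replicate(FakeModule("cuda:0", bind_in_init=False), "cuda:1")

print(broken.forward())  # cuda:0, the copied bound method still targets the original
print(fixed.forward())   # cuda:1, the class-level forward uses the replica itself
```

Under that assumption, the replica on cuda:1 would receive inputs on cuda:1 but run the convolution with the original module's weights on cuda:0, which matches the reported error. Dispatching inside a regular forward method avoids storing a bound method on the instance.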

Answer 6 · 2023-08-11T05:36:35.000Z

Thank you so much!

Answer 7 · 2023-08-11T05:37:01.000Z

Thank you, I have received your message.