JierunChen/FasterNet

Multi GPU training

VV1314 opened this issue · 7 comments

PConv seems to work only on a single GPU. When I run it on two GPUs, it doesn't work. Can this be resolved?

Hi, could you be more specific about your question? PConv should work with multiple GPUs, and several works have successfully reproduced our results, e.g., the repo.

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" this is the error

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" this is the error

This is a general issue, not specific to PConv. Please make sure your model and data are on the same device. You may refer to the following links for more details:
https://stackoverflow.com/questions/70884007/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least
https://discuss.pytorch.org/t/nn-conv2d-causing-error-in-multi-gpu-learning/142804/2
https://zhuanlan.zhihu.com/p/560322701
https://discuss.pytorch.org/t/error-with-dataparallel/139384
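
For example, a minimal sketch of the general fix (the model here is just a placeholder, not FasterNet) that keeps the parameters and the input batch on the same primary device before wrapping with nn.DataParallel:

import torch
import torch.nn as nn

# Placeholder model; replace with any nn.Module
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

device = torch.device('cuda:0')
model = model.to(device)                            # parameters live on cuda:0
model = nn.DataParallel(model, device_ids=[0, 1])   # replicas created on cuda:0 and cuda:1

x = torch.randn(8, 3, 224, 224, device=device)      # inputs on the primary device
y = model(x)                                         # DataParallel scatters the batch across GPUs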

Same issue here. It is likely caused by how DataParallel replicates the model: because self.forward is rebound in __init__ to a bound method, the replicas on other GPUs still call the method bound to the original module, whose weights live on cuda:0. To solve it, simply rename forward_split_cat to forward and comment out the corresponding lines in __init__.

When using nn.DataParallel for single-machine multi-GPU training, the error "Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" is raised. The fix is as follows:
Rewrite every place where self.forward is assigned, such as self.forward = self.forward_slicing, self.forward = self.forward_split_cat, self.forward = self.forward_layer_scale, etc., like this:

import torch
import torch.nn as nn
from torch import Tensor


class Partial_conv3(nn.Module):

    def __init__(self, dim, n_div, forward):
        super().__init__()
        self.dim_conv3 = dim // n_div
        self.dim_untouched = dim - self.dim_conv3
        self.partial_conv3 = nn.Conv2d(self.dim_conv3, self.dim_conv3, 3, 1, 1, bias=False)
        # Store the forward type as a string instead of rebinding self.forward,
        # so that DataParallel replicas call their own copy of the parameters
        self.forward_type = forward
        # if forward == 'slicing':
        #     self.forward = self.forward_slicing
        # elif forward == 'split_cat':
        #     self.forward = self.forward_split_cat
        # else:
        #     raise NotImplementedError

    def forward(self, x: Tensor) -> Tensor:
        if self.forward_type == 'slicing':
            # only for inference
            x = x.clone()   # !!! Keep the original input intact for the residual connection later
            x[:, :self.dim_conv3, :, :] = self.partial_conv3(x[:, :self.dim_conv3, :, :])
        elif self.forward_type == 'split_cat':
            # for training/inference
            x1, x2 = torch.split(x, [self.dim_conv3, self.dim_untouched], dim=1)
            x1 = self.partial_conv3(x1)
            x = torch.cat((x1, x2), 1)
        else:
            raise NotImplementedError
        return x
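
For reference, a quick sanity check (assuming at least two visible GPUs; the tensor shape is arbitrary) that the rewritten Partial_conv3 now runs under nn.DataParallel:

# Wrap the module above and move its parameters to cuda:0
model = nn.DataParallel(Partial_conv3(dim=64, n_div=4, forward='split_cat'), device_ids=[0, 1]).cuda()
x = torch.randn(8, 64, 56, 56, device='cuda:0')
out = model(x)        # the batch is scattered across cuda:0 and cuda:1
print(out.shape)      # torch.Size([8, 64, 56, 56])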
xibici commented

Thank you so much!