gungui98/Pytorch-Depthwise-Conv3d

Different speed of 3D depthwise conv between torch 1.9 and your version

Andy1621 opened this issue · 1 comment

Thanks for your help!

Have you compared the speed of 3D depthwise convolution in torch 1.9 against your version?

I use SlowFast and tested my model with 3D depthwise convolution.
In my experiments, torch 1.9 seems to be about twice as fast as your version, which is a little strange...
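One way to check a claim like this outside a full model is to time a single depthwise layer in isolation. The sketch below is an illustration and not code from this thread: `time_forward`, the channel count, and the input shape are hypothetical choices. It runs a few warm-up calls first and synchronizes before reading the clock, because CUDA kernels launch asynchronously. It times only PyTorch's own `Conv3d` with `groups` equal to the channel count, and falls back to the CPU when no GPU is available.

```python
import time

import torch
from torch.nn import Conv3d


def time_forward(module, x, iters=50, warmup=10):
    """Return the mean wall-clock time of one forward call, in milliseconds."""
    use_cuda = x.is_cuda
    with torch.no_grad():
        # Warm-up calls keep one-time setup costs out of the measurement.
        for _ in range(warmup):
            module(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait for queued kernels before starting the clock
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait for the timed kernels to finish
    return (time.perf_counter() - start) * 1000.0 / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
channels = 8  # hypothetical size, chosen only for illustration

# Depthwise convolution: one group per channel.
native_dw = Conv3d(channels, channels, kernel_size=3, padding=1, groups=channels).to(device)
x = torch.randn(2, channels, 8, 16, 16, device=device)

native_ms = time_forward(native_dw, x)
print(f"native depthwise Conv3d: {native_ms:.3f} ms/call on {device}")
```

Passing a `DepthwiseConv3d` module and a matching input to the same helper, on the same device, would give a like-for-like number for this repo's implementation.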

I can't reproduce the problem. I tested the module in Colab:

import torch
from depthwise_conv3d import DepthwiseConv3d
from torch.profiler import profile, record_function, ProfilerActivity
from torch.nn import Conv3d  # not used in this snippet (presumably for the PyTorch 1.9 baseline run)


dtype = torch.float
# Depthwise 3D convolution: 2 channels in, 2 channels out, one group per channel.
conv = DepthwiseConv3d(2, 2, kernel_size=3, groups=2).to("cuda", dtype)
input = torch.randn(2, 2, 6, 6, 6, device="cuda", dtype=dtype).div_(2).requires_grad_()
output = conv(input)  # single call before profiling starts

# Profile 100 forward passes, recording both CPU and CUDA activity.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("inference"):
        for _ in range(100):
            output = conv(input)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

DepthwiseConv3d (this repo)

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                              inference         0.06%       1.153ms       100.00%        1.989s        1.989s       0.000us         0.00%     664.000us     664.000us             1  
                                DepthwiseConv3dFunction         0.18%       3.546ms        99.94%        1.988s      19.878ms     664.000us       100.00%     664.000us       6.640us           100  
                                       cudaLaunchKernel        99.70%        1.983s        99.70%        1.983s      19.831ms       0.000us         0.00%       0.000us       0.000us           100  
                                            aten::empty         0.06%       1.151ms         0.06%       1.151ms      11.284us       0.000us         0.00%       0.000us       0.000us           102  
                                            aten::zeros         0.00%      36.000us         0.00%      48.000us      48.000us       0.000us         0.00%       0.000us       0.000us             1  
                                  cudaDeviceSynchronize         0.00%      14.000us         0.00%      14.000us      14.000us       0.000us         0.00%       0.000us       0.000us             1  
                                            aten::zero_         0.00%       4.000us         0.00%       4.000us       4.000us       0.000us         0.00%       0.000us       0.000us             1  
void conv_depthwise3d_cuda_kernel<float, float, 3, 3...         0.00%       0.000us         0.00%       0.000us       0.000us     664.000us       100.00%     664.000us       6.640us           100  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.989s
Self CUDA time total: 664.000us

PyTorch 1.9

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                              inference         0.08%       1.684ms       100.00%        2.008s        2.008s       0.000us         0.00%     662.000us     662.000us             1  
                                           aten::conv3d         0.02%     388.000us        99.91%        2.006s      20.062ms       0.000us         0.00%     662.000us       6.620us           100  
                                      aten::convolution         0.02%     416.000us        99.89%        2.006s      20.058ms       0.000us         0.00%     662.000us       6.620us           100  
                                     aten::_convolution         0.06%       1.296ms        99.87%        2.005s      20.054ms       0.000us         0.00%     662.000us       6.620us           100  
                                 aten::conv_depthwise3d         0.07%       1.311ms        99.81%        2.004s      20.041ms     662.000us       100.00%     662.000us       6.620us           100  
                                       cudaLaunchKernel        99.72%        2.002s        99.72%        2.002s      20.023ms       0.000us         0.00%       0.000us       0.000us           100  
                                            aten::empty         0.02%     486.000us         0.02%     486.000us       4.765us       0.000us         0.00%       0.000us       0.000us           102  
                                            aten::zeros         0.00%      33.000us         0.00%      49.000us      49.000us       0.000us         0.00%       0.000us       0.000us             1  
                                  cudaDeviceSynchronize         0.00%      11.000us         0.00%      11.000us      11.000us       0.000us         0.00%       0.000us       0.000us             1  
                                            aten::zero_         0.00%       3.000us         0.00%       3.000us       3.000us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.008s
Self CUDA time total: 662.000us
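
To put the two tables side by side, the short calculation below takes the totals reported above (100 forward calls in each run) and derives per-call CUDA time and the ratio between the two implementations. The dictionary keys and variable names are illustrative labels, not identifiers from the profiler output.

```python
# Totals copied from the two profiler tables (100 forward calls each).
calls = 100
cuda_total_us = {"DepthwiseConv3d": 664.0, "PyTorch 1.9 conv3d": 662.0}
cpu_total_s = {"DepthwiseConv3d": 1.989, "PyTorch 1.9 conv3d": 2.008}

# Average CUDA time per forward call, in microseconds.
per_call_us = {name: total / calls for name, total in cuda_total_us.items()}

# Ratios of this repo's implementation to the PyTorch 1.9 one.
cuda_ratio = cuda_total_us["DepthwiseConv3d"] / cuda_total_us["PyTorch 1.9 conv3d"]
cpu_ratio = cpu_total_s["DepthwiseConv3d"] / cpu_total_s["PyTorch 1.9 conv3d"]

for name, value in per_call_us.items():
    print(f"{name}: {value:.2f} us CUDA time per call")
print(f"CUDA time ratio: {cuda_ratio:.3f}")
print(f"CPU time ratio:  {cpu_ratio:.3f}")
```

On these numbers the two kernels differ by about 0.3% in CUDA time, and in both runs nearly all of the CPU time is attributed to `cudaLaunchKernel`, so this benchmark shows no sign of a twofold gap.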