Different speed of 3D depthwise conv between torch 1.9 and your version
Andy1621 opened this issue · 1 comments
Andy1621 commented
Thanks for your help!
Have you compared the speed of 3D depthwise convolution in torch 1.9 against your version?
I use SlowFast and tested my model with 3D depthwise convolution.
In my experiments, torch 1.9 seems to be about twice as fast as your version, which is a little strange...
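A comparison like this depends on how the timing is taken, because CUDA kernels are launched asynchronously and the first few calls carry one-time setup cost. Below is a minimal harness sketch, not the method used in this thread: `benchmark`, `fn`, and `sync` are hypothetical names introduced here, and the demo workload is a CPU-only stand-in so the snippet runs anywhere.

```python
import statistics
import time


def benchmark(fn, sync=lambda: None, warmup=10, iters=100):
    """Return the median wall-clock time of fn() in milliseconds.

    fn   -- zero-argument callable that runs the workload once
    sync -- zero-argument callable that blocks until the device is idle
            (a no-op by default, which is only correct for CPU work)
    """
    # Warm-up runs are executed but not timed.
    for _ in range(warmup):
        fn()
    sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for asynchronous work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with the forward pass under test.
    ms = benchmark(lambda: sum(range(10_000)), warmup=3, iters=20)
    print(f"median: {ms:.4f} ms")
```

On a GPU, one would pass `sync=torch.cuda.synchronize` and something like `fn=lambda: conv(x)` for each module, using the same input shape and dtype for both.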
gungui98 commented
I can't reproduce the problem. I tested the module in Colab:
import torch
from depthwise_conv3d import DepthwiseConv3d
from torch.profiler import profile, record_function, ProfilerActivity
from torch.nn import Conv3d
import time

dtype = torch.float
conv = DepthwiseConv3d(2, 2, kernel_size=3, groups=2).to("cuda", dtype)
input = torch.randn(2, 2, 6, 6, 6, device="cuda", dtype=dtype).div_(2).requires_grad_()
output = conv(input)  # one call before profiling starts
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("inference"):
        for i in range(100):
            output = conv(input)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
DW 3D (DepthwiseConv3d)
-------------------------------------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
Name                                                        Self CPU %       Self CPU    CPU total %      CPU total   CPU time avg      Self CUDA    Self CUDA %     CUDA total  CUDA time avg     # of Calls
-------------------------------------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
inference                                                        0.06%        1.153ms        100.00%         1.989s         1.989s        0.000us          0.00%      664.000us      664.000us              1
DepthwiseConv3dFunction                                          0.18%        3.546ms         99.94%         1.988s       19.878ms      664.000us        100.00%      664.000us        6.640us            100
cudaLaunchKernel                                                99.70%         1.983s         99.70%         1.983s       19.831ms        0.000us          0.00%        0.000us        0.000us            100
aten::empty                                                      0.06%        1.151ms          0.06%        1.151ms       11.284us        0.000us          0.00%        0.000us        0.000us            102
aten::zeros                                                      0.00%       36.000us          0.00%       48.000us       48.000us        0.000us          0.00%        0.000us        0.000us              1
cudaDeviceSynchronize                                            0.00%       14.000us          0.00%       14.000us       14.000us        0.000us          0.00%        0.000us        0.000us              1
aten::zero_                                                      0.00%        4.000us          0.00%        4.000us        4.000us        0.000us          0.00%        0.000us        0.000us              1
void conv_depthwise3d_cuda_kernel<float, float, 3, 3...          0.00%        0.000us          0.00%        0.000us        0.000us      664.000us        100.00%      664.000us        6.640us            100
-------------------------------------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
Self CPU time total: 1.989s
Self CUDA time total: 664.000us
PyTorch 1.9
-------------------------------------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
Name                                                        Self CPU %       Self CPU    CPU total %      CPU total   CPU time avg      Self CUDA    Self CUDA %     CUDA total  CUDA time avg     # of Calls
-------------------------------------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
inference                                                        0.08%        1.684ms        100.00%         2.008s         2.008s        0.000us          0.00%      662.000us      662.000us              1
aten::conv3d                                                     0.02%      388.000us         99.91%         2.006s       20.062ms        0.000us          0.00%      662.000us        6.620us            100
aten::convolution                                                0.02%      416.000us         99.89%         2.006s       20.058ms        0.000us          0.00%      662.000us        6.620us            100
aten::_convolution                                               0.06%        1.296ms         99.87%         2.005s       20.054ms        0.000us          0.00%      662.000us        6.620us            100
aten::conv_depthwise3d                                           0.07%        1.311ms         99.81%         2.004s       20.041ms      662.000us        100.00%      662.000us        6.620us            100
cudaLaunchKernel                                                99.72%         2.002s         99.72%         2.002s       20.023ms        0.000us          0.00%        0.000us        0.000us            100
aten::empty                                                      0.02%      486.000us          0.02%      486.000us        4.765us        0.000us          0.00%        0.000us        0.000us            102
aten::zeros                                                      0.00%       33.000us          0.00%       49.000us       49.000us        0.000us          0.00%        0.000us        0.000us              1
cudaDeviceSynchronize                                            0.00%       11.000us          0.00%       11.000us       11.000us        0.000us          0.00%        0.000us        0.000us              1
aten::zero_                                                      0.00%        3.000us          0.00%        3.000us        3.000us        0.000us          0.00%        0.000us        0.000us              1
-------------------------------------------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
Self CPU time total: 2.008s
Self CUDA time total: 662.000us
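To put the two runs side by side, the totals reported above can be reduced to per-call times and ratios. The short sketch below only does arithmetic on numbers copied from the two tables; the dictionary keys and variable names are labels chosen here for illustration.

```python
# Totals copied from the two profiler tables (100 profiled forward calls each).
CALLS = 100
cuda_total_us = {"DepthwiseConv3d": 664.0, "PyTorch 1.9": 662.0}
cpu_total_s = {"DepthwiseConv3d": 1.989, "PyTorch 1.9": 2.008}

# Average CUDA time per forward call, in microseconds.
cuda_per_call_us = {name: total / CALLS for name, total in cuda_total_us.items()}

# Ratios of this module to the PyTorch 1.9 baseline (1.0 means identical).
cuda_ratio = cuda_total_us["DepthwiseConv3d"] / cuda_total_us["PyTorch 1.9"]
cpu_ratio = cpu_total_s["DepthwiseConv3d"] / cpu_total_s["PyTorch 1.9"]

for name, per_call in cuda_per_call_us.items():
    print(f"{name}: {per_call:.2f} us CUDA time per call")
print(f"CUDA time ratio: {cuda_ratio:.4f}")
print(f"CPU time ratio:  {cpu_ratio:.4f}")
```

Both ratios land within about 1% of 1.0, so for this small input the two implementations are effectively tied, which is why the reported 2x gap does not show up here.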