Absence of an ablation study on the kernel size of the convolution layer for aggregating the static context
songkq opened this issue · 5 comments
The 3×3 conv was adopted as the default setting in the paper. How does the kernel size influence the overall performance of CoTNet? And if we applied a dilated convolution or deformable convolution to enlarge the receptive field and obtain a more powerful context representation, how would the performance change?
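For concreteness, a rough sketch of the kind of variant I have in mind, swapping a dilated conv into the static-context branch (the dim and groups values here are placeholders I picked for illustration, not necessarily the CoTNet settings):

import torch.nn as nn

dim, kernel_size, dilation = 64, 3, 2
# Padding chosen so the spatial size is preserved; dilation=2 gives a 3x3 kernel
# an effective 5x5 receptive field without the 5x5 parameter/FLOP cost.
padding = dilation * (kernel_size - 1) // 2

key_embed_dilated = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size, stride=1, padding=padding,
              dilation=dilation, groups=4, bias=False),
    nn.BatchNorm2d(dim),
    nn.ReLU(inplace=True)
)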
We experimented with a larger kernel size (e.g., 5×5 conv): the top-1 accuracy of CoTNet-50 increases from 79.2% to 79.3%, while the FLOPs increase by 31.9%. We therefore chose the 3×3 conv, which offers a better cost-accuracy trade-off.
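For a back-of-the-envelope sense of where that FLOPs increase comes from, here is a minimal sketch counting the multiply-accumulates of just this conv (the 512-channel width and 14×14 feature map are assumptions for illustration, not values from the paper):

def conv_macs(c_in, c_out, k, h, w, groups=1):
    # Multiply-accumulates of a k x k conv with 'same' padding on an h x w feature map.
    return (c_in // groups) * c_out * k * k * h * w

# Assumed example layer: 512 channels, 14x14 feature map, groups=4.
macs_3x3 = conv_macs(512, 512, 3, 14, 14, groups=4)
macs_5x5 = conv_macs(512, 512, 5, 14, 14, groups=4)
print(macs_3x3, macs_5x5, macs_5x5 / macs_3x3)  # the 5x5 conv costs ~2.8x per layer

Per layer the 5×5 kernel costs (5·5)/(3·3) ≈ 2.8× as much as the 3×3, so the whole-network FLOPs grow noticeably even though the rest of the block is unchanged.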
Hi @YehLi, thanks for your reply.
self.key_embed = nn.Sequential(
    nn.Conv2d(dim, dim, self.kernel_size, stride=1, padding=self.kernel_size//2, groups=4, bias=False),
    nn.BatchNorm2d(dim),
    nn.ReLU(inplace=True)
)
Why did you choose a group convolution for aggregating the static context here? And how does the number of groups influence the performance?
We use group convolution to reduce the FLOPs and #params; groups=4 is a good speed-accuracy trade-off. Standard convolution (groups=1) achieves slightly better results, but the FLOPs and #params are much larger. Too many groups (e.g., groups=32 or depthwise convolution) drops performance slightly without an obvious speedup. You can refer to ShuffleNet V2 for more discussion of group convolution.
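To make the trade-off concrete, a small sketch (the 512-channel width is an assumption for illustration, not a value taken from the code) counting the parameters of a 3×3 conv at different group settings; the FLOPs scale the same way per spatial position:

import torch.nn as nn

def conv_params(groups, dim=512, k=3):
    # Parameter count of a k x k, dim -> dim conv with the given number of groups.
    conv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=groups, bias=False)
    return sum(p.numel() for p in conv.parameters())

for g in (1, 4, 32, 512):  # standard conv, CoTNet default, many groups, depthwise
    print(f"groups={g:3d}  params={conv_params(g):,}")
# groups=1: 2,359,296   groups=4: 589,824   groups=32: 73,728   groups=512 (depthwise): 4,608

The drop from groups=4 to depthwise looks large on paper, but as ShuffleNet V2 discusses, memory-access cost means heavily grouped convolutions rarely translate into a matching wall-clock speedup.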
Thanks @YehLi. Did you extend this work to other downstream tasks, e.g., text line recognition?
We will experiment with more downstream tasks later.