Absence of an ablation study on the kernel size of the convolution layer for aggregating the static context
songkq opened this issue · 5 comments
The 3×3 conv was adopted as the default setting in the paper. How does the kernel size influence the overall performance of CoTNet? And if we applied a dilated convolution or deformable convolution to enlarge the receptive field and obtain a more powerful context representation, how would the performance change?
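For concreteness, a rough sketch of the kind of variant I have in mind, swapping a dilated conv into the static-context branch (the dim and groups values here are placeholders I picked for illustration, not necessarily the CoTNet settings):

import torch.nn as nn

dim, kernel_size, dilation = 64, 3, 2
# Padding chosen so the spatial size is preserved; dilation=2 gives a 3x3 kernel
# an effective 5x5 receptive field without the 5x5 parameter/FLOP cost.
padding = dilation * (kernel_size - 1) // 2

key_embed_dilated = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size, stride=1, padding=padding,
              dilation=dilation, groups=4, bias=False),
    nn.BatchNorm2d(dim),
    nn.ReLU(inplace=True)
)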
We experimented with a larger kernel size (e.g., 5×5 conv): the top-1 accuracy of CoTNet-50 increases from 79.2% to 79.3%, while the FLOPs increase by 31.9%. We therefore chose the 3×3 conv, which offers a better cost-accuracy trade-off.
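For a back-of-the-envelope sense of where that FLOPs increase comes from, here is a minimal sketch counting the multiply-accumulates of just this conv (the 512-channel width and 14×14 feature map are assumptions for illustration, not values from the paper):

def conv_macs(c_in, c_out, k, h, w, groups=1):
    # Multiply-accumulates of a k x k conv with 'same' padding on an h x w feature map.
    return (c_in // groups) * c_out * k * k * h * w

# Assumed example layer: 512 channels, 14x14 feature map, groups=4.
macs_3x3 = conv_macs(512, 512, 3, 14, 14, groups=4)
macs_5x5 = conv_macs(512, 512, 5, 14, 14, groups=4)
print(macs_3x3, macs_5x5, macs_5x5 / macs_3x3)  # the 5x5 conv costs ~2.8x per layer

Per layer the 5×5 kernel costs (5·5)/(3·3) ≈ 2.8× as much as the 3×3, so the whole-network FLOPs grow noticeably even though the rest of the block is unchanged.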
Hi @YehLi, thanks for your reply.
self.key_embed = nn.Sequential(
    nn.Conv2d(dim, dim, self.kernel_size, stride=1, padding=self.kernel_size//2, groups=4, bias=False),
    nn.BatchNorm2d(dim),
    nn.ReLU(inplace=True)
)
Why did you choose a group convolution for aggregating the static context here? And how does the number of groups influence the performance?
We use group convolution to reduce the FLOPs and #params; groups=4 is a good speed-accuracy trade-off. Standard convolution (groups=1) achieves slightly better results, but the FLOPs and #params are much larger. Too many groups (e.g., groups=32 or depthwise convolution) drops performance slightly without an obvious speedup. You can refer to ShuffleNet V2 for more discussion of group convolution.
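To make the trade-off concrete, a small sketch (the 512-channel width is an assumption for illustration, not a value taken from the code) counting the parameters of a 3×3 conv at different group settings; the FLOPs scale the same way per spatial position:

import torch.nn as nn

def conv_params(groups, dim=512, k=3):
    # Parameter count of a k x k, dim -> dim conv with the given number of groups.
    conv = nn.Conv2d(dim, dim, k, padding=k // 2, groups=groups, bias=False)
    return sum(p.numel() for p in conv.parameters())

for g in (1, 4, 32, 512):  # standard conv, CoTNet default, many groups, depthwise
    print(f"groups={g:3d}  params={conv_params(g):,}")
# groups=1: 2,359,296   groups=4: 589,824   groups=32: 73,728   groups=512 (depthwise): 4,608

The drop from groups=4 to depthwise looks large on paper, but as ShuffleNet V2 discusses, memory-access cost means heavily grouped convolutions rarely translate into a matching wall-clock speedup.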
Thanks @YehLi. Did you extend this work to other downstream tasks, e.g., text line recognition?
We will experiment with more downstream tasks later.