JDAI-CV/CoTNet

Does the block described in this paper use involution?

JasonLeeFdu opened this issue · 8 comments

Do Equation 5 in the paper and the LocalConvolution / aggregation_zeropad methods in the code borrow from involution? If so, what is the difference?

Also, I'd like to ask: CotLayer in the code is the contribution of this paper, so what is CoXtLayer for?

The questions can be answered in English.

YehLi commented

Involution shares a similar spirit with the paper "Pay Less Attention with Lightweight and Dynamic Convolutions".

There are two main differences between CoTNet and Involution:

1. CoTNet mines the static context among keys via a 3×3 convolution.
2. CoTNet performs self-attention based on the query and the contextualized key, whereas Involution directly generates the kernel with a 1×1 convolution.
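
A minimal PyTorch sketch of that contrast (the module and variable names here are illustrative only, not the ones used in this repo; the actual CotLayer additionally uses grouped convolutions, normalization, and a fusion step):

```python
import torch
import torch.nn as nn


class InvolutionStyleKernel(nn.Module):
    """Involution-style: the k*k dynamic kernel is generated directly
    from the input feature with 1x1 convolutions."""
    def __init__(self, dim, kernel_size=3, reduction=4):
        super().__init__()
        self.gen = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, kernel_size * kernel_size, kernel_size=1),
        )

    def forward(self, x):            # x: (B, C, H, W)
        return self.gen(x)           # (B, k*k, H, W) dynamic kernel


class CoTStyleAttention(nn.Module):
    """CoT-style: a k x k convolution first mines the static context among
    the keys; the attention map is then predicted from the concatenation
    of the query and the contextualized key."""
    def __init__(self, dim, kernel_size=3, reduction=4):
        super().__init__()
        self.key_embed = nn.Sequential(           # static context among keys
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.Sequential(                # query + key context -> attention
            nn.Conv2d(2 * dim, dim // reduction, kernel_size=1, bias=False),
            nn.BatchNorm2d(dim // reduction),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, kernel_size * kernel_size, kernel_size=1),
        )

    def forward(self, x):                               # x acts as both query and key
        k_ctx = self.key_embed(x)                       # contextualized key
        return self.attn(torch.cat([x, k_ctx], dim=1))  # (B, k*k, H, W) attention map
```

The key point is that the CoT-style attention map is conditioned on both the query and the 3×3-contextualized key, whereas the involution-style kernel is predicted from the input feature alone.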

CoXtLayer is similar to CotLayer, but with a higher dimension and two groups.

I am confused about the formulation mentioned in the paper:
the output channels of the second 1×1 convolution are defined as k×k×Ch.
In the paper, you explained that Ch is the number of heads and k×k is the local grid in space.
Can I understand it like this: in a transformer block we usually define a hyper-parameter for the number of heads (Ch), and then we reshape the output channels into (Ch, k×k)?
Another question: you used LocalConvolution, and I do not know why.
Can you explain? Thank you.

YehLi commented

In the CoT block, we reshape the output channels into (Ch, k×k).

LocalConvolution is used for aggregating all values within each k × k grid with the learnt local attention matrix in equation 3. Section 3.4 discusses the connections between self-attention and dynamic region-aware convolution.
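
To make that aggregation step concrete: the repo implements it with a custom CUDA op (LocalConvolution / aggregation_zeropad), but its effect can be approximated in pure PyTorch with F.unfold. The sketch below is only an illustration of that idea, with my own function and argument names, and it assumes the attention logits are the output of the second 1×1 convolution whose k×k×Ch channels are reshaped into (Ch, k×k):

```python
import torch
import torch.nn.functional as F


def local_aggregation(values, attn_logits, num_heads, kernel_size=3):
    """Weighted sum of the values inside each k x k grid (zero padding),
    with one k x k attention map per head shared across that head's channels.

    values:      (B, C, H, W)
    attn_logits: (B, num_heads * k*k, H, W) -- output of the second 1x1 conv,
                 reshaped below into (num_heads, k*k)
    returns:     (B, C, H, W)
    """
    B, C, H, W = values.shape
    k2 = kernel_size * kernel_size
    pad = kernel_size // 2

    # reshape channels into (heads, k*k) and normalize over the k*k grid
    attn = attn_logits.view(B, num_heads, k2, H, W).softmax(dim=2)

    # gather the k x k neighborhood of every position: (B, C * k*k, H*W)
    v = F.unfold(values, kernel_size, padding=pad)
    v = v.view(B, num_heads, C // num_heads, k2, H, W)

    # weighted sum over the k*k grid
    out = (attn.unsqueeze(2) * v).sum(dim=3)
    return out.reshape(B, C, H, W)


# e.g. values of shape (2, 64, 32, 32), 4 heads, 3x3 grid:
# out = local_aggregation(values, attn_logits, num_heads=4)  # attn_logits: (2, 36, 32, 32)
```

Whether the softmax and any further normalization happen before or inside the CUDA op may differ from this sketch; the point is simply that each output position is a dynamic, attention-weighted combination of the values in its local k×k grid.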

Hello, could you explain LocalConvolution in more detail?

OK, thanks, I understand now.