About the subtraction in pooling
Hi, thank you for publishing such a nice paper. I just have one question. I do not understand the subtraction of the input in Eq. (4). Is it necessary? What happens if we just do the average pooling without subtracting the input?
Hi @Dong-Huo ,
As shown in the paper, since the MetaFormer block already has a residual connection, the subtraction of the input itself is added in Equation (4). Experimentally, average pooling without the subtraction still works, but it is slightly worse than with the subtraction.
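For concreteness, here is a minimal sketch of how the subtraction from Eq. (4) sits next to the block's residual connection (pre-norm layout assumed; the norm choice and function name are placeholders for illustration, not the released code):

```python
import torch
import torch.nn.functional as F

# Sketch of the token-mixing step of a PoolFormer block (pre-norm layout;
# the norm used here is a placeholder for illustration).
def pool_block_step(x):                                    # x: (B, C, H, W)
    y = F.group_norm(x, 1)                                 # Norm(x)
    mixed = F.avg_pool2d(y, 3, stride=1, padding=1) - y    # Eq. (4): AvgPool(y) - y
    # The residual adds back the pre-norm x, so "- y" and "+ x" do not cancel.
    return x + mixed

x = torch.randn(2, 64, 14, 14)
print(pool_block_step(x).shape)   # torch.Size([2, 64, 14, 14])
```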
@yuweihao Have you tried removing the residual connection for the token mixer? Currently you subtract the "normed" x (basically `y = x + pooling(norm(x)) - norm(x)`), which seems weird.
Hi @yangcf10 ,
It is not elegant to remove the residual connection in the block just for the pooling token mixer. It is better to keep the residual connection regardless of the token mixer, so that we can freely swap token mixers in MetaFormer.
Instead, I have tried removing the subtraction, i.e., replacing `return self.pool(x) - x` with `return self.pool(x)`, in my preliminary experiments. `return self.pool(x)` also works well, with a slight performance decrease compared to `return self.pool(x) - x`.
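For reference, a sketch of the pooling token mixer with the two variants being compared (the `subtract` flag is added here for illustration; the pooling settings are assumptions, not necessarily the released configuration):

```python
import torch.nn as nn

class Pooling(nn.Module):
    """Pooling token mixer; set subtract=False for the plain-pooling variant."""
    def __init__(self, pool_size=3, subtract=True):
        super().__init__()
        self.subtract = subtract
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):  # x: (B, C, H, W)
        out = self.pool(x)
        return out - x if self.subtract else out
```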
Thanks for the prompt reply! I understand it's mostly from empirical results. But any insight into why we should do the subtraction? The explanation "since the MetaFormer block already has a residual connection, we should add the subtraction" does not seem convincing. If we treat the token mixer as an abstracted module, then we shouldn't consider the residual connection when designing it.
Hi @yangcf10 ,
Thank you for your feedback and suggestion. We will attempt to further improve the explanation "since the MetaFormer block already has a residual connection, subtraction of the input itself is added in Equation (4)".
Why don't we just remove the residual connection and the subtraction then? It would save compute and memory.
What I'm more concerned about is that the subtraction and the residual connection don't use the same "x", so they don't cancel each other out. Indeed, the residual connection uses the pre-norm x while the subtraction uses the post-norm x.
This changes the semantics to something along the lines of a block that emphasizes spatial gradients.
What do you think? Does it work as well without the residual connection and the subtraction?
Okay, I saw your other comments about using a DW conv instead of pooling. I understand that PoolFormer is not what your paper is about; it is about MetaFormer, and PoolFormer is indeed just a demonstration. Also, the fact that a DW conv brings similar or superior performance shows that there is nothing special about this pooling layer, let alone the subtraction. This is missing the forest for the trees.
Hi @Vermeille ,
Many thanks for your attention to this work and your insightful comment. Yes, the goal of this work is to demonstrate that the competence of transformer-like models primarily stems from the general architecture MetaFormer. Pooling/PoolFormer are just tools to demonstrate MetaFormer. If PoolFormer is considered as a practical model, then, as you suggest, it can be further improved in terms of implementation efficiency and other aspects.
Is there some relation between this pooling operation and graph convolutional networks? Because graphs have no regular structure, GCNs are essentially some kind of pooling followed by an MLP, which seems a lot like PoolFormer, though MetaFormer still has an image pyramid, which isn't present in graphs.
Hi @saulzar , pooling is a basic operator in deep learning. Transformer or MetaFormer can be regarded as a type of Graph Neural Networks [1]. From this perspective, attention or pooling in MetaFormer can be regarded as a type of graph attention or graph pooling, respectively.
[1] https://graphdeeplearning.github.io/post/transformers-are-gnns/
> I understand it's mostly from empirical results. But any insight into why we should do the subtraction?
Average pooling combined with the subtraction yields a (1/9-scaled) [Laplacian kernel](https://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm)
[ 1  1  1
  1 -8  1
  1  1  1 ]
which is a classical kernel in image processing. The Laplacian kernel computes the spatial gradient. So the token mixer of PoolFormer is actually `x <- x + alpha*Laplacian(x)`.
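A quick numerical sketch of this equivalence (it assumes `count_include_pad=True` so the pooled borders match the zero-padded convolution; the released implementation's border handling may differ):

```python
import torch
import torch.nn.functional as F

# Sketch: check that (3x3 average pooling) - x equals a convolution with the
# Laplacian kernel scaled by 1/9.
x = torch.randn(1, 1, 8, 8)

pooled = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1,
                      count_include_pad=True)
pool_minus_x = pooled - x

laplacian = torch.tensor([[1., 1., 1.],
                          [1., -8., 1.],
                          [1., 1., 1.]]).view(1, 1, 3, 3) / 9.0
conv_out = F.conv2d(x, laplacian, padding=1)

print(torch.allclose(pool_minus_x, conv_out, atol=1e-6))  # True
```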
Hi @chuong98 ,
Yes, it can be regarded as a fixed kernel from image processing (vs. the learnable kernels of traditional CNNs). For each token, Laplacian(x) aggregates nearby-token information relative to the token itself, while the residual connection retains the token's own information. The alpha in Normalization or LayerScale can then balance nearby information against the token's own information. Without the subtraction, since the MetaFormer block already has a residual connection, alpha instead ends up balancing [nearby information + own information] against [own information], which looks weird. This may be why the performance with subtraction is slightly better than without it.
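To make the role of alpha concrete, here is a sketch of LayerScale applied on top of the pooling token mixer (the norm choice, the per-channel scale, and its init value are illustrative assumptions, not the exact released code):

```python
import torch
import torch.nn as nn

class LayerScalePoolMix(nn.Module):
    """Sketch: x + alpha * (AvgPool(Norm(x)) - Norm(x)), with learnable per-channel alpha."""
    def __init__(self, dim, pool_size=3, init_value=1e-5):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)  # channel-wise norm placeholder
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.alpha = nn.Parameter(init_value * torch.ones(dim))  # LayerScale

    def forward(self, x):  # x: (B, C, H, W)
        y = self.norm(x)
        nearby_minus_own = self.pool(y) - y   # with subtraction: "nearby - own" information
        # alpha balances nearby information against the token's own information,
        # which is carried by the residual connection.
        return x + self.alpha.view(1, -1, 1, 1) * nearby_minus_own
```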
Thanks for your continued attention to our work. Happy new year in advance :)
For anyone wondering, I got the following results on ImageNet-100:
- vanilla PoolFormer (`return self.pool(x) - x`): 87.64
- simple pooling (`return self.pool(x)`): 87.56
- DWConv (`return dwconv(x)`): 88.10
That is really helpful, thanks @DonkeyShot21. @yuweihao, can you add these extra experiments to your revised version?