sail-sg/poolformer

Checkpoints of the Ablation study

chuong98 opened this issue · 5 comments

Hi, thanks for your amazing work.
I am reading Table 6, and I am surprised that such a simple method is so effective, especially when the pooling is replaced with an identity mapping: Top-1 74.3 on ImageNet-1k with only Conv1x1 and norm layers. I am thrilled...
Can you release this checkpoint so that we can verify it? Thanks again.

If the baseline implementation is:

    def forward(self, x):
        x = x + self.drop_path(self.token_mixer(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x

Is one of the following implementations correct for the identity case?
Case A:

    def forward(self, x):
        x = x + self.drop_path(self.norm1(x))
        return x + self.drop_path(self.mlp(self.norm2(x)))

Case B:

    def forward(self, x):
        x = x + self.drop_path(self.norm1(x) - x)
        return x + self.drop_path(self.mlp(self.norm2(x)))

Case C:

    def forward(self, x):
        return x + self.drop_path(self.mlp(self.norm2(x)))

Hi @chuong98 ,

There is a simple way to implement it. You just need to change self.token_mixer = Pooling(pool_size=pool_size) to self.token_mixer = nn.Identity().
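A minimal sketch of that swap, to make the relation to the three cases above concrete. The `Block` class here is a simplified, hypothetical stand-in for the repository's block, not its actual code: the use of `nn.GroupNorm(1, dim)`, the MLP ratio of 4, and disabling `drop_path` are all assumptions for illustration. With `nn.Identity()` as the token mixer, the baseline forward reduces to Case A, which the check at the end confirms numerically.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Simplified PoolFormer-style block (hypothetical, for illustration only)."""

    def __init__(self, dim, token_mixer):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)  # assumption: GroupNorm with a single group
        self.token_mixer = token_mixer
        self.norm2 = nn.GroupNorm(1, dim)
        # Channel MLP built from 1x1 convolutions (assumed expansion ratio of 4).
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, 4 * dim, 1), nn.GELU(), nn.Conv2d(4 * dim, dim, 1)
        )
        self.drop_path = nn.Identity()  # stochastic depth disabled in this sketch

    def forward(self, x):
        # Same structure as the baseline forward quoted in the question.
        x = x + self.drop_path(self.token_mixer(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x


def case_a(block, x):
    """Case A from the question, written against the same layers."""
    x = x + block.drop_path(block.norm1(x))
    return x + block.drop_path(block.mlp(block.norm2(x)))


torch.manual_seed(0)
block = Block(dim=8, token_mixer=nn.Identity()).eval()
x = torch.randn(2, 8, 14, 14)
with torch.no_grad():
    same = torch.allclose(block(x), case_a(block, x))
print(same)
```

Case B subtracts the input inside the residual branch and Case C drops the first residual branch entirely, so neither matches what the one-line swap computes.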

The checkpoint is shown in:
poolformer_id_s12.pth.tar

Wonderful! I evaluated the checkpoint and got 74.336.
But when I measured the speed, the sp_12 model was about 3x slower than ResNet-18. I inspected the model, and when I replaced GroupNorm with BatchNorm, the inference time dropped by about half.
Would you mind releasing the checkpoints that use BatchNorm and/or ReLU? Thank you so much.
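For anyone who wants to reproduce this kind of comparison, here is a rough sketch of timing the two norm layers in isolation on CPU. The helper `mean_latency`, the tensor shape, and the choice of `nn.GroupNorm(1, dim)` are assumptions for illustration and are not taken from the repository. It only measures a single layer, not the full model, and the numbers will vary by hardware and PyTorch build; on GPU, `torch.cuda.synchronize()` would be needed around the timed region.

```python
import time

import torch
import torch.nn as nn


def mean_latency(layer, x, warmup=3, iters=10):
    """Average forward time in seconds for `layer` on input `x` (CPU timing)."""
    layer.eval()
    with torch.no_grad():
        for _ in range(warmup):  # warm-up runs are excluded from the measurement
            layer(x)
        start = time.perf_counter()
        for _ in range(iters):
            layer(x)
    return (time.perf_counter() - start) / iters


dim = 64
x = torch.randn(8, dim, 56, 56)  # hypothetical early-stage feature map
gn_time = mean_latency(nn.GroupNorm(1, dim), x)  # assumption: one group
bn_time = mean_latency(nn.BatchNorm2d(dim), x)
print(f"GroupNorm: {gn_time * 1e3:.3f} ms, BatchNorm2d: {bn_time * 1e3:.3f} ms")
```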

As for using Identity instead of pooling, I can't explain why it works. Pooling is the only mechanism for learning spatial information and connecting neighboring tokens, and now even that is dropped. Can you share your thoughts?
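To illustrate the premise of the question, the sketch below checks that a channel MLP made of 1x1 convolutions treats each spatial location independently: perturbing one location changes the output at that location only. The layer sizes and the perturbed coordinates are arbitrary choices for illustration. This covers the channel MLP alone; the norm layers and the rest of the network are not modeled here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 8
# Channel MLP from 1x1 convolutions; sizes are arbitrary for this sketch.
mlp = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, 1), nn.GELU(), nn.Conv2d(4 * dim, dim, 1)
).eval()

x = torch.randn(1, dim, 6, 6)
x_perturbed = x.clone()
x_perturbed[:, :, 2, 3] += 1.0  # change a single spatial location

with torch.no_grad():
    # Per-location magnitude of the output change, summed over channels: shape (H, W).
    diff = (mlp(x_perturbed) - mlp(x)).abs().sum(dim=1)[0]

changed = diff > 1e-6  # small tolerance against floating-point noise
print(changed.sum().item(), changed[2, 3].item())
```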

Hi @chuong98 ,

The checkpoints for BatchNorm and ReLU are poolformer_bn_s12.pth.tar and poolformer_relu_s12.pth.tar.

The identity token mixer still works because local information is sometimes enough for prediction. For example, human faces look largely different from those of other animals.