rohitgirdhar/AttentionalPoolingAction

Attentional pooling for the CIFAR-10/100 and STL-10 datasets

Closed this issue · 3 comments

Thanks a lot for the paper and sharing the code.
It seems that for the CIFAR-10 dataset, the results are very similar with and without attentional pooling.

With: 92.87%
Without: 92.86%

ResNet-32 (residual blocks: 5-5-5, output_channels: 16-32-64, number of parameters: 0.46M) is used for this test.

Is attentional pooling only effective for particular kinds of datasets?

Thanks @LiliMeng for the feedback and trying out attentional pooling on CIFAR-10!
Here's what I think might be happening here:

  1. Given the connection (low-rank approximation) we show to 2nd-order pooling, a variant of which has been shown to be useful for fine-grained recognition (bilinear CNNs) and, concurrently, for other fine-grained tasks like VQA (relation nets), the 10-way classification task on CIFAR might not be "fine-grained" enough.
  2. Again using the connection to 2nd-order pooling, I think attentional pooling would be most useful for tasks that require "interaction" features ($X^TX$), i.e. pair-wise features of one part of the image/video with another (see the sketch after this reply). This is, I think, especially true in action recognition and VQA, which might explain the recent success of similar self-attention methods for these tasks (Attend and Interact, Non-local Neural Networks).
  3. The resolution of the images is also important. If your base network has a large receptive field, you might want to increase the resolution of the input images (simple resizing would work), so that the last-layer neurons look at different regions of the image and attentional pooling can down-weight certain regions.

That said, our attention module is super lightweight and seems to mostly maintain or improve performance, so it might be useful to keep around in your network architectures 🙂
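
For concreteness, here is a minimal PyTorch-style sketch of a rank-1 attentional pooling head of the kind discussed above (a class-agnostic attention map multiplied with class-specific maps and summed over space). The class and variable names are mine and this is only an illustration, not the repository's TensorFlow implementation:

```python
import torch
import torch.nn as nn

class AttentionalPoolingHead(nn.Module):
    """Sketch of rank-1 (low-rank) second-order pooling:
    score_k = sum over spatial locations of (X a_k) * (X b)."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        # class-agnostic ("bottom-up") attention map: X b
        self.bottom_up = nn.Conv2d(in_channels, 1, kernel_size=1)
        # class-specific ("top-down") maps: X a_k for each class k
        self.top_down = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W) last conv features
        attn = self.bottom_up(x)               # (B, 1, H, W)
        class_maps = self.top_down(x)          # (B, K, H, W)
        # elementwise product, then sum over the spatial grid -> class logits
        return (class_maps * attn).flatten(2).sum(-1)   # (B, K)

# e.g. on a CIFAR-style ResNet-32, the last feature map is (B, 64, 8, 8)
head = AttentionalPoolingHead(in_channels=64, num_classes=10)
logits = head(torch.randn(4, 64, 8, 8))        # -> (4, 10)
```

Such a head adds roughly `in_channels * (num_classes + 1)` parameters, only marginally more than a plain linear classifier, which is why it is cheap to keep around.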

Thanks a lot @rohitgirdhar for your kind and detailed reply! :)

  1. I also tried attentional pooling on CIFAR-100 with ResNet-32; the result is 69.44% (with attentional pooling) vs. 70.05% (without). Although CIFAR-100 has 20 superclasses, each containing 5 similar sub-classes (such as rabbit and squirrel), maybe it is still not "fine-grained" enough? The CIFAR-10/100 feature map before the pooling layer is [batch_size, 8, 8, 64]; maybe 8x8 is too small for weighting certain regions. Or, because the object to be recognized already takes up most of the image in CIFAR-10/100, additional attention may not be helpful?

  2. Is the resolution of the images really that important? I also tried STL-10, whose images are 96x96 (three times the size of CIFAR-10/100); the result is 79.69% (with attentional pooling) vs. 80.92% (without).

  3. I'll try using larger images and activity datasets.

Thanks for trying the other experiments. Yes, the resolution of the image is quite important, as you want different features at the last layer to focus on different areas of the image.
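
If it helps, here is a minimal sketch (PyTorch, purely illustrative; the 96x96 target size is an arbitrary choice) of the simple resizing suggested above, so that the final feature map has more spatial locations for the attention to re-weight:

```python
import torch
import torch.nn.functional as F

# Dummy CIFAR-sized batch: (batch, channels, height, width)
images = torch.randn(128, 3, 32, 32)

# Bilinear upsampling to 96x96; with a 4x-downsampling CIFAR ResNet this turns
# the final 8x8 feature map into a 24x24 one, giving the attention map more
# regions to up- or down-weight.
upsampled = F.interpolate(images, size=96, mode="bilinear", align_corners=False)
print(upsampled.shape)   # torch.Size([128, 3, 96, 96])
```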