Implementation of the ViT model in PyTorch, from the paper 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' by Google Research.
- Explore Transformer-based architectures for Computer Vision Tasks.
- Transformers have been the de-facto standard for NLP tasks, while CNN/ResNet-like architectures have been the state of the art for computer vision.
- Until this paper, researchers had applied attention to vision mostly in conjunction with CNNs.
- The paper demonstrates the strength and versatility of Vision Transformers, showing that a pure Transformer can be used for image recognition and can even beat state-of-the-art CNNs.
- The authors tested different variants of the Vision Transformer (different patch sizes, numbers of layers, and embedding dimensions) on datasets of different sizes: ImageNet, JFT-300M, CIFAR-10/100, etc.
- The results of the Vision Transformer are compared, under the same conditions, with other architectures such as BiT (ResNet152x4) and EfficientNet.
- The models were also evaluated on the VTAB classification suite, consisting of 19 tasks divided into Natural, Specialized, and Structured groups.
- The authors also performed a preliminary exploration of masked patch prediction for self-supervision.
Due to the non-availability of powerful compute on Google Colab, we chose to train and test on two smaller datasets.
- Vision Transformer implemented from scratch, including Multihead Attention, FeedForward, Transformer, and Classification Head modules (VisionTransformer.ipynb); a rough sketch of how these modules fit together follows this list
- Hybrid Variant of Vision Transformer with Pretrained ResNet features as input to the Transformer (VisionTransformer.ipynb)
- Modular ResNet implemented from scratch (ResNet34, ResNet50) (VisionTransformer.ipynb)
- Pretrained Vision Transformer using external code to verify results from the paper (PretrainedVisionTransformer.ipynb)
- Attention Map Visualization on input images (PretrainedVisionTransformer.ipynb)
- Visualization of Filter Embeddings (PretrainedVisionTransformer.ipynb)
- Visualization of Position Embeddings (PretrainedVisionTransformer.ipynb)
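For reference, here is a minimal sketch of how such modules typically compose in PyTorch. It is illustrative only: it uses `torch.nn.MultiheadAttention` instead of the from-scratch attention in VisionTransformer.ipynb, and the default sizes (32x32 images, patch size 4, 10 classes, 8 layers) are assumptions rather than the notebook's exact configuration.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """One encoder block: pre-norm multi-head self-attention and a feed-forward MLP, each with a residual."""

    def __init__(self, dim, heads, mlp_dim, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class ViT(nn.Module):
    """Patch embedding + [class] token + position embeddings + Transformer encoder + classification head."""

    def __init__(self, image_size=32, patch_size=4, num_classes=10,
                 dim=256, depth=8, heads=8, mlp_dim=512, channels=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution (equivalent to a linear projection of each flattened patch).
        self.to_patch = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.Sequential(*[TransformerBlock(dim, heads, mlp_dim) for _ in range(depth)])
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, img):
        x = self.to_patch(img).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.blocks(x)
        return self.head(x[:, 0])                            # classify on the [class] token


logits = ViT()(torch.randn(2, 3, 32, 32))                    # -> (2, 10)
```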
- The patch size in the Vision Transformer determines the length of the input sequence: a smaller patch size yields more tokens and hence more information exchange during self-attention. This is consistent with the better results obtained using patch size 4 rather than 8 on 32x32 images (see the sequence-length check below).
- Increasing the number of layers of the Vision Transformer should ideally lead to better results, yet our 8-layer model performed marginally better than the 12-layer model. We attribute this to the small datasets used for training: higher-capacity models require more data to capture useful image features.
- As noted in the paper, the Hybrid Vision Transformer performs better than ViT on small datasets: the initial ResNet features already capture low-level structure thanks to the locality of convolutions, which a plain ViT cannot learn from the limited training data available (see the hybrid sketch below).
- ResNets trained from scratch outperform both ViT and Hybrid-ViT trained from scratch, thanks to their inherent inductive biases of locality and translation invariance. These biases cannot be learned by the ViT on small datasets.
- The pretrained ViT performs much better than the other methods: having been trained on huge datasets, it has learned better representations than even ResNet, since self-attention can aggregate information from the whole image right from the first layer, unlike a CNN.
- Due to the lack of more powerful computing resources, the models could not be trained on large datasets, which is the foremost requirement of this architecture for producing very high accuracies. Because of this limitation, our from-scratch implementation could not reproduce the accuracies reported in the paper.
- Evaluating the model on the VTAB classification suite.
- Exploring different attention mechanisms that take the 2D structure of images into account.
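To make the patch-size observation above concrete: the encoder sees (image_size / patch_size)² patch tokens, so the choice of patch size changes the sequence length quadratically. A quick check (not code from the notebooks):

```python
# Number of patch tokens fed to the encoder (excluding the [class] token) for a 32x32 image.
image_size = 32
for patch_size in (4, 8):
    num_patches = (image_size // patch_size) ** 2
    print(f"patch size {patch_size}: {num_patches} tokens")
# patch size 4: 64 tokens
# patch size 8: 16 tokens
```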
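The hybrid observation above is easier to picture in code. Below is a minimal sketch of the idea, assuming a torchvision ResNet-34 backbone truncated after layer3 and PyTorch's built-in Transformer encoder; the class names, dimensions, and choice of backbone stage are illustrative and do not mirror VisionTransformer.ipynb exactly.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34


class HybridViT(nn.Module):
    """Hybrid variant: a pretrained CNN backbone supplies the token sequence instead of raw image patches."""

    def __init__(self, num_classes=10, dim=256, depth=8, heads=8, mlp_dim=512, grid_size=14):
        super().__init__()
        backbone = resnet34(weights="IMAGENET1K_V1")
        # Keep the stem through layer3 (output stride 16, 256 channels); drop layer4, avgpool, and fc.
        self.backbone = nn.Sequential(*list(backbone.children())[:-3])
        self.proj = nn.Conv2d(256, dim, kernel_size=1)   # 1x1 conv = per-location linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, grid_size * grid_size + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, img):
        feats = self.proj(self.backbone(img))            # (B, dim, H', W') feature map
        x = feats.flatten(2).transpose(1, 2)             # (B, H'*W', dim) token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1) + self.pos_embed)
        return self.head(x[:, 0])                        # classify on the [class] token


# grid_size=14 matches 224x224 inputs, since ResNet-34 through layer3 downsamples by 16.
logits = HybridViT()(torch.randn(2, 3, 224, 224))        # -> (2, 10)
```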
Presentation can be accessed here.
| Name | ID |
| --- | --- |
| Akshit Khanna | 2017A7PS0023P |
| Vishal Mittal | 2017A7PS0080P |
| Raghav Bansal | 2017A3PS0196P |
- Alexey Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020), Paper
- Eunkwang Jeon, ViT-pytorch (2020), GitHub repository