PyTorch implementation of the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. This project demonstrates how to replicate the Vision Transformer (ViT) architecture and doubles as a comprehensive guide to replicating deep learning research papers in PyTorch.
The Vision Transformer (ViT) is a transformer-based approach to image classification. It splits an image into fixed-size patches, maps them to a 1D sequence of learnable token embeddings, and passes that sequence through a standard Transformer encoder.
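
The embedding step described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the code used in this project: the class name `PatchEmbedding`, the 224x224 input size, and the defaults (`patch_size=16`, `embed_dim=768`, 12 heads, a single encoder layer) are assumptions chosen to resemble the ViT-Base setup from the paper. The class token and position embeddings that the paper adds before the encoder are left out for brevity.

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Hypothetical helper: turn an image batch into a sequence of patch embeddings."""

    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A convolution with kernel_size == stride == patch_size applies the same
        # learnable linear projection to every non-overlapping patch.
        self.proj = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)          # (batch, embed_dim, height/patch, width/patch)
        x = x.flatten(2)          # (batch, embed_dim, num_patches)
        return x.transpose(1, 2)  # (batch, num_patches, embed_dim)


# Two random RGB images of size 224x224 (assumed input size).
images = torch.randn(2, 3, 224, 224)
tokens = PatchEmbedding()(images)

# A single standard Transformer encoder layer from torch.nn (assumed depth of 1).
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
encoded = encoder(tokens)

print(tokens.shape, encoded.shape)
```

With these assumed sizes, each image yields (224 / 16)^2 = 196 patch tokens of width 768, and the encoder keeps the sequence shape unchanged.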
This project is licensed under the terms of the MIT license.