Implementation of the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" in PyTorch.

Image Recognition with Transformers

PyTorch implementation of the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". This project demonstrates how to replicate the Vision Transformer (ViT) architecture and serves as a guide to replicating deep learning research papers in PyTorch.

ViT Overview

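The Vision Transformer (ViT) is a transformer-based approach to image classification. It splits an image into fixed-size patches (16x16 pixels in the paper), linearly projects each flattened patch into a learnable 1D token embedding, adds position embeddings, and propagates the resulting token sequence through a standard Transformer encoder.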
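The pipeline described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code from this repository's notebooks: the class names `PatchEmbedding` and `TinyViT`, the default hyperparameters, and the zero initialisation of the class token and position embeddings are all assumptions made for the example. It uses PyTorch's built-in `nn.TransformerEncoder` in place of a hand-written encoder block.

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Hypothetical helper: cut an image into patches and embed each one."""

    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A convolution with kernel_size == stride == patch_size is equivalent to
        # slicing non-overlapping patches and applying one shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)           # (batch, embed_dim, H/P, W/P)
        x = x.flatten(2)           # (batch, embed_dim, num_patches)
        return x.transpose(1, 2)   # (batch, num_patches, embed_dim)


class TinyViT(nn.Module):
    """Hypothetical, simplified ViT classifier (sizes are illustrative)."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=2, num_heads=12, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = PatchEmbedding(in_channels, patch_size, embed_dim)
        # Learnable class token and position embeddings. Zero init is an
        # assumption kept for brevity; small random init is more common.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, activation="gelu",
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)                           # (B, N, D)
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)   # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)                         # (B, N + 1, D)
        # Classify from the final state of the class token.
        return self.head(self.norm(encoded[:, 0]))
```

With the defaults, a 224x224 RGB image yields 14 x 14 = 196 patch tokens plus one class token, so calling `TinyViT()` on a batch of shape `(B, 3, 224, 224)` should return logits of shape `(B, 10)`.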

Project Structure

Usage

License

This project is licensed under the terms of the MIT license.