
An Image is Worth 16x16 Words: Implementing the Vision Transformer (ViT) in PyTorch

This repository contains code for implementing the Vision Transformer (ViT) model introduced in the research paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The model is implemented in PyTorch.

The paper proposes an approach to image classification that applies the Transformer architecture and its self-attention mechanism, originally developed for natural language processing, directly to sequences of image patches. The ViT model achieves state-of-the-art performance on several image classification benchmarks, including ImageNet, while requiring substantially fewer computational resources to train than comparable convolutional neural networks.
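As a rough sketch of the core idea (not necessarily how the notebook implements it), the patch embedding step can be written in PyTorch as a single strided convolution. The module name `PatchEmbedding` and the hyperparameters below (16x16 patches, 768-dimensional embeddings, following the paper's ViT-Base configuration) are assumptions for illustration:

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly embed each one."""

    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A Conv2d with kernel_size == stride == patch_size is equivalent to
        # cutting the image into non-overlapping patches and applying a
        # shared linear projection to each flattened patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)          # (B, embed_dim, H/16, W/16)
        x = x.flatten(2)          # (B, embed_dim, num_patches)
        return x.transpose(1, 2)  # (B, num_patches, embed_dim)

# A 224x224 image yields (224/16)**2 = 196 patch "tokens".
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting sequence of patch embeddings is what the Transformer encoder attends over, in place of the word embeddings used in NLP.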

This project provides an implementation of the ViT model in PyTorch. The model is trained on a subset of the Food101 dataset.
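The exact subset used in the notebook isn't detailed here; as a hypothetical sketch, one way to load a random Food101 subset with torchvision (assuming torchvision >= 0.13, where `datasets.Food101` is available) is:

```python
import torch
from torchvision import datasets, transforms

# Resize to the 224x224 input size expected by ViT and convert to tensors.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

full_train = datasets.Food101(root="data", split="train",
                              transform=transform, download=True)

# Illustrative choice: keep a random 10% of the training images.
num_samples = len(full_train) // 10
indices = torch.randperm(len(full_train))[:num_samples]
train_subset = torch.utils.data.Subset(full_train, indices)

train_loader = torch.utils.data.DataLoader(train_subset, batch_size=32,
                                           shuffle=True)
```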

The notebook was made while learning about PyTorch and deep learning, following the Udemy course PyTorch for Deep Learning in 2023: Zero to Mastery, by Andrei Neagoie, Daniel Bourke, and Zero To Mastery. The main instructor's GitHub profile: mrdbourke. The course repository: https://github.com/mrdbourke/pytorch-deep-learning

All code is contained in the "vit_pytorch_paper_replicating.ipynb" file. I recommend opening it with Google Colab.