Vision-Transformers

This repo contains everything you need to understand the complete Vision Transformer architecture and its various implementations.

Primary Language: Python | License: MIT

Zero-to-Hero: ViT🚀

This guide covers the fundamentals of understanding and implementing Vision Transformers (ViT) and their evolution into Video Vision Transformers (ViViT). The main focus is on modeling spatio-temporal relations with vision transformers.


1. Vision Transformer (ViT) Fundamentals:

Surveys and Overviews:

Key Papers:

  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: Paper | Code
  • Training data-efficient image transformers & distillation through attention (DeiT): Paper | Code

Concepts and Tutorials:

  • "Attention Is All You Need": Paper
  • "The Illustrated Transformer": Blog Post
  • "Vision Transformer Explained": Blog Post

2. Convolutional ViT and Hybrid Models:

  • CvT: Introducing Convolutions to Vision Transformers: Paper | Code
  • CoAtNet: Marrying Convolution and Attention for All Data Sizes: Paper
  • ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases: Paper | Code
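The common thread in these hybrids is replacing ViT's hard patch split with convolutional token embeddings that keep local 2D structure. A rough NumPy illustration of the idea (the 7x7/stride-4 kernel and channel count are illustrative choices, not the exact configuration of any one paper, and padding is omitted for brevity):

```python
import numpy as np

def conv2d(x, kernel, stride):
    """Minimal valid-mode strided 2D convolution: (H, W, Cin) x (k, k, Cin, Cout)."""
    k = kernel.shape[0]
    h_out = (x.shape[0] - k) // stride + 1
    w_out = (x.shape[1] - k) // stride + 1
    out = np.empty((h_out, w_out, kernel.shape[-1]))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

# Convolutional token embedding: an overlapping strided conv produces the
# token grid, so neighboring tokens share pixels instead of being disjoint
# 16x16 patches as in plain ViT.
image = np.random.rand(224, 224, 3)
kernel = np.random.rand(7, 7, 3, 64) * 0.01
tokens = conv2d(image, kernel, stride=4).reshape(-1, 64)
print(tokens.shape)  # (3025, 64)
```

The overlapping windows give the tokens a convolutional inductive bias (locality, translation equivariance) that pure ViT has to learn from data, which is why these hybrids train well on smaller datasets.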

3. Efficient Transformers and Swin Transformer:

  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows: Paper | Code
  • Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions: Paper | Code
  • Efficient Transformers: A Survey: Paper
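Swin's efficiency comes from restricting self-attention to local windows and shifting the window grid between layers. A small NumPy sketch of the two operations (the 56x56x96 feature map and 7x7 windows match a typical first-stage configuration, but the functions are generic):

```python
import numpy as np

def window_partition(x, window_size=7):
    """Partition a feature map (H, W, C) into non-overlapping windows.

    Attention is computed within each window, so cost grows linearly
    with image size instead of quadratically.
    """
    h, w, c = x.shape
    x = x.reshape(h // window_size, window_size,
                  w // window_size, window_size, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, c)

def shift_windows(x, shift=3):
    """Cyclically shift the map so the next layer's windows straddle the
    previous layer's window boundaries (the 'shifted window' trick)."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))

feat = np.random.rand(56, 56, 96)
wins = window_partition(feat)                      # (64, 49, 96): 8x8 windows of 49 tokens
shifted = window_partition(shift_windows(feat))    # same shape, offset grid
print(wins.shape, shifted.shape)
```

Alternating regular and shifted windows lets information flow across window boundaries while each attention call stays at a fixed, small sequence length (49 tokens here).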

4. Space-Time Attention and Video Transformers:

  • TimeSformer: Is Space-Time Attention All You Need for Video Understanding? Paper | Code
  • Space-Time Mixing Attention for Video Transformer: Paper
  • MViT: Multiscale Vision Transformers: Paper | Code
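The TimeSformer paper's best-performing variant is "divided" space-time attention: each token attends along the time axis first, then along the spatial axis, rather than over all T*N video tokens jointly. A toy NumPy sketch of the reshaping involved (single head, random weights, no projections or residuals, so it only demonstrates the token grouping, not a trainable layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Plain single-head self-attention over axis 1 (no learned projections)."""
    d = x.shape[-1]
    attn = softmax(x @ x.transpose(0, 2, 1) / np.sqrt(d))
    return attn @ x

def divided_space_time_attention(tokens, t, n):
    """Divided attention: temporal attention (same patch, all frames),
    then spatial attention (same frame, all patches)."""
    b, _, d = tokens.shape
    # temporal pass: group tokens by patch location -> sequences of length t
    x = tokens.reshape(b, t, n, d).transpose(0, 2, 1, 3).reshape(b * n, t, d)
    x = self_attention(x)
    # spatial pass: group tokens by frame -> sequences of length n
    x = x.reshape(b, n, t, d).transpose(0, 2, 1, 3).reshape(b * t, n, d)
    x = self_attention(x)
    return x.reshape(b, t, n, d).reshape(b, t * n, d)

video_tokens = np.random.rand(2, 8 * 196, 768)  # batch=2, 8 frames x 196 patches
out = divided_space_time_attention(video_tokens, t=8, n=196)
print(out.shape)  # (2, 1568, 768)
```

The payoff is cost: joint attention scales with (T*N)^2 per layer, while the divided scheme scales with T^2 + N^2 per token group, which is what makes full-video transformers tractable.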

5. Video Vision Transformer (ViViT):

  • ViViT: A Video Vision Transformer: Paper | Code

How to use this Repo?

  • Start by reading the survey papers to get a broad understanding of the field.
  • For each key paper, read the abstract and introduction, then skim through the methodology and results sections.
  • Implement key concepts using the provided GitHub repositories or your own code.
  • Experiment with different architectures and datasets to solidify your understanding.
  • Use the additional resources to dive deeper into specific topics or applications.