A reimplementation of the ViT vision model, based on the Transformer architecture originally designed for text-based tasks.
This is my implementation of the ViT model for practicing PyTorch.
ViT is a computer vision model built on the attention mechanism and the well-known Transformer architecture, which lets it use contextual information (including the position of each image patch and, in our case, the labels assigned to the image).
The research paper can be found here: Research Paper
The official JAX repository is here.
A TensorFlow 2 translation also exists here, created by research scientist Junho Kim! 🙏
.
├── ViT.py
├── __pycache__
│ ├── ViT.cpython-310.pyc
│ ├── data_setup.cpython-310.pyc
│ ├── data_setup.cpython-38.pyc
│ ├── engine.cpython-310.pyc
│ ├── engine.cpython-38.pyc
│ ├── helper_functions.cpython-310.pyc
│ ├── helper_functions.cpython-38.pyc
│ ├── main.cpython-310.pyc
│ ├── mlp.cpython-310.pyc
│ ├── msa.cpython-310.pyc
│ ├── patch_embedding.cpython-310.pyc
│ ├── path_embedding.cpython-310.pyc
│ └── transformer_encoder.cpython-310.pyc
├── data_setup.py
├── engine.py
├── helper_functions.py
├── main.py
├── mlp.py
├── msa.py
├── patch_embedding.py
├── train.py
└── transformer_encoder.py
$ git clone https://github.com/hkt456/ViT-Model.git
$ cd ViT-Model
To get an overview of the structure of the Multihead Attention layer, the Multi-layer Perceptron layer, the Transformer Encoder, and the ViT model:
python3 source/main.py
To train and test the model, you can use data_setup to download the necessary data and set up the dataloaders:
from data_setup import *
get_data() # Automatically downloads a sample image classification dataset
train_dataloader, test_dataloader, class_names = create_dataloaders() # Returns a tuple of (train_dataloader, test_dataloader, class_names), where class_names is a list of the target classes.
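As a quick sanity check of the returned dataloaders, here is a minimal sketch; it assumes the dataloaders yield batches of image tensors and integer labels, as standard torchvision-style dataloaders do:
images, labels = next(iter(train_dataloader))  # grab one batch
print(images.shape)   # e.g. torch.Size([batch_size, 3, 224, 224])
print(labels.shape)   # e.g. torch.Size([batch_size])
print(class_names)    # list of the target class names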
Automatic training is not set up yet, but there is already a template for training, testing, and evaluating the performance of the model:
from engine import *
results = train() # returns a dictionary of training and testing metrics:
"""
A dictionary of training and testing loss as well as training and
testing accuracy metrics. Each metric has a value in a list for
each epoch.
In the form: {train_loss: [...],
train_acc: [...],
test_loss: [...],
test_acc: [...]}
For example if training for epochs=2:
{train_loss: [2.0616, 1.0537],
train_acc: [0.3945, 0.3945],
test_loss: [1.2641, 1.5706],
test_acc: [0.3400, 0.2973]}
"""
There are also functions for plotting accuracy, loss curves, and more. Feel free to check out helper_functions.py.
The ViT model exposes the following parameters (a usage sketch follows this list):

img_size: int = 224
Default value is set to 224, defining the dimensions of a 224x224 input image to be processed.

in_channels: int = 3
Default value is set to 3, defining the number of channels of the input passed into the patch_embedding layer (the patcher, a Conv2D layer).

patch_size: int = 16
Default value is set to 16, defining the size of each patch that is later turned into an embedding by the patch_embedding layer.

number_transformer_blocks: int = 12
Default value is set to 12 to replicate the number of transformer blocks reported in the research paper.

embedding_dim: int = 768
Default value is set to 768, defining the dimension of the embedding matrix used throughout the different layers.

mlp_size: int = 3072
Default value is set to 3072, defining the out_features for the nn.Linear layers inside the MLP block.

num_heads: int = 12
Default value is set to 12, defining the number of attention heads in the MultiheadAttention used by each MSA layer.

attn_dropout: float = 0
Default value is set to 0, as in the paper, defining the dropout parameter for MultiheadAttention.

mlp_dropout: float = 0.1
Default value is set to 0.1, as in the paper, defining the dropout parameter for the MLPBlock.

embedding_dropout: float = 0.1
Default value is set to 0.1, as in the paper, to randomly drop embeddings.

num_classes: int = 1000
Default value is set to 1000, defining the number of classes to classify.
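As a minimal usage sketch (the class name ViT and the keyword-argument interface below are assumptions based on the file ViT.py and the parameter list above, not a verified API), you could instantiate the model with its defaults and run a dummy forward pass:
import torch
from ViT import ViT  # assumed class name inside ViT.py

vit = ViT(img_size=224,
          in_channels=3,
          patch_size=16,
          number_transformer_blocks=12,
          embedding_dim=768,
          mlp_size=3072,
          num_heads=12,
          attn_dropout=0,
          mlp_dropout=0.1,
          embedding_dropout=0.1,
          num_classes=1000)

# A 224x224 image with patch_size=16 gives (224/16)^2 = 196 patch tokens of dimension 768.
dummy_image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
logits = vit(dummy_image)                  # expected shape: (1, num_classes)
print(logits.shape)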
This project is licensed under the MIT License.