Advance-Deep-learning-for-Computer-Vision-GNR-650-

Advanced Deep Learning for Computer Vision - Assignments and Project - IITB


GNR-650

Assignment 1 (Overfitting a ResNet-18 Model):

1_Resnet18_overfit.ipynb:

  • Took 100 images per class from the CIFAR-10 dataset (a setup sketch follows this list).

  • Used the ResNet-18 architecture.

  • Overfit the model:

-> Training accuracy: 99-100%

-> Testing accuracy: 46%

  • Showed the magnitudes of the kernel weights at different layers.

  • Visualized the kernels at different layers.

  • Additional: Visualized the feature maps of a horse image at different layers.
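A minimal sketch of this setup, assuming torchvision and the standard CIFAR-10 loader (the notebook's exact preprocessing may differ):

```python
import torch
from torchvision import datasets, models, transforms

# Hypothetical setup sketch: keep the first 100 images of each CIFAR-10 class
# and train ResNet-18 on that small subset so it overfits.
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())

per_class = {c: [] for c in range(10)}
for idx, label in enumerate(train_set.targets):
    if len(per_class[label]) < 100:
        per_class[label].append(idx)
subset = torch.utils.data.Subset(train_set, [i for v in per_class.values() for i in v])

model = models.resnet18(num_classes=10)  # trained from scratch on the 1,000-image subset
loader = torch.utils.data.DataLoader(subset, batch_size=64, shuffle=True)
```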

Assignment 2 (Vision Transformer Fine-Tuning and Attention Map Visualization):

Go to the Assignment 2 folder.

This folder contains code and notebooks for different fine-tuning strategies for the Vision Transformer (ViT) model on the EuroSAT dataset, along with visualization of attention maps.

Fine-Tuning Notebooks

  1. 2_ViT_B12_Last_Layer_FT.ipynb

    • Fine-tunes the last fully connected layer of the ViT model.
  2. 2_ViT_B12_Bottom_Layer_FT.ipynb

    • Fine-tunes only the 8th to 11th transformer layers and the fully connected layer.
  3. 2_ViT_B12_Full_Fine_tune.ipynb

    • Fine-tunes all layers of the ViT model.
  4. 2_ViT_B12_No_Fine_Tune.ipynb

    • Evaluates the pretrained ViT model without any fine-tuning, as a baseline (a sketch of the layer freezing behind these strategies follows this list).
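The four strategies differ only in which parameters stay trainable. A minimal sketch of the freezing logic, assuming torchvision's ViT-B/16 and the 10 EuroSAT classes (the notebooks' "ViT_B12" checkpoint may be loaded differently):

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Hypothetical sketch: an ImageNet-pretrained ViT-B/16 with a new 10-class head.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = torch.nn.Linear(model.heads.head.in_features, 10)

# Last-layer fine-tuning: freeze everything, then unfreeze only the classifier head.
for p in model.parameters():
    p.requires_grad = False
for p in model.heads.head.parameters():
    p.requires_grad = True

# Bottom-layer variant: additionally unfreeze transformer blocks 8-11.
for i in range(8, 12):
    for p in model.encoder.layers[i].parameters():
        p.requires_grad = True

# Full fine-tuning leaves every parameter trainable; "no fine-tune" trains nothing.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```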

Visualization

  1. Visualization.ipynb

    • Visualizes attention maps for different labels and models (an extraction sketch follows).
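One way to pull per-layer attention maps out of a ViT, sketched here with a timm backbone as an assumption (the notebook may extract them differently): disable timm's fused attention so the softmaxed attention matrix passes through each block's `attn_drop`, and capture it with a forward hook.

```python
import timm
import torch

# Hypothetical extraction sketch; assumes a timm ViT-B/16 with a 14x14 patch grid.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)
model.eval()

attn_maps = {}  # layer index -> attention tensor of shape (heads, tokens, tokens)

def save_attn(idx):
    def hook(module, inputs, output):
        attn_maps[idx] = output.detach()[0]  # attn_drop sees the (B, heads, N, N) weights
    return hook

for i, block in enumerate(model.blocks):
    block.attn.fused_attn = False  # force the non-fused path so the weights materialize
    block.attn.attn_drop.register_forward_hook(save_attn(i))

x = torch.randn(1, 3, 224, 224)  # stand-in for a normalized EuroSAT image
with torch.no_grad():
    model(x)

# CLS-token attention over the 14x14 patch grid at layer 0, averaged across heads.
cls_attn = attn_maps[0].mean(0)[0, 1:].reshape(14, 14)
```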

Model Performance

Epochs: 10

Model                           Train Accuracy   Validation Accuracy   Test Accuracy
Only Last Layer Fine-Tune ViT   98.70%           96.48%                95.96%
Bottom Layer Fine-Tune ViT      99.53%           97.04%                97.11%
All Layer Fine-Tune ViT         91.48%           91.85%                90.74%
Do Not Fine-Tune ViT            13.55%           15.22%                13.92%

Observations

Last Layer Fine-Tuning:

  • Quickly reached a high validation accuracy of 96.48% (95.96% on the test set).
  • Efficient, with competitive performance.

Bottom Layer Fine-Tuning:

  • Outperformed last-layer fine-tuning with a validation accuracy of 97.04%.
  • Captured complex features effectively.

All Layer Fine-Tuning:

  • Initially lower performance but gradually improved.
  • Requires more training time.

Visualization Comparison:

  • Last-layer fine-tuning and no fine-tuning exhibit similar attention maps across all transformer layers, because the transformer layers are frozen in both cases.
  • Across all models, the initial layers of the transformer consistently show better attention map visualizations. This suggests that these layers focus on capturing low-level and fundamental features in the images, which are essential for understanding the dataset.
  • The attention map visualizations in the all-layer fine-tuning case are less informative than the others, since this strategy needs more training epochs to converge.

Assignment 3 (Jigsaw and Relative Patch Prediction Self-Supervised Learning (SSL) Tasks)

Go to the Assignment 3 folder.

Self-Supervised Techniques:

  1. Jigsaw_pretext_CIFAR10.ipynb

    • Learns image feature representations by solving jigsaw puzzles of jumbled image patches.
    • Defines a custom CNN model to handle 32 × 32 inputs.
    • Jigsaw creation:
      • Resize the 32 × 32 CIFAR-10 image to 128 × 128.
      • Center-crop to 105 × 105.
      • Divide the crop into a 3 × 3 grid of patches, each of size 32 × 32.
      • Pass all nine 32 × 32 patches to the model and predict the permutation index.
    • Permutation creation:
      • Choose 1,000 permutations of the 9 patch indices such that the normalized Hamming distance between chosen permutations is greater than 0.9 (a selection sketch follows this list).
    • Downstream task: image classification.
    • Grad-CAM analysis.

    Jigsaw_inference.ipynb

    • Runs inference for the jigsaw SSL task and the downstream classification task on the test dataset.
  2. Patch_Prediction_CIFAR10.ipynb

    • Learns image feature representations by predicting the relative locations of image patches.
    • Uses the ResNet-18 architecture.
    • Resizes the image to 96 × 96 and divides it into a 3 × 3 grid of patches.
    • Passes the center patch and a neighbouring patch to the model, which predicts the neighbour's position relative to the center (a pair-construction sketch follows the performance table below).
    • Downstream task: image classification.
    • Inference on the test dataset.
    • Grad-CAM analysis.
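On the permutation set: a strict pairwise constraint of normalized Hamming distance > 0.9 can hold for at most 9 permutations of 9 items (distance > 8/9 means disagreeing in every position, and each position admits only 9 distinct values), so a set of 1,000 is typically built with the greedy maximal-Hamming-distance scheme from the original jigsaw paper (Noroozi & Favaro). A vectorized sketch under that assumption; the notebook's exact selection may differ:

```python
import numpy as np

def build_permutation_set(n_perms=1000, n_patches=9, pool_size=5000, seed=0):
    """Greedily pick permutations that stay far apart in normalized Hamming distance."""
    rng = np.random.default_rng(seed)
    # Sample a candidate pool instead of enumerating all 9! = 362,880 permutations.
    pool = np.array([rng.permutation(n_patches) for _ in range(pool_size)])
    chosen = [pool[0]]
    # min_dist[i]: distance from pool[i] to its nearest already-chosen permutation.
    min_dist = (pool != chosen[0]).mean(axis=1)
    min_dist[0] = -1.0  # mark the seed permutation as used
    while len(chosen) < n_perms:
        i = int(min_dist.argmax())  # candidate farthest from the chosen set
        chosen.append(pool[i])
        min_dist = np.minimum(min_dist, (pool != pool[i]).mean(axis=1))
        min_dist[i] = -1.0
    return np.stack(chosen)

perms = build_permutation_set()  # shape (1000, 9); row index = permutation class label
```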

Model Performance

Model                                                          Validation Accuracy   Test Accuracy
Jigsaw SSL                                                     95%                   95%
Jigsaw SSL - Downstream Classification Task                    67%                   67%
Relative Patch Location SSL                                    98%                   97%
Relative Patch Location SSL - Downstream Classification Task   74%                   74%
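For the relative-patch-location task above, training pairs can be built as below. This is a hypothetical sketch (`make_relative_patch_pair` is not from the notebooks): split the 96 × 96 image into nine 32 × 32 patches and label a neighbour by its position relative to the center.

```python
import numpy as np
import torch

def make_relative_patch_pair(img, rng):
    """img: (3, 96, 96) tensor -> (center patch, neighbour patch, label in 0..7)."""
    patches = img.unfold(1, 32, 32).unfold(2, 32, 32)   # (3, 3, 3, 32, 32)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(9, 3, 32, 32)
    center = patches[4]                                 # middle cell of the 3x3 grid
    label = int(rng.integers(8))                        # one of 8 neighbour positions
    grid_idx = label if label < 4 else label + 1        # skip the center cell (index 4)
    return center, patches[grid_idx], label

rng = np.random.default_rng(0)
img = torch.randn(3, 96, 96)  # stand-in for a CIFAR-10 image resized to 96 x 96
center, neighbour, label = make_relative_patch_pair(img, rng)
# The model takes (center, neighbour) and predicts `label` as an 8-way classification.
```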

Paper Review 1:

Paper: ViViT: A Video Vision Transformer

Paper Link

Paper Review 2:

Paper: Universal Domain Adaptation through Self-Supervision

Paper Link

Project Title:

  • Visual Entities Empowered Zero-Shot Image-to-Text Generation Transfer Across Domains