/ViTOL

ViTOL

Primary LanguagePythonMIT LicenseMIT

ViTOL: Vision Transformers for Weakly Supervised Object Localization

Official implementation of the paper ViTOL: Vision Transformer forWeakly Supervised Object Localization which is accepted as CVPRW-2022 paper for L3DIVU-2022.

This repository contains inference code and pre-trained model weights for our model in Pytorch framework. Code is trained and tested in Python 3.6.9 and Pytorch version 1.7.1+cu101

ViTOL-GAR Localization maps:

vitol

Model Zoo

We provide pre-trained weights for VITOL with DeiT-S and DeiT-B backbone on ImageNet-1k and CUB datasets below.

CUB: Updating soon

Results on ImageNet-1k dataset

Method MaxBoxAccV2 Top1Acc IOU50 Top1Cls
ViTOL-GAR Small 69.61 54.74 71.86 71.84
ViTOL-LRP Small 68.23 53.62 70.48 71.84
ViTOL-GAR Base 69.17 57.62 71.32 77.08
ViTOL-LRP Base 70.47 58.64 72.51 77.08

Results on CUB dataset

updating soon

Usage

Clone the repository git clone https://github.com/Saurav-31/ViTOL.git

Setup conda environment

conda env create -f environment.yml
conda activate vitol

Dataset preparation

Please refer here for dataset preparation

Inference results on ImageNet

Edit the config files under configs folder
1. Add paths to ImageNet dataset
--data_root=\PATH\TO\DATASET
--metadata_root=\PATH\TO\GROUND_TRUTH 
2. Download ViTOL weights and copy to directory named "pretrained_weights"

--CHECKPOINT_NAME=$VITOL_WEIGHTS_TAR_FILENAME

RUN ViTOL Base with GAR

bash evaluate.sh configs/ilsvrc/ViTOL_GAR_base.yml

RUN ViTOL Small with GAR

bash evaluate.sh configs/ilsvrc/ViTOL_GAR_small.yml

To do
  • Setup Training Code for the same
  • Train the model with more stronger backbones
  • Jupyter notebook for visualization
We borrow code from

Evaluating Weakly Supervised Object Localization Methods Right (CVPR 2020) Transformer Interpretability Beyond Attention Visualization (CVPR 2021)

Contacts

If you have any question about our work or this repository, please don't hesitate to contact us by emails.

Citation

If you find this work useful, please cite as follows:

@inproceedings{gupta2022vitol,
  title={ViTOL: Vision Transformer for Weakly Supervised Object Localization},
  author={Gupta, Saurav and Lakhotia, Sourav and Rawat, Abhay and Tallamraju, Rahul},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={4101--4110},
  year={2022}
}