ViTOL: Vision Transformer for Weakly Supervised Object Localization

Official implementation of the paper "ViTOL: Vision Transformer for Weakly Supervised Object Localization", which was accepted as a CVPRW-2022 paper for L3DIVU-2022.

This repository contains inference code and pre-trained model weights for our model in the PyTorch framework. The models were trained and tested with Python 3.6.9 and PyTorch 1.7.1+cu101.

ViTOL-GAR Localization maps:

Model Zoo

We provide pre-trained weights for ViTOL with DeiT-S and DeiT-B backbones on the ImageNet-1k and CUB datasets below.

ImageNet: ViTOL-base, ViTOL-small

CUB: Updating soon

Results on the ImageNet-1k dataset (all metrics in %)

Method	MaxBoxAccV2	Top1Acc	IOU50	Top1Cls
ViTOL-GAR Small	69.61	54.74	71.86	71.84
ViTOL-LRP Small	68.23	53.62	70.48	71.84
ViTOL-GAR Base	69.17	57.62	71.32	77.08
ViTOL-LRP Base	70.47	58.64	72.51	77.08
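
For readers unfamiliar with the localization columns, here is a minimal sketch of how a box-based accuracy at a single IoU threshold can be computed. It assumes that IOU50 refers to box accuracy at an IoU threshold of 0.5, and that one predicted box is compared against one ground-truth box per image. The names box_iou and box_accuracy are hypothetical and are not part of this repository; the actual evaluation code may use a different pixel convention and additionally searches over heatmap thresholds.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in continuous coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_accuracy(preds: Sequence[Box], gts: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Percentage of images whose predicted box reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)
```

Averaging box_accuracy over several thresholds (for example 0.3, 0.5 and 0.7) gives a rough analogue of a threshold-averaged score; treat that as an illustration rather than the exact MaxBoxAccV2 protocol.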

Results on CUB dataset

Updating soon

Usage

Clone the repository

git clone https://github.com/Saurav-31/ViTOL.git

Setup conda environment

conda env create -f environment.yml
conda activate vitol
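
After activating the environment, it can be useful to confirm that the interpreter and PyTorch match the versions this README reports (Python 3.6.9, PyTorch 1.7.1+cu101). The following is a minimal sketch; the helper names describe_environment and matches_tested_setup are hypothetical and not part of this repository, and other versions may well work too.

```python
import importlib.util
import sys

# Versions stated in this README; treated here as a reference, not a hard requirement.
TESTED_PYTHON_PREFIX = "3.6."
TESTED_TORCH_PREFIX = "1.7.1"


def describe_environment() -> dict:
    """Collect the Python version and, if PyTorch is installed, its version."""
    info = {"python": "{}.{}.{}".format(*sys.version_info[:3]), "torch": None}
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the check also runs without PyTorch
        info["torch"] = str(torch.__version__)
    return info


def matches_tested_setup(info: dict) -> bool:
    """True if both versions start with the prefixes listed above."""
    python_ok = info["python"].startswith(TESTED_PYTHON_PREFIX)
    torch_ok = info["torch"] is not None and info["torch"].startswith(TESTED_TORCH_PREFIX)
    return python_ok and torch_ok


if __name__ == "__main__":
    env = describe_environment()
    print(env)
    print("matches tested setup:", matches_tested_setup(env))
```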

Dataset preparation

Please refer here for dataset preparation

Inference results on ImageNet

Edit the config files under the configs folder:

1. Add the paths to the ImageNet dataset and its ground-truth metadata

--data_root=/PATH/TO/DATASET
--metadata_root=/PATH/TO/GROUND_TRUTH

2. Download the ViTOL weights, copy them to a directory named "pretrained_weights", and set the checkpoint name

--CHECKPOINT_NAME=$VITOL_WEIGHTS_TAR_FILENAME
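
Before launching the evaluation, a quick sanity check of these three settings can save a failed run. The sketch below assumes that data_root and metadata_root are directories and that the checkpoint file sits directly inside pretrained_weights; the function missing_inputs is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List


def missing_inputs(data_root: str, metadata_root: str, checkpoint_name: str,
                   weights_dir: str = "pretrained_weights") -> List[str]:
    """Return a message for every configured path that does not exist."""
    problems = []
    if not Path(data_root).is_dir():
        problems.append("data_root is not a directory: " + data_root)
    if not Path(metadata_root).is_dir():
        problems.append("metadata_root is not a directory: " + metadata_root)
    # Assumption: the checkpoint is stored as <weights_dir>/<checkpoint_name>.
    checkpoint = Path(weights_dir) / checkpoint_name
    if not checkpoint.is_file():
        problems.append("checkpoint not found: " + str(checkpoint))
    return problems
```

An empty list means all three locations were found; otherwise each entry names the setting to fix in the config file.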

Run ViTOL Base with GAR

bash evaluate.sh configs/ilsvrc/ViTOL_GAR_base.yml

Run ViTOL Small with GAR

bash evaluate.sh configs/ilsvrc/ViTOL_GAR_small.yml

To do

Set up the training code
Train the model with stronger backbones
Jupyter notebook for visualization

We borrow code from

Evaluating Weakly Supervised Object Localization Methods Right (CVPR 2020)
Transformer Interpretability Beyond Attention Visualization (CVPR 2021)

Contacts

If you have any questions about our work or this repository, please don't hesitate to contact us by email.

Citation

If you find this work useful, please cite as follows:

@inproceedings{gupta2022vitol,
  title={ViTOL: Vision Transformer for Weakly Supervised Object Localization},
  author={Gupta, Saurav and Lakhotia, Sourav and Rawat, Abhay and Tallamraju, Rahul},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={4101--4110},
  year={2022}
}