This repository includes the PyTorch implementation and trained models of VGTR (Visual Grounding with TRansformers).
[arXiv]
In this paper, we propose a transformer-based approach for visual grounding. Unlike existing proposal-and-rank frameworks that rely heavily on pretrained object detectors, or proposal-free frameworks that upgrade an off-the-shelf one-stage detector by fusing textual embeddings, our approach is built on top of a transformer encoder-decoder and is independent of any pretrained detectors or word embedding models. Termed VGTR (Visual Grounding with TRansformers), our approach learns semantic-discriminative visual features under the guidance of the textual description without harming their localization ability. This information flow gives VGTR a strong capability to capture context-level semantics of both the vision and language modalities, allowing it to aggregate the accurate visual clues implied by the description and locate the object instance of interest. Experiments show that our method outperforms state-of-the-art proposal-free approaches by a considerable margin on four benchmarks.
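For orientation only, below is a minimal, hedged PyTorch sketch of the idea described above: a transformer encoder-decoder that fuses CNN visual tokens with text tokens learned from scratch (no pretrained detector or word embeddings) and regresses a single box. The module names, dimensions, and pooling choices are illustrative assumptions and do not reproduce the released VGTR architecture.

```python
# Illustrative sketch only -- NOT the released VGTR model.
# Assumptions: a 2048-channel CNN feature map, a randomly initialized text embedding,
# and one regressed box in normalized (cx, cy, w, h) form.
import torch
import torch.nn as nn

class ToyGroundingTransformer(nn.Module):
    def __init__(self, d_model=256, vocab_size=10000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)          # no pretrained word vectors
        self.visual_proj = nn.Conv2d(2048, d_model, kernel_size=1)   # project CNN features, no detector head
        self.transformer = nn.Transformer(d_model=d_model, nhead=8)  # encoder-decoder fusion
        self.bbox_head = nn.Linear(d_model, 4)                       # box regression head

    def forward(self, feat_map, text_ids):
        # feat_map: (B, 2048, H, W) from a CNN backbone; text_ids: (B, L) token indices
        vis = self.visual_proj(feat_map).flatten(2).permute(2, 0, 1)  # (H*W, B, d_model) visual tokens
        txt = self.text_embed(text_ids).transpose(0, 1)               # (L, B, d_model) text tokens
        hs = self.transformer(src=vis, tgt=txt)                       # text queries attend to visual tokens
        return self.bbox_head(hs.mean(dim=0)).sigmoid()               # (B, 4) normalized box

model = ToyGroundingTransformer()
boxes = model(torch.randn(2, 2048, 20, 20), torch.randint(0, 10000, (2, 15)))
print(boxes.shape)  # torch.Size([2, 4])
```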
- Python 3.6
- PyTorch >= 1.6.0
- torchvision
- CUDA >= 9.0
- others (opencv-python, etc.)
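As a hedged starting point, an environment matching the list above can typically be set up with pip (choose a CUDA-enabled PyTorch build that matches your driver):

pip install "torch>=1.6.0" torchvision opencv-python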
- Clone this repository.
- Data preparation.
  - Download the Flickr30K Entities annotations from Flickr30k Entities (bryanplummer.com) and the Flickr30K images from Flickr30K.
  - Download the MSCOCO images from MSCOCO.
  - Download the processed indexes from Gdrive, processed by zyang-ur.
- Download backbone weights. We use ResNet-50/101 as the basic visual encoder. The weights are pretrained on MSCOCO and can be downloaded here (BaiduDrive): ResNet-50 (code: ru8v); ResNet-101 (code: 0hgu). A sanity-check loading sketch follows the directory layout below.
- Organize all files like this:
.
├── main.py
├── store
│   ├── data
│   │   ├── flickr
│   │   │   ├── corpus.pth
│   │   │   └── flickr_train.pth
│   │   ├── gref
│   │   └── gref_umd
│   ├── ln_data
│   │   ├── Flickr30k
│   │   │   └── flickr30k-images
│   │   └── other
│   │       └── images
│   ├── pretrained
│   │   └── flickr_R50.pth.tar
│   └── pth
│       └── resnet50_detr.pth
└── work
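As mentioned in the backbone step above, the ResNet weights go under ./store/pth/. Below is a hedged sketch for sanity-checking that the downloaded weights load into a torchvision ResNet; the checkpoint's key layout (and the "model" nesting) is an assumption, so adapt the filtering if the file is organized differently.

```python
# Hedged sketch: verify the MSCOCO-pretrained ResNet-50 weights can be loaded.
# The "model" nesting below is an assumption about the checkpoint format.
import torch
import torchvision

backbone = torchvision.models.resnet50(pretrained=False)
ckpt = torch.load("./store/pth/resnet50_detr.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
missing, unexpected = backbone.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```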
Dataset | Backbone | Accuracy | Pretrained Model (BaiduDrive) |
---|---|---|---|
Flickr30K Entities | ResNet-50 | 74.17 | flickr_R50.pth.tar (code: rpdr) |
Flickr30K Entities | ResNet-101 | 75.32 | flickr_R101.pth.tar (code: 1igb) |
RefCOCO | ResNet-50 | 78.70 / 82.09 / 73.31 | refcoco_R50.pth.tar (code: xjs8) |
RefCOCO | ResNet-101 | 79.30 / 82.16 / 74.38 | refcoco_R101.pth.tar (code: bv0z) |
RefCOCO+ | ResNet-50 | 63.57 / 69.65 / 55.33 | refcoco+_R50.pth.tar (code: 521n) |
RefCOCO+ | ResNet-101 | 64.40 / 70.85 / 55.84 | refcoco+_R101.pth.tar (code: vzld) |
RefCOCOg | ResNet-50 | 62.88 | refcocog_R50.pth.tar (code: wb3x) |
RefCOCOg | ResNet-101 | 64.05 | refcocog_R101.pth.tar (code: 5ok2) |
RefCOCOg-umd | ResNet-50 | 65.62 / 65.30 | umd_R50.pth.tar (code: 9lzr) |
RefCOCOg-umd | ResNet-101 | 66.83 / 67.28 | umd_R101.pth.tar (code: zen0) |

Accuracy with multiple numbers is reported as val / testA / testB for RefCOCO and RefCOCO+, and as val / test for RefCOCOg-umd.
Train the model with:

python main.py \
--gpu $gpu_id \
--dataset $[refcoco | refcoco+ | others] \
--batch_size $bs \
--savename $exp_name \
--backbone $[resnet50 | resnet101] \
--cnn_path $resnet_coco_weight_path
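For example (the GPU id, batch size, and experiment name below are illustrative, not recommended settings):

python main.py \
--gpu 0 \
--dataset refcoco \
--batch_size 16 \
--savename vgtr_refcoco_r50 \
--backbone resnet50 \
--cnn_path ./store/pth/resnet50_detr.pth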
Download the pretrained models, put them into the folder ./store/pretrained/, and run:
python main.py \
--test \
--gpu $gpu_id \
--dataset $[refcoco | refcoco+ | others] \
--batch_size $bs \
--pretrain $pretrained_weight_path
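For example, to evaluate the RefCOCO ResNet-50 model from the table above (the GPU id and batch size are illustrative):

python main.py \
--test \
--gpu 0 \
--dataset refcoco \
--batch_size 16 \
--pretrain ./store/pretrained/refcoco_R50.pth.tar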
Part of the code is adapted from:
- facebookresearch/detr;
- zyang-ur/onestage_grounding;
- andfoy/refer;
- jadore801120/attention-is-all-you-need-pytorch.
If you find this work useful, please consider citing:

@article{du2021visual,
  title={Visual grounding with transformers},
  author={Du, Ye and Fu, Zehua and Liu, Qingjie and Wang, Yunhong},
  journal={arXiv preprint arXiv:2105.04281},
  year={2021}
}
@inproceedings{du2022visual,
  title={Visual grounding with transformers},
  author={Du, Ye and Fu, Zehua and Liu, Qingjie and Wang, Yunhong},
  booktitle={Proceedings of the International Conference on Multimedia and Expo},
  year={2022}
}