Visual Grounding with Transformers (ICME 2022)

Overview

This repository contains the PyTorch implementation and trained models of VGTR (Visual Grounding with TRansformers).

[arXiv]

In this paper, we propose a transformer-based approach to visual grounding. Unlike existing proposal-and-rank frameworks that rely heavily on pretrained object detectors, or proposal-free frameworks that upgrade an off-the-shelf one-stage detector by fusing textual embeddings, our approach is built on top of a transformer encoder-decoder and is independent of any pretrained detector or word embedding model. Termed VGTR (Visual Grounding with TRansformers), our approach learns semantic-discriminative visual features under the guidance of the textual description without harming their localization ability. This information flow gives VGTR a strong capability for capturing context-level semantics of both the vision and language modalities, allowing it to aggregate the visual clues implied by the description and locate the object instance of interest. Experiments show that our method outperforms state-of-the-art proposal-free approaches by a considerable margin on four benchmarks.
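To give a feel for the overall computation described above, below is a minimal, schematic PyTorch sketch of a detector-free, transformer-based grounding model. It is a simplified illustration, not the actual VGTR implementation; all module names, dimensions, and the use of a plain torch.nn.Transformer are assumptions made only for this sketch.

# Minimal schematic of a detector-free, transformer-based grounding model.
# Illustrative sketch, NOT the actual VGTR code: names, shapes, and the
# plain nn.Transformer are assumptions (batch_first needs PyTorch >= 1.9).
import torch
import torch.nn as nn
import torchvision

class GroundingSketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        # Visual encoder: a plain CNN backbone (ResNet-50), no object detector.
        backbone = torchvision.models.resnet50()
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # Textual encoder: token embeddings learned from scratch (no pretrained word vectors).
        self.word_emb = nn.Embedding(vocab_size, d_model)
        # Transformer encoder-decoder fusing the two modalities.
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        # Regression head predicting a single box (cx, cy, w, h) in [0, 1].
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, images, tokens):
        feat = self.proj(self.cnn(images))             # (B, d, h, w)
        feat = feat.flatten(2).transpose(1, 2)         # (B, h*w, d) visual tokens
        text = self.word_emb(tokens)                   # (B, L, d) textual tokens
        # Visual tokens are encoded and decoded under the guidance of the text,
        # and the fused features are pooled for box regression.
        fused = self.transformer(src=feat, tgt=text)   # (B, L, d)
        return self.box_head(fused.mean(dim=1))        # (B, 4)

# Toy forward pass with random data.
model = GroundingSketch()
boxes = model(torch.randn(2, 3, 512, 512), torch.randint(0, 10000, (2, 15)))
print(boxes.shape)  # torch.Size([2, 4])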


Prerequisites

  • python 3.6
  • pytorch>=1.6.0
  • torchvision
  • CUDA>=9.0
  • others (opencv-python etc.)
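A quick way to confirm that the environment satisfies these requirements is a short version check (a convenience snippet, not part of the repository):

# Print installed versions and CUDA availability to verify the prerequisites.
import sys
import torch
import torchvision
import cv2

print("python     :", sys.version.split()[0])       # expect 3.6+
print("pytorch    :", torch.__version__)            # expect >= 1.6.0
print("torchvision:", torchvision.__version__)
print("opencv     :", cv2.__version__)
print("CUDA       :", torch.cuda.is_available(), torch.version.cuda)  # expect CUDA >= 9.0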

Preparation

  1. Clone this repository.

  2. Data preparation.

    Download the Flickr30K Entities annotations from Flickr30k Entities (bryanplummer.com) and the images from Flickr30K.

    Download MSCOCO images from MSCOCO

    Download the processed indexes (provided by zyang-ur) from Gdrive.

  3. Download the backbone weights. We use ResNet-50/101 as the basic visual encoder. The weights are pretrained on MSCOCO and can be downloaded from BaiduDrive:

    ResNet-50(code:ru8v); ResNet-101(code:0hgu).

  4. Organize all files like this:

.
├── main.py
├── store
│   ├── data
│   │   ├── flickr
│   │   │   ├── corpus.pth
│   │   │   └── flickr_train.pth
│   │   ├── gref
│   │   └── gref_umd
│   ├── ln_data
│   │   ├── Flickr30k
│   │   │   └── flickr30k-images
│   │   └── other
│   │       └── images
│   ├── pretrained
│   │   └── flickr_R50.pth.tar
│   └── pth
│       └── resnet50_detr.pth
└── work
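Before training, you can quickly verify that the expected files are in place with a small check like the one below (the paths mirror the tree above for the Flickr30K setup; adjust them for other datasets):

# Check that the files from the directory tree above exist (Flickr30K example).
import os

expected = [
    "store/data/flickr/corpus.pth",
    "store/data/flickr/flickr_train.pth",
    "store/ln_data/Flickr30k/flickr30k-images",
    "store/pth/resnet50_detr.pth",
]

for path in expected:
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:7s} {path}")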

Model Zoo

Dataset             Backbone    Accuracy                Pretrained Model (BaiduDrive)
Flickr30K Entities  ResNet-50   74.17                   flickr_R50.pth.tar    (code: rpdr)
Flickr30K Entities  ResNet-101  75.32                   flickr_R101.pth.tar   (code: 1igb)
RefCOCO             ResNet-50   78.70 / 82.09 / 73.31   refcoco_R50.pth.tar   (code: xjs8)
RefCOCO             ResNet-101  79.30 / 82.16 / 74.38   refcoco_R101.pth.tar  (code: bv0z)
RefCOCO+            ResNet-50   63.57 / 69.65 / 55.33   refcoco+_R50.pth.tar  (code: 521n)
RefCOCO+            ResNet-101  64.40 / 70.85 / 55.84   refcoco+_R101.pth.tar (code: vzld)
RefCOCOg            ResNet-50   62.88                   refcocog_R50.pth.tar  (code: wb3x)
RefCOCOg            ResNet-101  64.05                   refcocog_R101.pth.tar (code: 5ok2)
RefCOCOg-umd        ResNet-50   65.62 / 65.30           umd_R50.pth.tar       (code: 9lzr)
RefCOCOg-umd        ResNet-101  66.83 / 67.28           umd_R101.pth.tar      (code: zen0)

Rows with several accuracy values list the results on the dataset's different evaluation splits.
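As is standard in visual grounding, accuracy counts a prediction as correct when its box overlaps the ground-truth box with IoU greater than 0.5. The snippet below is a minimal, self-contained sketch of that metric; it is not taken from this codebase:

# Grounding accuracy: fraction of predictions whose IoU with the ground truth
# exceeds 0.5. Boxes are (x1, y1, x2, y2); this sketch is independent of the repo.
import torch

def box_iou(pred, gt):
    lt = torch.max(pred[:, :2], gt[:, :2])                        # intersection top-left
    rb = torch.min(pred[:, 2:], gt[:, 2:])                        # intersection bottom-right
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_g = (gt[:, 2:] - gt[:, :2]).clamp(min=0).prod(dim=1)
    return inter / (area_p + area_g - inter + 1e-6)

def grounding_accuracy(pred, gt, thresh=0.5):
    return (box_iou(pred, gt) > thresh).float().mean().item()

# Toy example: one hit and one miss -> accuracy 0.5.
pred = torch.tensor([[10., 10., 100., 100.], [0., 0., 20., 20.]])
gt   = torch.tensor([[12., 12.,  98., 105.], [50., 50., 90., 90.]])
print(grounding_accuracy(pred, gt))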

Train

python main.py \
   --gpu $gpu_id \
   --dataset $[refcoco | refcoco+ | others] \
   --batch_size $bs \
   --savename $exp_name \
   --backbone $[resnet50 | resnet101] \
   --cnn_path $resnet_coco_weight_path

Inference

Download the pretrained models and put them into the folder ./store/pretrained/.

python main.py \
   --test \
   --gpu $gpu_id \
   --dataset $[refcoco | refcoco+ | others] \
   --batch_size $bs \
   --pretrain $pretrained_weight_path
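If you want to double-check a downloaded checkpoint before running inference, it can be opened directly with torch.load. Note that the key names below (e.g. 'state_dict') are only a guess at the usual .pth.tar layout, not a documented format of this repository:

# Inspect a downloaded checkpoint; 'state_dict' is an assumed key name.
import torch

ckpt = torch.load("store/pretrained/flickr_R50.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state = ckpt.get("state_dict", ckpt)          # fall back to the raw dict
    for name, value in list(state.items())[:10]:  # peek at the first few entries
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(f"{name:60s} {shape}")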

Acknowledgements

Parts of the code are from:

  1. facebookresearch/detr
  2. zyang-ur/onestage_grounding
  3. andfoy/refer
  4. jadore801120/attention-is-all-you-need-pytorch.

Citation

@article{du2021visual,
  title={Visual grounding with transformers},
  author={Du, Ye and Fu, Zehua and Liu, Qingjie and Wang, Yunhong},
  journal={arXiv preprint arXiv:2105.04281},
  year={2021}
}

@inproceedings{du2022visual,
  title={Visual grounding with transformers},
  author={Du, Ye and Fu, Zehua and Liu, Qingjie and Wang, Yunhong},
  booktitle={Proceedings of the International Conference on Multimedia and Expo},
  year={2022}
}