
Official implementation of the paper "Referring Transformer: A One-step Approach to Multi-task Visual Grounding" (NeurIPS 2021)

Primary language: Python. License: MIT.


Code for paper "Referring Transformer: A One-step Approach to Multi-task Visual Grounding"


To install requirements:

pip install -r requirements.txt
chmod +x tools/run_dist_slurm.sh 

Setting up dataset

ReSC annotations preparation: https://drive.google.com/file/d/1fVwdDvXNbH8uuq_pHD_o5HI7yqeuz0yS/view?usp=sharing

Flickr30k Entities: http://bryanplummer.com/Flickr30kEntities/

MSCOCO: http://mscoco.org/dataset/#overview

Visual Genome Images: https://visualgenome.org/api/v0/api_home.html

data/annotations: https://drive.google.com/file/d/19qJ8b5sxijKmtN0XG9leWbt2sPkIVqlc/view?usp=sharing

refcoco/masks: https://drive.google.com/file/d/1oGUewiDtxjouT8Qp4dRzrPfGkc0LZaIT/view?usp=sharing

refcoco/anns: https://drive.google.com/file/d/1Prhrgm3t2JeY68Ni_1Ig_a4dfZvGC9vZ/view?usp=sharing

annotations_resc/vg/vg_all.pth: https://drive.google.com/file/d/1_GbWl0sSB1y26fFM9W7DDkXLRR8Ld3IH/view?usp=sharing

Extract the datasets into the data/ folder. (Tip: you can use symbolic links to avoid keeping data and code in the same directory.) The data/ folder should look like this:

├── annotations
├── annotations_resc
│   ├── flickr
│   ├── gref
│   ├── gref_umd
│   ├── referit
│   ├── unc
│   ├── unc+
│   └── vg
├── flickr30k
│   └── f30k_images
├── refcoco
│   ├── anns
│   ├── images
│   │   └── train2014  # images from train 2014
│   └── masks
├── referit
│   └── images
└── visualgenome
    └── VG_100K
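
Before training, it can help to confirm that everything was extracted to the right place. The following is a minimal sketch, not part of the repository: it lists the sub-directories from the tree above and reports which ones are absent. The names `EXPECTED_DIRS` and `missing_dirs` are made up for this example, and the script assumes it is run from the repository root with the data in `data/`.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to the data/ root.
EXPECTED_DIRS = [
    "annotations",
    "annotations_resc/flickr",
    "annotations_resc/gref",
    "annotations_resc/gref_umd",
    "annotations_resc/referit",
    "annotations_resc/unc",
    "annotations_resc/unc+",
    "annotations_resc/vg",
    "flickr30k/f30k_images",
    "refcoco/anns",
    "refcoco/images/train2014",
    "refcoco/masks",
    "referit/images",
    "visualgenome/VG_100K",
]


def missing_dirs(data_root):
    """Return the expected sub-directories that are absent under data_root."""
    root = Path(data_root)
    # Path.is_dir() follows symbolic links, so soft-linked datasets count as present.
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("data")
    if missing:
        print("Missing directories under data/:")
        for d in missing:
            print("  " + d)
    else:
        print("data/ layout looks complete.")
```

The check only looks at directory names; it does not verify that the images or annotation files inside them are complete.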


To train the model, run:

# using slurm system
MASTER_PORT=${Master Port} GPUS_PER_NODE=${GPU per node} ./tools/run_dist_slurm.sh RefTR ${Number Of GPU} ${config file name}


MASTER_PORT=29501 GPUS_PER_NODE=4  ./tools/run_dist_slurm.sh  RefTR 4 configs/flickr30k/RefTR_flickr.sh 


To evaluate the model, run:

MASTER_PORT=${Master Port} GPUS_PER_NODE=${GPU per node} ./tools/run_dist_slurm.sh RefTR ${Number Of GPU} ${config file name} --eval --resume=${path to checkpoint}


MASTER_PORT=29501 GPUS_PER_NODE=4  ./tools/run_dist_slurm.sh  RefTR 4 configs/flickr30k/RefTR_flickr.sh --eval --resume=./exps/flickr30k/checkpoint.pth
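
If you launch many runs, the training and evaluation commands above can be assembled from Python instead of typed by hand. This is a rough sketch under stated assumptions, not something shipped with the repository: `build_launch` and `launch` are hypothetical helpers, the argument order simply mirrors the commands shown above, and the parameter name `name` for the `RefTR` argument is a guess at its meaning.

```python
import os
import shlex
import subprocess


def build_launch(config, num_gpus, gpus_per_node, master_port=29501,
                 name="RefTR", checkpoint=None):
    """Return (env_overrides, argv) mirroring the commands in this README.

    Passing `checkpoint` appends --eval --resume=<path>, as in the
    evaluation command; leaving it as None yields the training command.
    """
    env = {"MASTER_PORT": str(master_port), "GPUS_PER_NODE": str(gpus_per_node)}
    # `name` is the literal RefTR argument used above (its meaning is assumed).
    argv = ["./tools/run_dist_slurm.sh", name, str(num_gpus), config]
    if checkpoint is not None:
        argv += ["--eval", "--resume=" + checkpoint]
    return env, argv


def launch(env, argv, dry_run=True):
    """Print the command line; execute it only when dry_run is False."""
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(prefix, shlex.join(argv))
    if not dry_run:
        # Overlay the two variables on the current environment and run the script.
        subprocess.run(argv, env={**os.environ, **env}, check=True)


# Usage (prints the evaluation command without running it):
# env, argv = build_launch("configs/flickr30k/RefTR_flickr.sh", 4, 4,
#                          checkpoint="./exps/flickr30k/checkpoint.pth")
# launch(env, argv)
```

Keeping `dry_run=True` as the default makes it easy to inspect the exact command before submitting anything to the cluster.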

Pretrained checkpoints for RefCOCO RES/REC

Checkpoint Name                       Dataset/Link   Description
refcoco_SEG_PT_res50_6_epochs.pth     refcoco        Pretrained 6 epochs on VG
refcoco+_SEG_PT_res50_6_epochs.pth    refcoco+       Pretrained 6 epochs on VG
refcocog_SEG_PT_res50_6_epochs.pth    refcocog       Pretrained 6 epochs on VG


If you find this code useful for your research, please cite our paper:

  title={Referring Transformer: A One-step Approach to Multi-task Visual Grounding},
  author={Li, Muchen and Sigal, Leonid},
  booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},