The official implementation of the paper:

VLFormer: Visual-Linguistic Transformer
for Referring Image Segmentation

Abstract

The paper is currently under review.

Demo

Main Results

Main results on RefCOCO

Backbone    val     test A   test B
ResNet50    73.92   76.03    70.86
ResNet101   74.67   76.80    70.42

Main results on RefCOCO+

Backbone    val     test A   test B
ResNet50    64.02   69.74    55.04
ResNet101   64.80   70.33    56.33

Main results on G-Ref

Backbone    val     test
ResNet50    65.69   65.90
ResNet101   66.77   66.52


Requirements

We tested our code in the following environment; other versions may also be compatible (a setup sketch follows the list):

  • CUDA 11.1
  • Python 3.8
  • PyTorch 1.9.0
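
A minimal sketch of a matching environment (the environment name is illustrative; installation.md remains the authoritative reference):

conda create -n vlformer python=3.8 -y
conda activate vlformer
# PyTorch 1.9.0 built against CUDA 11.1, per the versions above
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html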

Installation

Please refer to installation.md for installation instructions.
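
The train_net_video.py entry point and the OUTPUT_DIR / MODEL.WEIGHTS command-line overrides follow the detectron2 convention, so the setup presumably includes detectron2 itself. As a rough sketch only (this is an assumption; follow installation.md for the actual steps):

# Assumption: the project builds on detectron2
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'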

Data preparation

Please refer to data.md for data preparation.

Training

sh scripts/train.sh 

or

python train_net_video.py --config-file <config-path> --num-gpus <num-gpus> OUTPUT_DIR <output-dir>

For example, to train the ResNet101-backbone model on the RefCOCO dataset with 2 GPUs:

python train_net_video.py --config-file configs/refcoco/VLFormer_R101_bs8_100k.yaml --num-gpus 2 OUTPUT_DIR output/refcoco-RN101

To resume a previous training run, add the --resume flag.
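
For example, to continue the ResNet101 RefCOCO run from above (this assumes the checkpoints live in the same output directory):

python train_net_video.py --config-file configs/refcoco/VLFormer_R101_bs8_100k.yaml --num-gpus 2 --resume OUTPUT_DIR output/refcoco-RN101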

Evaluation

python train_net_video.py --config-file <config-path> --num-gpus <num-gpus> --eval-only OUTPUT_DIR <output-dir> MODEL.WEIGHTS <weight-path>
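
For example, to evaluate the ResNet101 RefCOCO model trained above (model_final.pth is detectron2's default name for the final checkpoint and is illustrative here):

python train_net_video.py --config-file configs/refcoco/VLFormer_R101_bs8_100k.yaml --num-gpus 2 --eval-only OUTPUT_DIR output/refcoco-RN101 MODEL.WEIGHTS output/refcoco-RN101/model_final.pth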