The official implementation of the paper:

VLFormer: Visual-Linguistic Transformer
for Referring Image Segmentation

Abstract

The paper is currently under review.

Demo

Main Results

Main results on RefCOCO

Backbone    val     test A   test B
ResNet50    73.92   76.03    70.86
ResNet101   74.67   76.80    70.42

Main results on RefCOCO+

Backbone    val     test A   test B
ResNet50    64.02   69.74    55.04
ResNet101   64.80   70.33    56.33

Main results on G-Ref

Backbone    val     test
ResNet50    65.69   65.90
ResNet101   66.77   66.52


Requirements

We tested our code in the following environment; other versions may also be compatible (a setup sketch follows the list):

  • CUDA 11.1
  • Python 3.8
  • PyTorch 1.9.0
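
A minimal sketch of a matching environment (the environment name is illustrative; installation.md remains the authoritative reference):

conda create -n vlformer python=3.8 -y
conda activate vlformer
# PyTorch 1.9.0 built against CUDA 11.1, per the versions above
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html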

Installation

Please refer to installation.md for installation instructions.
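
The train_net_video.py entry point and the OUTPUT_DIR / MODEL.WEIGHTS command-line overrides follow the detectron2 convention, so the setup presumably includes detectron2 itself. As a rough sketch only (this is an assumption; follow installation.md for the actual steps):

# Assumption: the project builds on detectron2
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'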

Data preparation

Please refer to data.md for data preparation.

Training

sh scripts/train.sh 

or

python train_net_video.py --config-file <config-path> --num-gpus <num-gpus> OUTPUT_DIR <output-dir>

For example, to train the ResNet101-backbone model on the RefCOCO dataset with 2 GPUs:

python train_net_video.py --config-file configs/refcoco/VLFormer_R101_bs8_100k.yaml --num-gpus 2 OUTPUT_DIR output/refcoco-RN101

To resume a previous training run, add the --resume flag.
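
For example, to continue the ResNet101 RefCOCO run from above (this assumes the checkpoints live in the same output directory):

python train_net_video.py --config-file configs/refcoco/VLFormer_R101_bs8_100k.yaml --num-gpus 2 --resume OUTPUT_DIR output/refcoco-RN101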

Evaluation

python train_net_video.py --config-file <config-path> --num-gpus <num-gpus> --eval-only OUTPUT_DIR <output-dir> MODEL.WEIGHTS <weight-path>
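
For example, to evaluate the ResNet101 RefCOCO model trained above (model_final.pth is detectron2's default name for the final checkpoint and is illustrative here):

python train_net_video.py --config-file configs/refcoco/VLFormer_R101_bs8_100k.yaml --num-gpus 2 --eval-only OUTPUT_DIR output/refcoco-RN101 MODEL.WEIGHTS output/refcoco-RN101/model_final.pth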