TransCP

[TPAMI 2024] This is the PyTorch code for our paper "Context Disentangling and Prototype Inheriting for Robust Visual Grounding".


Context Disentangling and Prototype Inheriting for Robust Visual Grounding

Wei Tang*1, Liang Li2, Xuejing Liu3, Lu Jin1, Jinhui Tang1, Zechao Li✉1
1Nanjing University of Science and Technology; 2Institute of Computing Technology, Chinese Academy of Sciences; 3SenseTime Research
Corresponding Author


Updates

  • 28 Nov, 2023: 💥💥 Our paper "Context Disentangling and Prototype Inheriting for Robust Visual Grounding" has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
  • 3 June, 2024: 💥💥 The code has been released.
  • 19 June, 2024: 💥💥 The checkpoints have been released.

This repository contains the official implementation and checkpoints of the following paper:

Context Disentangling and Prototype Inheriting for Robust Visual Grounding

Abstract: Visual grounding (VG) aims to locate a specific target in an image based on a given language query. The discriminative information from context is important for distinguishing the target from other objects, particularly for targets that have the same category as others. However, most previous methods underestimate such information. Moreover, they are usually designed for the standard scene (without any novel object), which limits their generalization to the open-vocabulary scene. In this paper, we propose a novel framework with context disentangling and prototype inheriting for robust visual grounding to handle both scenes. Specifically, the context disentangling disentangles the referent and context features, which achieves better discrimination between them. The prototype inheriting inherits the prototypes discovered from the disentangled visual features by a prototype bank to fully utilize the seen data, especially for the open-vocabulary scene. The fused features, obtained by applying the Hadamard product to the disentangled linguistic and visual prototype features to avoid sharply adjusting the importance between the two types of features, are then attached with a special token and fed to a vision Transformer encoder for bounding box regression. Extensive experiments are conducted on both standard and open-vocabulary scenes. The performance comparisons indicate that our method outperforms the state-of-the-art methods in both scenarios. The code is available at https://github.com/WayneTomas/TransCP.
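For readers who want a concrete picture of the pipeline described above, below is a minimal, self-contained PyTorch sketch of two of its ingredients: a prototype bank that returns the closest stored prototype for each disentangled visual feature, and the Hadamard-product fusion whose output, together with a special regression token, is fed to a Transformer encoder that regresses the box. All module names, tensor shapes, and the EMA update rule are illustrative assumptions, not the design used in this repository; please refer to the code here for the actual implementation.

```python
# Illustrative sketch only -- not the TransCP implementation. Shapes, names,
# and the EMA update rule are assumptions chosen for readability.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeBank(nn.Module):
    """Stores visual prototypes and returns the closest one for each feature."""

    def __init__(self, num_prototypes=100, dim=256, momentum=0.99):
        super().__init__()
        self.register_buffer("bank", torch.randn(num_prototypes, dim))
        self.momentum = momentum

    @torch.no_grad()
    def update(self, feats, idx):
        # EMA update of the matched prototypes with the current visual features.
        self.bank[idx] = self.momentum * self.bank[idx] + (1.0 - self.momentum) * feats

    def inherit(self, feats):
        # Cosine similarity between features [M, dim] and prototypes [P, dim].
        sim = F.normalize(feats, dim=-1) @ F.normalize(self.bank, dim=-1).t()
        idx = sim.argmax(dim=-1)
        return self.bank[idx], idx


class FusionRegressor(nn.Module):
    """Hadamard fusion + special regression token + Transformer encoder."""

    def __init__(self, dim=256, num_layers=6, num_heads=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.bbox_head = nn.Linear(dim, 4)  # (cx, cy, w, h), sigmoid-normalized

    def forward(self, vis_protos, lang_feats):
        # Element-wise (Hadamard) product keeps both modalities on equal footing
        # instead of re-weighting one against the other.
        fused = vis_protos * lang_feats                          # [B, N, dim]
        reg = self.reg_token.expand(fused.size(0), -1, -1)       # [B, 1, dim]
        tokens = torch.cat([reg, fused], dim=1).transpose(0, 1)  # [N+1, B, dim]
        out = self.encoder(tokens)
        return self.bbox_head(out[0]).sigmoid()                  # box from the special token


if __name__ == "__main__":
    # Toy forward pass with random tensors standing in for disentangled features.
    bank, model = PrototypeBank(), FusionRegressor()
    vis = torch.randn(2, 20, 256)   # disentangled visual features
    lang = torch.randn(2, 20, 256)  # disentangled linguistic features (aligned to N)
    protos, _ = bank.inherit(vis.flatten(0, 1))
    boxes = model(protos.view(2, 20, 256), lang)
    print(boxes.shape)  # torch.Size([2, 4])
```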

Todo

  1. Update the README.
  2. Release the codes.
  3. Release the checkpoints.
  4. [ ] Release the adapted code/checkpoints of the compared methods.

Get Started

Install

git clone https://github.com/WayneTomas/TransCP.git
conda create -n pytorch1.7 python=3.6.13
conda activate pytorch1.7
pip install -r requirements.txt
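Optionally, a quick sanity check can confirm that the environment imports PyTorch and sees the GPUs (the exact versions are pinned by requirements.txt; this snippet only verifies the install):

```python
# Optional sanity check after installation.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```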

Dataset

Please follow the instructions of VLTVG/TransVG for dataset preparation.

Checkpoint

checkpoints

The original results reported in the paper are from the model trained on 2 RTX 3090 GPUs; the re-implemented results are from the model trained on 2 V100 GPUs.

| Dataset | Split | Original | Re-implemented |
|---|---|---|---|
| ReferIt | test | 72.05% | 72.56% |
| Flickr30K Entities | test | 80.04% | 79.47% |
| RefCOCO | val | 84.25% | 84.62% |
| RefCOCO | testA | 87.38% | 87.36% |
| RefCOCO | testB | 79.78% | 80.00% |
| RefCOCO+ | val | 73.07% | 73.09% |
| RefCOCO+ | testA | 78.05% | 78.27% |
| RefCOCO+ | testB | 63.35% | 63.14% |
| RefCOCOg | val | 72.60% | 72.14% |

Train

The following is an example of model training on the RefCOCO dataset.

python -m torch.distributed.launch --nproc_per_node=2 --master_port=29516 train.py --config configs/TransCP_R50_unc.py

Inference

For the standard scene: train on the RefCOCO train split, test on the RefCOCO testB split

python -m torch.distributed.launch --nproc_per_node=1 --master_port=29516 test.py --config configs/TransCP_R50_unc.py --checkpoint outputs/unc/public/checkpoint_best_acc.pth --batch_size_test 16 --test_split testB

For the open-vocabulary scene: train on Ref-Reasoning, test on the RefCOCO testB split

python -m torch.distributed.launch --nproc_per_node=1 --master_port=29539 --use_env test.py --config configs/TransCP_R50_unc.py --checkpoint outputs/ref_reasoning/publick/checkpoint_best_acc.pth --batch_size_test 16 --test_split testB

Cite

@article{tang2023context,
  title={Context Disentangling and Prototype Inheriting for Robust Visual Grounding},
  author={Tang, Wei and Li, Liang and Liu, Xuejing and Jin, Lu and Tang, Jinhui and Li, Zechao},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  doi={10.1109/TPAMI.2023.3339628},
  year={2023}
}
paper link: https://arxiv.org/pdf/2312.11967

Acknowledgement

Part of our code is based on the previous works DETR, TransVG, and VLTVG; we thank their authors. We also thank Prof. Sibei Yang for providing the Ref-Reasoning dataset.