RefCLIP


This is the official implementation of "RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension". In this paper, we propose a novel one-stage contrastive model called RefCLIP, which achieves weakly supervised referring expression comprehension (REC) via anchor-based cross-modal contrastive learning. Based on RefCLIP, we further propose a weakly supervised training scheme for common REC models: RefCLIP serves as a universal teacher that can train any REC model by means of pseudo-labels.
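To give a flavor of the approach, below is a minimal PyTorch sketch of anchor-based cross-modal contrastive learning. It is illustrative only: the feature shapes, the argmax anchor selection, and the symmetric InfoNCE loss are simplifying assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def anchor_text_contrastive_loss(anchor_feats, text_feats, temperature=0.07):
    # anchor_feats: (B, N, D), N anchor features per image (e.g., from a YOLOv3 backbone)
    # text_feats:   (B, D), one embedding per referring expression
    anchor_feats = F.normalize(anchor_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Score every anchor against its paired expression and keep the best one
    # (a stand-in for RefCLIP's anchor selection).
    sims = torch.einsum('bnd,bd->bn', anchor_feats, text_feats)  # (B, N)
    best = sims.argmax(dim=1)                                    # (B,)
    selected = anchor_feats[torch.arange(anchor_feats.size(0)), best]  # (B, D)
    # Batch-wise InfoNCE: matched image-text pairs are positives,
    # all other pairs in the batch are negatives.
    logits = selected @ text_feats.t() / temperature             # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2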

Installation

  • Clone this repo
git clone https://github.com/AnonymousPaperID5299/RefCLIP.git
cd RefCLIP
  • Create a conda virtual environment and activate it
conda create -n refclip python=3.7 -y
conda activate refclip
  • Build the DCN (deformable convolution) operator, then return to the repo root
cd utils/DCN
./make.sh
cd ../..
  • Install the remaining dependencies, including the spaCy word vectors
pip install -r requirements.txt
wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
pip install en_vectors_web_lg-2.1.0.tar.gz

Data Preparation

  • Download the images and generate the annotations following SimREC.

  • Download the pretrained YOLOv3 weights from OneDrive.

  • The project structure should look like the following:

| -- RefCLIP
     | -- data
        | -- anns
            | -- refcoco.json
            | -- refcoco+.json
            | -- refcocog.json
            | -- refclef.json
        | -- images
            | -- train2014
                | -- COCO_train2014_000000000072.jpg
                | -- ...
            | -- refclef
                | -- 25.jpg
                | -- ...
     | -- config
     | -- datasets
     | -- models
     | -- utils
  • NOTE: our YOLOv3 is trained on COCO's training images, excluding those that appear in the validation and test sets of RefCOCO, RefCOCO+, and RefCOCOg. A quick sanity check for the layout above is sketched below.
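Before training, a short script like the following can verify the data layout; the file and directory names are taken from the project tree above, and the script itself is a hypothetical helper, not part of this repo.

from pathlib import Path

root = Path('data')
# Annotation files and image directories as listed in the project tree.
for name in ['refcoco.json', 'refcoco+.json', 'refcocog.json', 'refclef.json']:
    path = root / 'anns' / name
    print(f"{path}: {'OK' if path.is_file() else 'MISSING'}")
for name in ['train2014', 'refclef']:
    path = root / 'images' / name
    count = len(list(path.glob('*.jpg'))) if path.is_dir() else 0
    print(f"{path}: {count} jpg files")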

RefCLIP

Training

python train.py --config ./config/[DATASET_NAME].yaml
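For example, to train on RefCOCO (assuming a config file named after the annotation file, e.g. refcoco.yaml; check the config directory for the exact names):

python train.py --config ./config/refcoco.yaml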

Evaluation

python test.py --config ./config/[DATASET_NAME].yaml --eval-weights [PATH_TO_CHECKPOINT_FILE]
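For example, to evaluate a RefCOCO checkpoint (both the config name and the checkpoint path below are placeholders; substitute your own):

python test.py --config ./config/refcoco.yaml --eval-weights ./logs/refcoco/best.pth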

Weakly Supervised Training Scheme
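In this scheme, a trained RefCLIP serves as the teacher: it predicts a box for each image-expression pair, and those predictions are used as pseudo-labels to train a student REC model (e.g., RealGIN, SimREC, TransVG, or MattNet; see the Model Zoo below). A minimal, hypothetical sketch of the idea, where teacher, student, and rec_loss are stand-ins rather than APIs from this repo:

import torch

@torch.no_grad()
def make_pseudo_labels(teacher, images, expressions):
    # The frozen teacher (RefCLIP) predicts one box per image-expression pair.
    teacher.eval()
    return teacher(images, expressions)  # (B, 4) boxes

def train_step(student, teacher, optimizer, rec_loss, images, expressions):
    # Teacher predictions stand in for ground-truth boxes.
    pseudo_boxes = make_pseudo_labels(teacher, images, expressions)
    pred_boxes = student(images, expressions)
    loss = rec_loss(pred_boxes, pseudo_boxes)  # e.g., L1 + GIoU, an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()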

Model Zoo

RefCLIP

Method             RefCOCO               RefCOCO+              RefCOCOg   ReferItGame
                   val    testA  testB   val    testA  testB   val-g      test
RefCLIP            60.36  58.58  57.13   40.39  40.45  38.86   47.87      39.58

Weakly Supervised Training Scheme

Method             RefCOCO               RefCOCO+              RefCOCOg   ReferItGame
                   val    testA  testB   val    testA  testB   val-g      test
RefCLIP_RealGIN    59.43  58.49  57.36   37.08  38.70  35.82   46.10      37.56
RefCLIP_SimREC     62.57  62.70  61.22   39.13  40.81  36.59   45.68      42.33
RefCLIP_TransVG    64.08  63.67  63.93   39.32  39.54  36.29   45.70      42.64
RefCLIP_MattNet    69.31  67.23  71.27   43.01  44.80  41.09   51.31      -

Citation

@InProceedings{Jin_2023_CVPR,
    author    = {Jin, Lei and Luo, Gen and Zhou, Yiyi and Sun, Xiaoshuai and Jiang, Guannan and Shu, Annan and Ji, Rongrong},
    title     = {RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {2681-2690}
}

Acknowledgement

Many thanks for the nicely organized code from the following repos: