Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding

DDPN

This project is the implementation of the paper Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. The network architecture of our visual grounding model with DDPN is illustrated in Figure 1.

Figure 1: The model architecture for our visual grounding model.

Requirements

  • Python version 2.7
  • easydict
  • cv2
  • PyTorch 0.3 (optional but recommended; used to speed up multi-threaded data loading)

Pretrained Models

We release the trained models on four datasets, which achieve slightly better results than those reported in the paper.

Datasets  Flickr30k-Entities  Referit  Refcoco  Refcoco+
val       72.78%              63.77%   76.61%   64.34%
test      73.45%              63.27%   76.23%   64.01%
testA     -                   -        79.99%   71.24%
testB     -                   -        72.11%   55.55%

  1. Download the pretrained models from BaiduYun.
  2. Unzip the model files into the directory './pretrained_model'.
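
The accuracies in the table above are grounding accuracies. As a rough illustration of how such a number is commonly computed, the sketch below counts a prediction as correct when its intersection over union (IoU) with the ground-truth box reaches 0.5. The function names, the (x1, y1, x2, y2) box layout, and the exact threshold comparison are assumptions made for this example and may differ from the evaluation code in this repository.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, float(ix2 - ix1)) * max(0.0, float(iy2 - iy1))
    area_a = float(box_a[2] - box_a[0]) * float(box_a[3] - box_a[1])
    area_b = float(box_b[2] - box_b[0]) * float(box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth is >= threshold.

    The 0.5 threshold is an assumption based on common practice.
    """
    if not gt_boxes:
        return 0.0
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes) if iou(p, g) >= threshold)
    return float(hits) / len(gt_boxes)
```

For example, one exact match and one prediction shifted by half the box width (IoU of 1/3) give an accuracy of 0.5.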

Preprocess

  • Caffe

    cd ./caffe
    make all -j32
    make pycaffe
    
  • Download the images (only the images are needed here)

    • flickr30k-entities
    • referit, download the Referit Images.
      wget -O ./data/referit/ImageCLEF/referitdata.tar.gz http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/referitdata.tar.gz
      tar -xzvf ./data/referit/ImageCLEF/referitdata.tar.gz -C ./data/referit/ImageCLEF/
      
    • refcoco/refcoco+, download the mscoco train2014 Images
      • mscoco train2014.
      • Move the mscoco train2014 images to the directory './data/mscoco/image2014/train2014/'.
  • Extract DDPN image features. For a 3×h×w image, we extract a 2048-D visual feature and a 4-D spatial feature (post-processed to 5-D) for each proposal as the input features for our model. The script we use is as follows. Note that we use --num_bbox 100,100 to extract a fixed number of proposals (K=100) for each image.

    ./tools/extract_feat.py --gpu 0,1,2,3 --cfg experiments/cfgs/faster_rcnn_end2end_resnet_vg.yml --def models/vg/ResNet-101/faster_rcnn_end2end/test.prototxt --net /path/to/caffemodel --img_dir /path/to/images/ --out_dir /path/to/outfeat/ --num_bbox 100,100 --feat_name pool5_flat
    
    • For flickr30k and referit, the image features are written to the directory 'data/[flickr30k, referit]/features/bottom-up-feats/' by default. For refcoco/refcoco+, the image features are written to 'data/mscoco/features/bottom-up-feats/train2014'.
  • Download the annotation files. We preprocessed the annotations of flickr30k-entities, referit, refcoco, and refcoco+ so that all datasets share the same format. Download our processed annotations from BaiduYun, then unzip the zip files into the directory './data'. We will release the annotation preprocessing code in the directory './preprocess'.

  • Modify the paths in the config files to match your own environment: set the number of data loader threads, the image features directory, and the images directory in the yaml config files in the directory './config/experiments/'.
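
The feature extraction step above mentions that the 4-D spatial feature of each proposal is post-processed to 5-D. The following is a minimal sketch of one plausible way to do this, assuming the box is given in pixels as (x1, y1, x2, y2) and that the fifth dimension is the box area relative to the image area. The function name and the exact normalization are assumptions for illustration, not necessarily what the extraction script does.

```python
def spatial_feature(box, img_w, img_h):
    """Turn a 4-D pixel box (x1, y1, x2, y2) into a 5-D normalized feature.

    Assumed layout: [x1/W, y1/H, x2/W, y2/H, box_area / image_area].
    """
    x1, y1, x2, y2 = [float(v) for v in box]
    img_w, img_h = float(img_w), float(img_h)
    # Relative area of the box with respect to the whole image.
    rel_area = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, rel_area]
```

Normalizing by the image size keeps the spatial feature in the range [0, 1] regardless of resolution, so proposals from images of different sizes are comparable.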

Training

  • flickr30k-entities
    python train_net.py --gpu_id 0 --train_split train --val_split val --cfg config/experiments/flickr30k-kld-bbox_reg.yaml
    
  • referit
    python train_net.py --gpu_id 0 --train_split train --val_split val --cfg config/experiments/referit-kld-bbox_reg.yaml
    
  • refcoco
    python train_net.py --gpu_id 0 --train_split train --val_split val --cfg config/experiments/refcoco-kld-bbox_reg.yaml
    
  • refcoco+
    python train_net.py --gpu_id 0 --train_split train --val_split val --cfg config/experiments/refcoco+-kld-bbox_reg.yaml
    
  • The output models are saved in the directory './models'.
  • The validation logs are written to the directory './log'.
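
The four training commands above differ only in the config file name. As a convenience, the sketch below builds the same command lines programmatically. The helper name train_command and the DATASETS list are made up for this example; the flags and config paths are copied from the commands above.

```python
# Config name prefixes as they appear in config/experiments/.
DATASETS = ["flickr30k", "referit", "refcoco", "refcoco+"]


def train_command(dataset, gpu_id=0):
    """Return the training command for one dataset as an argument list."""
    cfg = "config/experiments/{}-kld-bbox_reg.yaml".format(dataset)
    return [
        "python", "train_net.py",
        "--gpu_id", str(gpu_id),
        "--train_split", "train",
        "--val_split", "val",
        "--cfg", cfg,
    ]


if __name__ == "__main__":
    # Print the commands; nothing is executed here.
    for name in DATASETS:
        print(" ".join(train_command(name)))
```

To run a command instead of printing it, the argument list can be passed to subprocess.call from the Python standard library.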

Testing

  • flickr30k-entities
    python test_net.py --gpu_id 0 --test_split test --batchsize 64 --test_net pretrained_model/flickr30k/test.prototxt --pretrained_model pretrained_model/flickr30k/final.caffemodel --cfg config/experiments/flickr30k-kld-bbox_reg.yaml
    
  • referit
    python test_net.py --gpu_id 0 --test_split test --batchsize 64 --test_net pretrained_model/referit/test.prototxt --pretrained_model pretrained_model/referit/final.caffemodel --cfg config/experiments/referit-kld-bbox_reg.yaml
    
  • refcoco
    python test_net.py --gpu_id 0 --test_split test --batchsize 64 --test_net pretrained_model/refcoco/test.prototxt --pretrained_model pretrained_model/refcoco/final.caffemodel --cfg config/experiments/refcoco-kld-bbox_reg.yaml
    
  • refcoco+
    python test_net.py --gpu_id 0 --test_split test --batchsize 64 --test_net pretrained_model/refcoco+/test.prototxt --pretrained_model pretrained_model/refcoco+/final.caffemodel --cfg config/experiments/refcoco+-kld-bbox_reg.yaml
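
Each test command above expects a test.prototxt and a final.caffemodel under 'pretrained_model/<dataset>/'. Before launching a test run, it can help to check that these files are in place. The sketch below does this with the standard library only; the function name missing_model_files is hypothetical and not part of this repository.

```python
import os


def missing_model_files(dataset, root="pretrained_model"):
    """Return the expected pretrained model files that do not exist on disk."""
    expected = [
        os.path.join(root, dataset, name)
        for name in ("test.prototxt", "final.caffemodel")
    ]
    return [path for path in expected if not os.path.isfile(path)]


if __name__ == "__main__":
    for name in ["flickr30k", "referit", "refcoco", "refcoco+"]:
        missing = missing_model_files(name)
        if missing:
            print("{}: missing {}".format(name, ", ".join(missing)))
        else:
            print("{}: ok".format(name))
```

An empty list means both files for that dataset were found.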
    

Citation

If the code is helpful for your research, please cite:

@article{yu2018rethining,
  title={Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding},
  author={Yu, Zhou and Yu, Jun and Xiang, Chenchao and Zhao, Zhou and Tian, Qi and Tao, Dacheng},
  journal={International Joint Conference on Artificial Intelligence (IJCAI)},
  year={2018}
}