DDPN

This project is the implementation of the paper Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. The network architecture of our visual grounding model with DDPN is illustrated in Figure 1.

Figure 1: The model network architecture for our visual grounding model.

Requirements

Python version 2.7
easydict
cv2
PyTorch 0.3 (optional but recommended; used to speed up multi-threaded data loading)

Pretrained Models

We release the trained models on four datasets. They achieve slightly better results than those reported in the paper.

| Split | Flickr30k-Entities | Referit | Refcoco | Refcoco+ |
| --- | --- | --- | --- | --- |
| val | 72.78% | 63.77% | 76.61% | 64.34% |
| test | 73.45% | 63.27% | 76.23% | 64.01% |
| testA | | | 79.99% | 71.24% |
| testB | | | 72.11% | 55.55% |

Download the pretrained models from BaiduYun.
Unzip the model files into the directory './pretrained_model'.

Preprocess

Build Caffe and its Python interface (pycaffe).
```
cd ./caffe
make all -j32
make pycaffe
```
Download the images (only the images are needed for this step).
- flickr30k-entities
  - download the Flickr30k-Entities images
  - move flickr30k-entities images to directory './data/flickr30k/flickr30k-images/'.
- referit: download the Referit images.
```
wget -O ./data/referit/ImageCLEF/referitdata.tar.gz http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/referitdata.tar.gz
tar -xzvf ./data/referit/ImageCLEF/referitdata.tar.gz -C ./data/referit/ImageCLEF/
```
- refcoco/refcoco+: download the MSCOCO train2014 images
  - MSCOCO train2014.
  - move the MSCOCO train2014 images to directory './data/mscoco/image2014/train2014/'
Extract DDPN image features. For a 3xhxw image, we extract a 2048-D visual feature and a 4-D spatial feature (post-processed to 5-D) as the input features for our model. The script we use is shown below. Note that we pass --num_bbox 100,100 to extract a fixed number of proposals (K=100) for each image.
```
./tools/extract_feat.py --gpu 0,1,2,3 --cfg experiments/cfgs/faster_rcnn_end2end_resnet_vg.yml --def models/vg/ResNet-101/faster_rcnn_end2end/test.prototxt --net /path/to/caffemodel --img_dir /path/to/images/ --out_dir /path/to/outfeat/ --num_bbox 100,100 --feat_name pool5_flat
```
- For flickr30k and referit, the image features are written to 'data/[flickr30k, referit]/features/bottom-up-feats/' by default. For refcoco/refcoco+, the image features are written to 'data/mscoco/features/bottom-up-feats/train2014'.
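The step above mentions that the 4-D spatial feature is post-processed to 5-D without giving the layout. The sketch below shows one common way to do this: normalize the box corners by the image size and append the relative box area. This layout and the function name `spatial_feature` are assumptions for illustration only; the output of './tools/extract_feat.py' may be organized differently.

```python
def spatial_feature(box, img_w, img_h):
    """Turn a 4-D box (x1, y1, x2, y2) in pixels into a 5-D normalized feature.

    Assumed layout (not confirmed by this README):
    [x1/W, y1/H, x2/W, y2/H, box_area/image_area].
    """
    x1, y1, x2, y2 = box
    # Convert to float so the ratios are not truncated on Python 2.7.
    w, h = float(img_w), float(img_h)
    area_ratio = ((x2 - x1) * (y2 - y1)) / (w * h)
    return [x1 / w, y1 / h, x2 / w, y2 / h, area_ratio]


# Example: a 100x50 box inside a 400x200 image.
print(spatial_feature((100, 50, 200, 100), 400, 200))
# prints [0.25, 0.25, 0.5, 0.5, 0.0625]
```

With K=100 proposals per image, applying such a function to every box would give a 100x5 spatial array alongside the 100x2048 visual features.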
Download the annotation files. We preprocessed the annotations of flickr30k-entities, referit, refcoco, and refcoco+ so that all datasets share the same format. Download our processed annotations from BaiduYun, then unzip the zip files into the directory './data'. We will release the annotation preprocessing code in the directory './preprocess'.
Modify the config files to match your own environment: set the number of data loader threads, the image feature directory, and the image directory in the YAML config files under './config/experiments/'.
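Before editing the config files, it can help to confirm that the default directories named in the steps above actually exist. The following is a minimal sketch that uses only the standard library; the helper name `missing_dirs` and the `EXPECTED_DIRS` table are hypothetical and simply restate the paths from this README. refcoco and refcoco+ share the same MSCOCO directories, so they appear as one entry.

```python
import os

# Directories named in this README, relative to the repository root.
EXPECTED_DIRS = {
    "flickr30k": [
        "data/flickr30k/flickr30k-images",
        "data/flickr30k/features/bottom-up-feats",
    ],
    "referit": [
        "data/referit/ImageCLEF",
        "data/referit/features/bottom-up-feats",
    ],
    "refcoco": [
        "data/mscoco/image2014/train2014",
        "data/mscoco/features/bottom-up-feats/train2014",
    ],
}


def missing_dirs(root, dataset):
    """Return the expected directories for `dataset` that do not exist under `root`."""
    return [d for d in EXPECTED_DIRS[dataset]
            if not os.path.isdir(os.path.join(root, d))]


if __name__ == "__main__":
    # Report every missing directory, assuming the script runs from the repository root.
    for name in sorted(EXPECTED_DIRS):
        for d in missing_dirs(".", name):
            print("[%s] missing: %s" % (name, d))
```

If you store images or features elsewhere, the check is not needed; point the YAML config files at your own locations instead.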

Training

flickr30k-entities

python train_net.py --gpu_id 0 --train_split train --val_split val --cfg config/experiments/flickr30k-kld-bbox_reg.yaml

referit

python train_net.py --gpu_id 0 --train_split train --val_split val --cfg config/experiments/referit-kld-bbox_reg.yaml

refcoco

python train_net.py --gpu_id 0 --train_split train --val_split val --cfg config/experiments/refcoco-kld-bbox_reg.yaml

refcoco+

python train_net.py --gpu_id 0 --train_split train --val_split val --cfg config/experiments/refcoco+-kld-bbox_reg.yaml

The output models are saved in the directory './models'.
The validation log is written to the directory './log'.
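The four training commands above differ only in the config file name, so they can be generated from the dataset name. The sketch below builds the same command lines and optionally runs them one after another. The helper names `train_command` and `train_all` and the `dry_run` switch are hypothetical conveniences, not part of this repository; the flags are copied from the commands above.

```python
import subprocess

DATASETS = ["flickr30k", "referit", "refcoco", "refcoco+"]


def train_command(dataset, gpu_id=0):
    """Build the train_net.py command line shown above for one dataset."""
    cfg = "config/experiments/%s-kld-bbox_reg.yaml" % dataset
    return ["python", "train_net.py",
            "--gpu_id", str(gpu_id),
            "--train_split", "train",
            "--val_split", "val",
            "--cfg", cfg]


def train_all(gpu_id=0, dry_run=True):
    """Print each command; when dry_run is False, also run them sequentially."""
    for name in DATASETS:
        cmd = train_command(name, gpu_id)
        print(" ".join(cmd))
        if not dry_run:
            # Stops at the first training run that exits with an error.
            subprocess.check_call(cmd)


if __name__ == "__main__":
    train_all(dry_run=True)
```

Running the script as written only prints the four commands; call `train_all(dry_run=False)` from the repository root to launch the runs.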

Testing