Code for the image-caption retrieval methods from "VSE++: Improving
Visual-Semantic Embeddings with Hard Negatives" (Faghri, Fleet, Kiros, and
Fidler, 2017).
We recommend using Anaconda to install the required packages. The NLTK Punkt sentence tokenizer is also needed; download it with:
```python
import nltk
nltk.download('punkt')  # Punkt sentence tokenizer, used to tokenize captions
```
Download the dataset files and pre-trained models. We use the splits produced by Andrej Karpathy. The precomputed image features are from here and here. To use full image encoders, download the images from their original sources here, here, and here.
```bash
wget http://www.cs.toronto.edu/~faghri/vsepp/vocab.tar
wget http://www.cs.toronto.edu/~faghri/vsepp/data.tar
wget http://www.cs.toronto.edu/~faghri/vsepp/runs.tar
```
We refer to the path of the files extracted from `data.tar` as `$DATA_PATH` and the files extracted from `runs.tar` as `$RUN_PATH`. Extract `vocab.tar` to the `./vocab` directory.
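As a minimal extraction sketch in Python (the `data/`, `runs/`, and `vocab/` destination directories are assumptions; plain `tar -xf` on the command line works just as well):

```python
import tarfile

# Assumed destination directories; point $DATA_PATH and $RUN_PATH at the
# extracted data.tar and runs.tar contents, respectively.
archives = {"data.tar": "data", "runs.tar": "runs", "vocab.tar": "vocab"}

for archive, dest in archives.items():
    with tarfile.open(archive) as tf:
        tf.extractall(dest)  # creates dest/ if it does not already exist
```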
To evaluate a pre-trained model, run the following in Python:

```python
from vocab import Vocabulary
import evaluation
evaluation.evalrank("$RUN_PATH/coco_vse++/model_best.pth.tar", data_path="$DATA_PATH", split="test")
```
To do cross-validation on MSCOCO, pass `fold5=True` to `evaluation.evalrank` with a model trained using `--data_name coco`.
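For example, a cross-validated evaluation call might look like the sketch below; the `testall` split name is an assumption, so check `evaluation.py` for the exact split names supported.

```python
from vocab import Vocabulary
import evaluation

# fold5=True reports results averaged over five 1000-image folds of the
# MSCOCO test set. The "testall" split name here is an assumption.
evaluation.evalrank("$RUN_PATH/coco_vse++/model_best.pth.tar",
                    data_path="$DATA_PATH", split="testall", fold5=True)
```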
To train a new model, run `train.py`:
```bash
python train.py --data_path "$DATA_PATH" --data_name coco_precomp \
    --logger_name runs/coco_vse++ --max_violation
```
Arguments used to train pre-trained models:
| Method  | Arguments |
| ------- | --------- |
| VSE0    | `--no_imgnorm` |
| VSE++   | `--max_violation` |
| Order0  | `--measure order --use_abs --margin .05 --learning_rate .001` |
| Order++ | `--measure order --max_violation` |
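For context, `--max_violation` switches the triplet ranking loss from summing over all negatives in a mini-batch to using only the hardest negative, which is the key change behind VSE++. Below is a minimal PyTorch sketch of that loss; it is a simplified illustration of the max-of-hinges formulation from the paper, not the repo's exact code.

```python
import torch

def contrastive_loss(im, s, margin=0.2, max_violation=True):
    """Triplet ranking loss over a mini-batch of paired embeddings.

    im, s: L2-normalized image / caption embeddings of shape (batch, dim).
    max_violation=True keeps only the hardest negative per positive pair
    (VSE++); False sums over all negatives (VSE0-style).
    """
    scores = im @ s.t()                    # cosine similarity matrix (batch, batch)
    diagonal = scores.diag().view(-1, 1)   # scores of the matching pairs

    # Hinge losses: negative captions (rows) and negative images (columns).
    cost_s = (margin + scores - diagonal).clamp(min=0)
    cost_im = (margin + scores - diagonal.t()).clamp(min=0)

    # Zero out the positive pairs on the diagonal.
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)

    if max_violation:  # keep only the hardest negative in the batch
        cost_s = cost_s.max(1)[0]
        cost_im = cost_im.max(0)[0]
    return cost_s.sum() + cost_im.sum()
```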
If you found this code useful, please cite the following paper:
```
@article{faghri2017vse++,
  title={VSE++: Improving Visual-Semantic Embeddings with Hard Negatives},
  author={Faghri, Fartash and Fleet, David J and Kiros, Jamie Ryan and Fidler, Sanja},
  journal={arXiv preprint arXiv:1707.05612},
  year={2017}
}
```