An attention-based image description model.
This code is based on Kelvin Xu's arctic captions described in Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Changes:
- Create input data from a directory of images and a JSON file containing the descriptions.
- Gradient norm and value clipping.
- Most recent version of the ADAM optimiser (v8).
- Monitor training performance using external metrics.
Dependencies:
- Python 2.7
- Theano
- NumPy
- scikit-learn
- scikit-image
- PyTables (for reading the image features)
To extract visual features from your own images and to create training, validation, and test input files, you will also need:
- Caffe built with the Python bindings (only needed if you extract the visual features yourself)
To use the evaluation script (metrics.py), see coco-caption for the requirements. Install coco-caption in evaluate/ and create an empty __init__.py in evaluate/ so that it can be imported as a module.
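As a rough illustration of how those scorers are invoked, the snippet below uses coco-caption's Meteor and Cider classes. The exact import path depends on how coco-caption is laid out under evaluate/ (metrics.py defines what it actually imports), and the image ids and captions are made up:

```python
# Illustrative only: how coco-caption's scorers are typically called.
# Adjust the import path to match where coco-caption lives under evaluate/.
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

# Both scorers take dicts mapping an image id to a list of caption strings.
references = {0: ['a dog runs across the grass', 'a brown dog running outside']}
hypotheses = {0: ['a dog is running on the grass']}

meteor_score, _ = Meteor().compute_score(references, hypotheses)
cider_score, _ = Cider().compute_score(references, hypotheses)
print 'Meteor %.3f, CIDEr %.3f' % (meteor_score, cider_score)
```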
You can download pre-extracted training, development, and test features for the Flickr30K dataset:
- The HDF5 file contains the CONV_5,4 image feature vectors. Each image vector is stored as a flattened (14, 14, 512) ndarray, which is reshaped into a (14x14, 512) ndarray when it is used by the model (see the loading sketch below).
You can download the pre-extracted training, development, and test sentences and the dictionary.
- The numpy file contains a list of (sentence, index) tuples. The index entry maps directly to the index of the visual feature vector in the HDF5 file.
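The snippet below is a rough illustration of how the two files fit together. The file names and the HDF5 node name are assumptions, so inspect the downloaded files for the actual names:

```python
import numpy
import tables

# Open the pre-extracted features (file and node names are illustrative).
h5file = tables.open_file('flickr30k_train_feats.h5', mode='r')
feats = h5file.root.feats                  # hypothetical node: one flattened vector per image

# Load the (sentence, index) tuples; each index points at a row of the HDF5 features.
sentences = numpy.load('flickr30k_train_sents.npy')

sentence, index = sentences[0]
feature = feats[int(index)].reshape(14 * 14, 512)   # (196, 512), as consumed by the model
print '%s -> feature shape %s' % (sentence, feature.shape)

h5file.close()
```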
You can also download a pre-trained model for this dataset.
make_dataset.py takes care of creating the image features file and the sentences file. See make_dataset.py for instructions on how to create dataset files from your data.
If you create a new dataset, you will need to write a new dataset loader module for it. See flickr30k.py for an example of how to do this.
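The real interface is whatever flickr30k.py exposes and train_model.py calls; the sketch below is only a rough outline, assuming the loader provides a load_data() function that returns the three splits plus the word dictionary. All file names and the HDF5 node name are hypothetical:

```python
# my_dataset.py -- illustrative sketch only; mirror flickr30k.py for the real interface.
import cPickle as pkl

import numpy
import tables


def _load_split(features_path, sentences_path):
    """Return (list of (sentence, index) tuples, feature matrix) for one split."""
    h5file = tables.open_file(features_path, mode='r')
    feats = h5file.root.feats[:]                  # hypothetical node name
    h5file.close()
    sentences = list(numpy.load(sentences_path))
    return sentences, feats


def load_data(path='./'):
    # All file names below are hypothetical; use whatever make_dataset.py produced.
    train = _load_split(path + 'train_feats.h5', path + 'train_sents.npy')
    valid = _load_split(path + 'dev_feats.h5', path + 'dev_sents.npy')
    test = _load_split(path + 'test_feats.h5', path + 'test_sents.npy')
    with open(path + 'dictionary.pkl', 'rb') as f:
        worddict = pkl.load(f)                    # word -> integer id mapping
    return train, valid, test, worddict
```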
You can train a model using THEANO_FLAGS=floatX=float32 python train_model.py. See the documentation in train_model.py and model.py for more information on the options.
If you want to use the metrics.py script to control training of the model (e.g. save model parameters based on Meteor or CIDEr), pass "{'use_metrics':'True'}" as an argument to train_model.py and install the coco-caption dependencies described above.
Generate descriptions using THEANO_FLAGS=floatX=float32 python generate_caps.py $model_name $PREFIX. This will write the descriptions to $PREFIX.dev.txt and $PREFIX.test.txt. Use the --dataset $DATASET_NAME argument to generate descriptions for images in a different dataset.
If you use this code as part of any published research, please acknowledge the following paper (it encourages researchers who publish their code!):
"Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. ICML (2015)
The code is released under a revised (3-clause) BSD License.