"For millions of years mankind lived just like the animals. Then something happened which unleashed the power of our imagination: we learned to talk".
This project reproduces the model from "Show and Tell: A Neural Image Caption Generator".
Image features are the outputs of the relu7 layer from the VGG network which you can download here.
Remove the drop7, fc8, and prob layers from the .prototxt file, so that relu7 is the last layer.
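The layer removal above can be done by hand in a text editor. As an illustration only, here is a minimal sketch that drops named layer blocks from prototxt text by matching braces. The function name `drop_layers` and the sample text are hypothetical and not part of this repository, and the sketch assumes each layer is a `layer { ... }` or `layers { ... }` block whose first `name:` field is the layer name.

```python
import re

# Hypothetical helper, not part of this repository.
LAYER_OPEN = re.compile(r'\blayers?\s*\{')
LAYER_NAME = re.compile(r'\bname\s*:\s*"([^"]+)"')


def drop_layers(prototxt_text, names_to_drop):
    """Return prototxt text without the layer blocks named in names_to_drop."""
    kept = []
    pos = 0
    while True:
        match = LAYER_OPEN.search(prototxt_text, pos)
        if match is None:
            kept.append(prototxt_text[pos:])
            break
        # Keep everything between the previous block and this one.
        kept.append(prototxt_text[pos:match.start()])
        # Walk forward from the opening brace to its matching close.
        depth = 0
        end = match.end() - 1
        while end < len(prototxt_text):
            if prototxt_text[end] == '{':
                depth += 1
            elif prototxt_text[end] == '}':
                depth -= 1
                if depth == 0:
                    break
            end += 1
        block = prototxt_text[match.start():end + 1]
        name = LAYER_NAME.search(block)
        if name is None or name.group(1) not in names_to_drop:
            kept.append(block)
        pos = end + 1
    return ''.join(kept)


# Assumed sample in the old "layers { ... }" style; the real file may differ.
sample = '''
layers { name: "fc7" type: INNER_PRODUCT bottom: "fc6" top: "fc7" inner_product_param { num_output: 4096 } }
layers { name: "relu7" type: RELU bottom: "fc7" top: "fc7" }
layers { name: "drop7" type: DROPOUT bottom: "fc7" top: "fc7" dropout_param { dropout_ratio: 0.5 } }
layers { name: "fc8" type: INNER_PRODUCT bottom: "fc7" top: "fc8" inner_product_param { num_output: 1000 } }
layers { name: "prob" type: SOFTMAX bottom: "fc8" top: "prob" }
'''
trimmed = drop_layers(sample, {"drop7", "fc8", "prob"})
```

After this call, `trimmed` still contains the fc7 and relu7 blocks, so relu7 is the last layer. It is worth checking the result against the actual .prototxt before using it with Caffe.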
You can download the prepared training and validation data from my Google Drive, or you can reproduce the image/text feature extraction pipeline as follows:
- Download datasets
- Run Python scripts for generating files which store the image paths and corresponding captions
  - run data_preparation/flickr/flickr8k/build_image_text_match.py
  - run data_preparation/flickr/flickr30k/build_image_text_match.py
  - run data_preparation/mscoco/build_image_text_match.py
- Run Python scripts for generating files which store image features
  - run data_preparation/flickr/extract_features.py
  - run data_preparation/mscoco/extract_features.py
- Run Python scripts for generating training and validation data
  - run data_preparation/merge_all_data.py
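The script steps above can be chained in one small driver. This is a sketch under assumptions: the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes that each script runs without command-line arguments from the repository root and that the datasets are already downloaded. Check each script for paths it expects before relying on this.

```python
import subprocess
import sys

# Script paths are taken from the list above, in the order they appear.
PIPELINE_SCRIPTS = [
    "data_preparation/flickr/flickr8k/build_image_text_match.py",
    "data_preparation/flickr/flickr30k/build_image_text_match.py",
    "data_preparation/mscoco/build_image_text_match.py",
    "data_preparation/flickr/extract_features.py",
    "data_preparation/mscoco/extract_features.py",
    "data_preparation/merge_all_data.py",
]


def build_commands(python=sys.executable):
    """Return one [interpreter, script] command per pipeline step."""
    return [[python, script] for script in PIPELINE_SCRIPTS]


def run_pipeline(commands, runner=subprocess.run):
    """Run the commands in order; check=True stops at the first failure."""
    for command in commands:
        runner(command, check=True)


# Usage (from the repository root):
#     run_pipeline(build_commands())
```

Stopping at the first failing step matters here because the later scripts presumably consume the files written by the earlier ones.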
To train the model, run caption_generation_model/train.py, or download the pretrained model from my Google Drive.
If you want to use the pretrained model, run the minimalistic Flask app caption_generation_server/app.py (note: it requires Caffe and its Python interface, pycaffe, to be installed).