
Neural Image Captioning in TensorFlow.

Primary language: Jupyter Notebook
License: MIT


Demo: http://www.fregu856.com/image_captioning
Poster: https://goo.gl/1DMQVE
Report: https://goo.gl/PzgRf5

Installed packages (each can be installed with pip install on Linux):
Cython (for the COCO PythonAPI)

Clone the TensorFlow models repo: https://github.com/tensorflow/models

Download the Inception-V3 model to where you want it (in my case to ~/CS224n/project/CS224n_project/inception):
$ cd models/tutorials/image/imagenet
$ python classify_image.py --model_dir ~/CS224n/project/CS224n_project/inception

How to extract features from the second-to-last layer of the pretrained CNN:

Dataset: Microsoft COCO:

Clone/download and place the "coco" folder in your project directory:

Download the training images and place in coco/images/train:
$ wget "http://msvocds.blob.core.windows.net/coco2014/train2014.zip"
$ unzip train2014.zip

Download the validation images:
$ wget "http://msvocds.blob.core.windows.net/coco2014/val2014.zip"
$ unzip val2014.zip
Place 5000 of the validation images in coco/images/val, 5000 in coco/images/test and the rest in coco/images/train.
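The split described above can be scripted. The following is a minimal sketch, not the procedure actually used for this project: the function name split_val_images, the random shuffle and the seed are all assumptions, since the README does not say how the 5000 + 5000 images were chosen.

```python
import random
import shutil
from pathlib import Path


def split_val_images(src_dir, images_root, n_val=5000, n_test=5000, seed=0):
    """Move n_val images to val/, n_test to test/ and the rest to train/."""
    src_dir, images_root = Path(src_dir), Path(images_root)
    files = sorted(p for p in src_dir.iterdir() if p.suffix.lower() == ".jpg")
    if len(files) < n_val + n_test:
        raise ValueError("not enough images for the requested split")

    # Assumption: images are picked at random (reproducibly, via the seed).
    random.Random(seed).shuffle(files)
    parts = {
        "val": files[:n_val],
        "test": files[n_val:n_val + n_test],
        "train": files[n_val + n_test:],
    }

    for name, members in parts.items():
        target = images_root / name
        target.mkdir(parents=True, exist_ok=True)
        for path in members:
            shutil.move(str(path), str(target / path.name))

    # Report how many images ended up in each folder.
    return {name: len(members) for name, members in parts.items()}
```

With the unzipped folder in the current directory, a call could look like split_val_images("val2014", "coco/images").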

Download the captions (captions_train2014.json and captions_val2014.json) and place in:

To install the Python API:
$ cd coco/PythonAPI
$ make

Demo of the PythonAPI:

For evaluation of captions:

Clone coco-caption and place in the coco folder in the project directory:
Make sure java is installed:
$ sudo apt-get install default-jdk

For initialization of the embedding matrix with GloVe vectors:

Download glove.6B.zip from https://nlp.stanford.edu/projects/glove/ and place glove.6B.300d.txt in coco/annotations.



  • ASSUMES: that preprocess_captions.py, extract_img_features_attention.py and create_initial_embeddings.py have already been run.
  • DOES: defines the GRU_attention model and contains a script for training the model (basically identical to LSTM_attention_model.py).


  • ASSUMES: that preprocess_captions.py, extract_img_features.py and create_initial_embeddings.py have already been run.
  • DOES: defines the GRU model and contains a script for training the model (basically identical to LSTM_model.py).


  • ASSUMES: that preprocess_captions.py, extract_img_features_attention.py and create_initial_embeddings.py have already been run.
  • DOES: defines the LSTM_attention model and contains a script for training the model.


  • ASSUMES: that preprocess_captions.py, extract_img_features.py and create_initial_embeddings.py have already been run.
  • DOES: defines the LSTM model and contains a script for training the model.


  • Must be called in one of the following ways:
    $ caption_img.py LSTM (for using the best LSTM model)
    $ caption_img.py LSTM_attention (for using the best LSTM_attention model)
    $ caption_img.py GRU (for using the best GRU model)
    $ caption_img.py GRU_attention (for using the best GRU_attention model)
  • ASSUMES: that preprocess_captions.py has already been run. That the image one would like to generate a caption for is called "img.jpg" and is placed in the directory "img_to_caption". That the weights for the best LSTM/GRU/LSTM_attention/GRU_attention model have been placed in models/model_type/best_model with names model.filetype.
  • DOES: generates a caption for "img.jpg" using the best model of the specified model type and displays the img and its caption. For attention models, it also displays a figure visualizing the img attention at the time of prediction for each word in the caption.


  • Must be called in one of the following ways:
    $ caption_img.py LSTM [img_id] (for using the best LSTM model)
    $ caption_img.py LSTM_attention [img_id] (for using the best LSTM_attention model)
    $ caption_img.py GRU [img_id] (for using the best GRU model)
    $ caption_img.py GRU_attention [img_id] (for using the best GRU_attention model)
  • ASSUMES: that preprocess_captions.py and extract_img_features.py have already been run. That the weights for the best LSTM/GRU/LSTM_attention/GRU_attention model have been placed in models/model_type/best_model with names model.filetype.
  • DOES: generates a caption for the test img with img id img_id if specified, otherwise for a random test img. It also displays the img and its caption. For attention models, it also displays a figure visualizing the img attention at the time of prediction for each word in the caption.
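The calling convention above (a required model type followed by an optional img id) can be checked before any model is loaded. This is a hypothetical sketch using argparse; the scripts' real argument handling is not shown in this README, and treating img_id as an integer is an assumption.

```python
import argparse

# The four model types named in the calling instructions above.
MODEL_TYPES = ("LSTM", "LSTM_attention", "GRU", "GRU_attention")


def parse_args(argv=None):
    """Parse a model type and an optional img id from the command line."""
    parser = argparse.ArgumentParser(description="Caption a test image.")
    parser.add_argument("model_type", choices=MODEL_TYPES)
    # Assumption: img ids are integers; None means "pick a random test img".
    parser.add_argument("img_id", nargs="?", type=int, default=None)
    return parser.parse_args(argv)
```

For example, parse_args(["GRU", "42"]) yields model_type "GRU" and img_id 42, while parse_args(["LSTM"]) leaves img_id as None. An unknown model type makes argparse exit with a usage message.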


  • ASSUMES: that "preprocess_captions.py" has already been run.
  • DOES: creates a word embedding matrix (embeddings_matrix) using GloVe vectors.
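As an illustration of the step above, here is a minimal sketch of building an embedding matrix from a GloVe text file, where each line holds a word followed by its vector components. The function names, the vocabulary argument and the random initialization for words missing from GloVe are assumptions, not the project's actual code.

```python
import numpy as np


def load_glove(path):
    """Parse a GloVe text file into a dict mapping word -> vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]],
                                         dtype=np.float32)
    return vectors


def build_embeddings_matrix(vocabulary, glove, dim=300, seed=0):
    """Return a (len(vocabulary), dim) matrix; row i embeds vocabulary[i]."""
    rng = np.random.default_rng(seed)
    # Assumption: words without a GloVe vector keep a small random init.
    matrix = rng.normal(scale=0.1, size=(len(vocabulary), dim))
    matrix = matrix.astype(np.float32)
    for index, word in enumerate(vocabulary):
        if word in glove:
            matrix[index] = glove[word]
    return matrix
```

With the file from the setup steps, load_glove("coco/annotations/glove.6B.300d.txt") would supply the glove argument.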


  • ASSUMES: that preprocess_captions.py, extract_img_features.py and extract_img_features_attention.py have already been run. That the weights for the best LSTM/GRU/LSTM_attention/GRU_attention model have been placed in models/model_type/best_model with names model.filetype.
  • DOES: generates captions for all 5000 imgs in test using the best LSTM/GRU/LSTM_attention/GRU_attention model, evaluates the captions and returns the metric scores (BLEU-1, BLEU-2, BLEU-3, BLEU-4, CIDEr, METEOR and ROUGE_L).
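Before captions can be scored, they are typically written to a JSON file in the COCO caption results format: a list of objects with an "image_id" and a "caption" field. The sketch below shows only that serialization step. The function name write_results is hypothetical, and how this project hands the file to coco-caption is not shown in this README.

```python
import json


def write_results(captions_by_img_id, results_path):
    """Write generated captions as a COCO-style caption results file."""
    results = [
        {"image_id": int(img_id), "caption": caption}
        for img_id, caption in sorted(captions_by_img_id.items())
    ]
    with open(results_path, "w", encoding="utf-8") as f:
        json.dump(results, f)
    return results
```

A file in this format is what the coco-caption demo notebook loads with the COCO API's loadRes before computing the metrics.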


  • ASSUMES: that the image dataset has been manually split such that all train images are stored in "coco/images/train/", all test images are stored in "coco/images/test/" and all val images are stored in "coco/images/val". That the Inception-V3 model has been downloaded and placed in inception.
  • DOES: extracts a 2048-dimensional feature vector for each train/val/test img and creates dicts mapping from img id to feature vector (train/val/test_img_id_2_feature_vector).
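The dict built in this step can be sketched as follows. COCO 2014 file names end in a zero-padded img id (for example COCO_val2014_000000000042.jpg), which is used as the key. The CNN itself is left out: extract_fn is a hypothetical callable standing in for the Inception-V3 forward pass, and both function names are assumptions.

```python
import re
from pathlib import Path

import numpy as np

# Matches the numeric id at the end of a COCO file name.
ID_PATTERN = re.compile(r"_(\d+)\.jpg$")


def img_id_from_filename(filename):
    """Return the integer img id encoded in a COCO image file name."""
    match = ID_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"no img id found in {filename!r}")
    return int(match.group(1))


def build_feature_dict(img_dir, extract_fn):
    """Map img id -> 2048-dimensional feature vector for every .jpg in img_dir."""
    features = {}
    for path in sorted(Path(img_dir).glob("*.jpg")):
        # extract_fn is a placeholder for the pretrained CNN (assumption).
        vector = np.asarray(extract_fn(path), dtype=np.float32)
        if vector.shape != (2048,):
            raise ValueError(f"unexpected feature shape {vector.shape}")
        features[img_id_from_filename(path.name)] = vector
    return features
```

The resulting dict can then be saved with pickle.dump, one dict per split.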


  • ASSUMES: that the image dataset has been manually split such that all train images are stored in "coco/images/train/", all test images are stored in "coco/images/test/" and all val images are stored in "coco/images/val". That the Inception-V3 model has been downloaded and placed in inception. That the dict numpy_params (containing W_img and b_img taken from the img_transform step in a well-performing non-attention model) is placed in coco/data/img_features_attention/transform_params.
  • DOES: extracts a 64x300 feature array (64 300-dimensional feature vectors, one for each of the 8x8 img regions) for each train/val/test img and saves each individual feature array to disk (to coco/data/img_features_attention). Is used in the attention models.
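The shapes involved can be illustrated with NumPy. This is a sketch under stated assumptions: that the CNN feature map for one img has shape (8, 8, 2048), that W_img has shape (2048, 300) and b_img has shape (300,), and that the transform is a plain affine map. The README only states the 64x300 output and the names W_img and b_img.

```python
import numpy as np


def region_features(feature_map, W_img, b_img):
    """Turn an (8, 8, 2048) feature map into a (64, 300) array of region features."""
    if feature_map.shape != (8, 8, 2048):
        raise ValueError(f"unexpected feature map shape {feature_map.shape}")
    # Flatten the 8x8 grid into 64 regions, one 2048-d vector per region.
    regions = feature_map.reshape(64, 2048)
    # Assumed affine transform to the 300-d space used by the models.
    return regions @ W_img + b_img
```

Each of the 64 rows then corresponds to one img region that an attention model can weight when predicting a word.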


  • ASSUMES: that "split_img_ids.py" has already been run. That the COCO Python API has been installed. That the files captions_train2014.json, captions_val2014.json and glove.6B.300d.txt are placed in coco/annotations. That the folder coco/data exists.
  • DOES: all necessary pre-processing of the captions. Creates a number of files; see all "cPickle.dump" calls in the script.
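One early pre-processing step is to group the reference captions by image. The COCO caption files store them in an "annotations" list whose entries carry an "image_id" and a "caption". The sketch below reads the JSON directly rather than through the COCO Python API; the function name and the lowercasing are assumptions, not necessarily what preprocess_captions.py does.

```python
import json
from collections import defaultdict


def captions_by_img_id(captions_file):
    """Return a dict mapping img id -> list of its reference captions."""
    with open(captions_file, encoding="utf-8") as f:
        data = json.load(f)

    grouped = defaultdict(list)
    for annotation in data["annotations"]:
        # Assumption: captions are stripped and lowercased before tokenization.
        caption = annotation["caption"].strip().lower()
        grouped[annotation["image_id"]].append(caption)
    return dict(grouped)
```

Called on coco/annotations/captions_val2014.json, this returns several captions per img id.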


  • ASSUMES: that the image dataset has been manually split such that all test images are stored in "coco/images/test/" and all val images are stored in "coco/images/val".
  • DOES: creates two files (val_img_ids, test_img_ids) containing the img ids for all val and test imgs, respectively. Is later used to sort an img as either train, val or test.
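How the two id files can later be used to sort an img into a split is sketched below. The helper names are hypothetical and pickle stands in for the cPickle module mentioned elsewhere in this README (Python 3 merged cPickle into pickle).

```python
import pickle


def save_ids(img_ids, path):
    """Store a collection of img ids on disk."""
    with open(path, "wb") as f:
        pickle.dump(sorted(img_ids), f)


def load_ids(path):
    """Load img ids as a set for fast membership tests."""
    with open(path, "rb") as f:
        return set(pickle.load(f))


def split_of(img_id, val_img_ids, test_img_ids):
    """Classify an img id as "val", "test" or (by exclusion) "train"."""
    if img_id in val_img_ids:
        return "val"
    if img_id in test_img_ids:
        return "test"
    return "train"
```

Any img id found in neither file is treated as a train img, which matches the split where all remaining validation images go to coco/images/train.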


  • DOES: contains a bunch of code snippets that have been tested or used at some point. Probably nothing interesting to see here.


  • DOES: contains a number of functions used in different parts of the project.


  • DOES: contains backend code for local live demo webpage.


  • DOES: contains frontend code for local live demo webpage.