
Show, Attend and Tell

TensorFlow implementation of Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, which introduces an attention-based image caption generator. The model shifts its attention to the relevant part of the image as it generates each word.
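
At the heart of the model is the soft attention mechanism from the paper: at each decoding step, every spatial feature vector from the CNN is scored against the LSTM hidden state, the scores are softmaxed into attention weights, and the weighted sum of features (the context vector) conditions the next word. Below is a minimal NumPy sketch of that single step; the names and dimensions are my own illustration, not this repo's code.

```python
import numpy as np

def soft_attention(features, h, W_f, W_h, w_a):
    """One step of soft attention (illustrative sketch, not this repo's code).

    features: (L, D) conv feature vectors for L image regions (e.g. 196 x 512)
    h:        (H,)   current LSTM hidden state
    W_f, W_h, w_a:   learned projections into a shared attention space
    """
    # Additive (Bahdanau-style) score for each image region.
    scores = np.tanh(features.dot(W_f) + h.dot(W_h)).dot(w_a)  # (L,)
    # Softmax over regions: attention weights sum to 1.
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Context vector: expected feature under the attention distribution.
    context = alpha.dot(features)  # (D,)
    return context, alpha

# Toy usage with random values: 196 regions of 512-d features.
np.random.seed(0)
L, D, H, A = 196, 512, 1024, 512
context, alpha = soft_attention(np.random.randn(L, D), np.random.randn(H),
                                np.random.randn(D, A), np.random.randn(H, A),
                                np.random.randn(A))
print(alpha.sum())  # ~1.0, one weight per image region
```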



References

Author's Theano code: https://github.com/kelvinxu/arctic-captions

The TensorFlow implementation I referred to heavily: https://github.com/yunjey/show-attend-and-tell
(I followed this implementation's code and worked to understand it by comparing the paper against the code.
I also updated the code from TensorFlow r0.12 to TensorFlow r1.4.)


My Experiment Environment

MacBook Pro (Retina, 15-inch, Mid 2015)
(I trained using the CPU only.)

Note: the MS COCO training set is very large, and on macOS swap memory is limited to roughly 50 GB, so I reduced the MS COCO train, val, and test sets to one-third (1/3) of their original size each.

Getting Started

Prerequisites

First, clone this repo and coco-caption (which provides pycocoevalcap) into the same directory.

$ git clone https://github.com/leejk526/show-attend-and-tell.git
$ git clone https://github.com/tylin/coco-caption.git

This code is written in Python 2.7 and requires TensorFlow. In addition, you need to install a few more packages to process the MSCOCO dataset. I have provided a script to download the MSCOCO image dataset and the VGGNet19 model. Downloading the data may take several hours depending on your network speed. Run the commands below; the images will be downloaded into the image/ directory and the VGGNet19 model into the data/ directory.

$ cd show-attend-and-tell
$ pip install -r requirements.txt
$ chmod +x ./download.sh
$ ./download.sh

To feed images to the VGGNet, you need to resize the MSCOCO images to a fixed size of 224x224. Run the command below; the resized images will be stored in the image/train2014_resized/ and image/val2014_resized/ directories.

$ python resize.py
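
For reference, the resizing step conceptually boils down to the following sketch using Pillow (this is an assumption about what resize.py does; the actual interpolation and folder handling may differ):

```python
import os
from PIL import Image

def resize_images(src_dir, dst_dir, size=(224, 224)):
    """Resize every image in src_dir to a fixed size for VGGNet input."""
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir)
    for name in os.listdir(src_dir):
        try:
            image = Image.open(os.path.join(src_dir, name)).convert('RGB')
        except IOError:
            continue  # skip files that are not readable images
        image.resize(size, Image.BILINEAR).save(os.path.join(dst_dir, name))

resize_images('./image/train2014/', './image/train2014_resized/')
resize_images('./image/val2014/', './image/val2014_resized/')
```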

Before training the model, you have to preprocess the MSCOCO caption dataset. To build the caption dataset and extract image feature vectors, run the command below.

$ python prepro.py
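
Conceptually, the caption preprocessing builds a word-to-index vocabulary and converts each caption into a fixed-length sequence of indices padded with special tokens. A minimal sketch of that idea (the token names, threshold, and max length here are assumptions, not necessarily what prepro.py uses):

```python
from collections import Counter

def build_vocab(captions, min_count=1):
    """Map each sufficiently frequent word to an integer index."""
    counter = Counter(w for c in captions for w in c.lower().split())
    word_to_idx = {'<NULL>': 0, '<START>': 1, '<END>': 2}
    for word, count in counter.items():
        if count >= min_count:
            word_to_idx[word] = len(word_to_idx)
    return word_to_idx

def encode(caption, word_to_idx, max_len=16):
    """Convert one caption into a fixed-length list of word indices."""
    words = caption.lower().split()[:max_len]
    ids = [word_to_idx['<START>']]
    ids += [word_to_idx.get(w, word_to_idx['<NULL>']) for w in words]
    ids.append(word_to_idx['<END>'])
    ids += [word_to_idx['<NULL>']] * (max_len + 2 - len(ids))
    return ids

vocab = build_vocab(['a cat on a mat', 'a dog on a log'])
print(encode('a cat on a log', vocab))
```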

Train the model

To train the image captioning model, run the command below.

$ python train.py
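
For orientation, the objective in the paper is per-word cross-entropy plus a "doubly stochastic" attention penalty, lambda * sum_i (1 - sum_t alpha_ti)^2, which encourages the model to attend to each image region roughly once over the whole caption. A rough NumPy sketch of that loss for a single example (shapes and the coefficient are assumptions, not necessarily the repo's defaults):

```python
import numpy as np

def caption_loss(log_probs, targets, mask, alphas, lam=1.0):
    """Cross-entropy over caption words plus the doubly stochastic
    attention penalty from the paper.

    log_probs: (T, V) log-softmax outputs at each time step
    targets:   (T,)   ground-truth word indices
    mask:      (T,)   1 for real words, 0 for padding
    alphas:    (T, L) attention weights over L regions at each step
    """
    nll = -log_probs[np.arange(len(targets)), targets]  # (T,)
    xent = np.sum(nll * mask)
    # Penalize regions whose attention does not sum to ~1 over time.
    attn_penalty = np.sum((1.0 - alphas.sum(axis=0)) ** 2)
    return xent + lam * attn_penalty
```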

(optional) Tensorboard visualization

I have provided TensorBoard visualization for real-time debugging. Open a new terminal, run the command below, and open http://localhost:6005/ in your web browser.

$ tensorboard --logdir='./log' --port=6005 
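
For context, TensorBoard reads event files that training writes into the log/ directory. In TensorFlow r1.x, logging a scalar such as the batch loss typically looks like this generic sketch (not necessarily this repo's exact summary code):

```python
import tensorflow as tf

loss = tf.placeholder(tf.float32, name='loss')
tf.summary.scalar('batch_loss', loss)      # appears as a chart in TensorBoard
summary_op = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('./log', graph=sess.graph)
    for step, value in enumerate([2.5, 2.1, 1.8]):  # stand-in loss values
        summary = sess.run(summary_op, feed_dict={loss: value})
        writer.add_summary(summary, global_step=step)
    writer.close()
```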

Evaluate the model

To generate captions, visualize attention weights, and evaluate the model, please see evaluate_model.ipynb.
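
Under the hood, evaluation relies on the pycocoevalcap scorers from the coco-caption repo cloned earlier. Scoring generated captions against references looks roughly like this (assuming pycocoevalcap is on your Python path; the captions here are made up):

```python
from pycocoevalcap.bleu.bleu import Bleu

# Keys are image ids; values are lists of caption strings.
references = {0: ['a table with a plate of food and a cup of coffee',
                  'food and coffee sitting on a table']}
candidates = {0: ['a table with a plate of food and cup of coffee']}

scores, _ = Bleu(n=4).compute_score(references, candidates)
for i, s in enumerate(scores, start=1):
    print('BLEU-%d: %.3f' % (i, s))
```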



Result

Loss graph in TensorBoard

(figure: training loss curve; the bottom line in the graph is a mistake)

Results after about 6 epochs of training:

Validation data

(1) Generated caption: A table with a plate of food and cup of coffee .


(2) Generated caption: A large long train on a steel truck .


Test data

(1) Generated caption: A group of people sitting around a table with wine glasses .


(2) Generated caption: A group of people standing around a large airplane .
