ImageCaptions

A base model for image captioning

Config

  • Python 2.7
  • TensorFlow 1.8.0
  • Python packages
    • nltk
    • PIL
    • json
    • numpy

These are all common toolkits, so I don't give their links.
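
If you want to verify the environment first, the imports below (not part of the repo, just a quick sanity check) should all succeed under Python 2.7 with TensorFlow 1.8.0:

    # Quick environment sanity check (not part of the repo).
    import json

    import nltk
    import numpy
    import PIL
    import tensorflow as tf

    print(tf.__version__)  # expect '1.8.0'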

Data Download

  • COCO image dataset (a download sketch follows this list)
    • you need to download train2017.zip
    • then unzip it to the directory 'data/train2017/'
  • COCO image annotations
  • pretrained Inception model
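
If you do not already have the data, a minimal download-and-extract sketch for the images looks like the following; the URL is the standard public COCO image server (my assumption, not a link taken from this repo), and the paths follow the 'data/' layout above:

    # Hedged sketch: download train2017.zip and unzip it into 'data/'.
    import os
    import urllib
    import zipfile

    TRAIN_IMAGES_URL = 'http://images.cocodataset.org/zips/train2017.zip'

    if not os.path.isdir('data'):
        os.makedirs('data')

    # train2017.zip is very large, so this download takes a while.
    urllib.urlretrieve(TRAIN_IMAGES_URL, 'data/train2017.zip')

    # Extracting into 'data/' yields 'data/train2017/<image files>'.
    with zipfile.ZipFile('data/train2017.zip') as zf:
        zf.extractall('data/')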

Train

First, preprocess the data

  • get 'data/captions.json', 'data/captions_gt.json'
    $ cd preproccess
    $ python data_entry.py    
  • get 'data/image_id_train.json', 'data/image_id_val.json', 'data/image_id_test.json'
    $ cd preproccess
    $ python image_id_split.py    
  • get 'data/vocabulary.json' (a minimal sketch of this step follows the list)
    $ cd preproccess
    $ python vocabulary.py    
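
As a concrete illustration of the vocabulary step, it boils down to counting caption tokens and dumping the word list. This is only a hypothetical sketch (the real logic is in preproccess/vocabulary.py, and the 'data/captions.json' layout is my assumption):

    # Hypothetical vocabulary-building sketch; see preproccess/vocabulary.py
    # for the repo's actual implementation.
    import json
    from collections import Counter

    import nltk  # may require nltk.download('punkt') once for word_tokenize

    with open('data/captions.json') as f:
        captions = json.load(f)  # assumed layout: {image_id: [caption, ...]}

    counter = Counter()
    for caps in captions.values():
        for cap in caps:
            counter.update(nltk.word_tokenize(cap.lower()))

    # No frequency filtering, matching the vocabulary size of 14643 reported below.
    vocabulary = [word for word, _ in counter.most_common()]

    with open('data/vocabulary.json', 'w') as f:
        json.dump(vocabulary, f)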

Second, get TFRecord files

Because the dataset is large, we convert it to TFRecord files to improve throughput and CPU/GPU efficiency. Converting the training data to 'data/tfrecord/train-xx.tfrecord' takes about 30 minutes; I split the training data into 40 TFRecord files. A minimal writer sketch follows the list below.

  • get 'data/tfrecord/train-00.tfrecord' - 'data/tfrecord/train-39.tfrecord'
    $ python datasets.py    
  • you also need to generate 'data/tfrecord_name_train.json' for the TensorFlow filename queue; this is easy
  • the val and test datasets are handled the same way
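
As a rough illustration, writing one shard with the TensorFlow 1.x API and dumping the shard-name list looks like the sketch below; the feature names and the (jpeg_bytes, word_ids) pairing are my assumptions, not the exact schema used in datasets.py:

    # Hedged sketch of TFRecord writing (TensorFlow 1.8-style API).
    import json
    import tensorflow as tf

    def _bytes_feature(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    def _int64_feature(values):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

    def write_shard(examples, shard_path):
        """examples: iterable of (jpeg_bytes, caption_word_ids) pairs."""
        writer = tf.python_io.TFRecordWriter(shard_path)
        for jpeg_bytes, word_ids in examples:
            example = tf.train.Example(features=tf.train.Features(feature={
                'image/encoded': _bytes_feature(jpeg_bytes),
                'caption/word_ids': _int64_feature(word_ids),
            }))
            writer.write(example.SerializeToString())
        writer.close()

    # The filename list for the input queue is then just:
    shard_names = ['data/tfrecord/train-%02d.tfrecord' % i for i in range(40)]
    with open('data/tfrecord_name_train.json', 'w') as f:
        json.dump(shard_names, f)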

Third, let's train

    $ python main.py    
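
Training consumes the shards through a TensorFlow 1.x filename queue built from 'data/tfrecord_name_train.json'. The repo's real input pipeline lives in datasets.py and main.py; the following is only a hypothetical reading-side sketch matching the writer sketch above:

    # Hypothetical queue-based reading sketch (TF 1.x); feature names match
    # the writer sketch above, not necessarily the repo's schema.
    import json
    import tensorflow as tf

    with open('data/tfrecord_name_train.json') as f:
        filenames = json.load(f)

    filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)

    features = tf.parse_single_example(serialized, features={
        'image/encoded': tf.FixedLenFeature([], tf.string),
        'caption/word_ids': tf.VarLenFeature(tf.int64),
    })
    image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
    caption = tf.sparse_tensor_to_dense(features['caption/word_ids'])

    # Remember to call tf.train.start_queue_runners() inside a session
    # before evaluating these tensors.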

Experiments

Train/Val/Test datasets contain 82783/5000/5000 images, the vocabulary size is 14643, and we do not filter out any words. We use greedy search, not beam search.
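
"Greedy search" here just means taking the arg-max word at every decoding step instead of keeping several beam hypotheses. A minimal numpy-style sketch, where the step(state, word_id) -> (logits, state) interface is purely an assumption for illustration:

    # Illustrative greedy decoding loop; 'step' is a hypothetical model interface.
    import numpy as np

    def greedy_decode(step, initial_state, start_id, end_id, max_len=20):
        state, word = initial_state, start_id
        caption = []
        for _ in range(max_len):
            logits, state = step(state, word)
            word = int(np.argmax(logits))  # greedy: pick the single best word
            if word == end_id:
                break
            caption.append(word)
        return caption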

CNN+RNN

                BLEU_1   BLEU_2   BLEU_3   BLEU_4   METEOR   ROUGE    CIDEr
Train Dataset   0.7051   0.5322   0.3832   0.2682   0.2283   0.5128   0.7968
Val Dataset     0.6667   0.4866   0.3405   0.2337   0.2096   0.4831   0.7024
Test Dataset    0.6687   0.4879   0.3421   0.2364   0.2096   0.4838   0.6972
Paper           0.666    0.461    0.329    0.246    -        -        -

Paper: Show and Tell: A Neural Image Caption Generator, CVPR 2015 (pdf)

CNN+RNN+Soft-Attention

                BLEU_1   BLEU_2   BLEU_3   BLEU_4   METEOR   ROUGE    CIDEr
Val Dataset     0.6467   0.4615   0.3180   0.2177   0.2014   0.4684   0.6310
Test Dataset    0.6482   0.4638   0.3210   0.2217   0.2013   0.4633   0.6245
Paper           0.707    0.492    0.344    0.243    0.2390   -        -

Paper: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015 (pdf)

Example

examples

Summary

The model is very simple, and I never tuned the hyperparameters, so feel free to do so if you want.

References