mxnet_image_caption: A Python repository from saicoco

image caption generation

This is a simple implementaion of paper Neural [Image Caption]¹ based on mxnet.
Some codes refer where-to-image

Usage

Prepare the datasets and pre_train params to dirs datasets and pre_train, here we use pretrain-model is vgg-16, datasets are Filckr8k, you could replace it with your datasets and pretrain_model. For Flickr8k, which includes images and captions, captions are store in dataset.json, looks like following:

{"images": 
    [
        {"sentids": [0, 1, 2, 3, 4], 
        "imgid": 0, 
        "sentences": [
        {"tokens": ["a", "black", "dog", "is", "running", "after", "a", "white", "dog", "in", "the", "snow"], "raw": "A black dog is running after a white dog in the snow .", "imgid": 0, "sentid": 0}, 
        {"tokens": ["black", "dog", "chasing", "brown", "dog", "through", "snow"], "raw": "Black dog chasing brown dog through snow", "imgid": 0, "sentid": 1}, 
        {"tokens": ["two", "dogs", "chase", "each", "other", "across", "the", "snowy", "ground"], "raw": "Two dogs chase each other across the snowy ground .", "imgid": 0, "sentid": 2}, 
        {"tokens": ["two", "dogs", "play", "together", "in", "the", "snow"], "raw": "Two dogs play together in the snow .", "imgid": 0, "sentid": 3}, 
        {"tokens": ["two", "dogs", "running", "through", "a", "low", "lying", "body", "of", "water"], "raw": "Two dogs running through a low lying body of water .", "imgid": 0, "sentid": 4}
                    ], 
        "split": "train", "filename": "2513260012_03d33305cf.jpg"}, ...
    
    ],
"datasets":Flickr8k}

or you can download processed data from here, which image are extracted from vgg networks 4096-dim, and unzip them into dir 'datasets', then copy file which in "old" dir into root dir, and run it, this is a old version about NIC.

After data downloading completes, you can run:

python 1_preprocess_data.py

when it runs over, there will be a directory named "processed_data" which include train, val and test datasets which are splited by "split" key in dataset.json .

python 2_train_val.py to train model on your dataset and save you dataset.
There are something wrong with test stage(predict), (variable length for sym, I think I should use mx.mod.BuckingModule), I am trying~~~~~~~~~~~, if you find the solution, welcome to issue me.

Reference

Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3156-3164. ↩

saicoco/mxnet_image_caption

image caption generation

Usage

Reference

Footnotes