Image Caption Generator

Given an image, this project generates a caption for it using two neural networks: a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM).

It applies transfer learning with the Xception model: the model's pretrained parameters are reused to encode an image as a 2048-dimensional feature vector. That vector is then fed into an LSTM, which predicts a caption from the features extracted by Xception.
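The encoding step can be sketched with the Keras build of Xception. This is a minimal illustration, not the repository's actual code: the function names `build_encoder` and `encode_image` are hypothetical, and it assumes TensorFlow is installed and that ImageNet weights are the pretrained parameters in question. Dropping the classification head and applying global average pooling yields one 2048-value vector per image.

```python
def build_encoder(weights="imagenet"):
    """Return Xception without its classification head.

    Hypothetical helper. With include_top=False and pooling="avg",
    the network outputs a 2048-value vector per image.
    """
    # Imported lazily because TensorFlow is slow to load.
    from tensorflow.keras.applications.xception import Xception
    return Xception(include_top=False, pooling="avg", weights=weights)


def encode_image(encoder, pixels):
    """Encode one RGB image of shape (299, 299, 3), values 0..255.

    Hypothetical helper; returns a NumPy array of shape (2048,).
    """
    import numpy as np
    from tensorflow.keras.applications.xception import preprocess_input

    batch = np.expand_dims(np.asarray(pixels, dtype="float32"), axis=0)
    batch = preprocess_input(batch)  # scales pixel values to [-1, 1]
    return encoder.predict(batch, verbose=0)[0]
```

Calling `build_encoder()` with the default argument downloads the ImageNet weights on first use; passing `weights=None` builds the same architecture with random parameters, which is enough to check output shapes.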

Model Architecture

The last layer of the Xception network is removed.
The image is fed into this modified network to produce its 2048-length encoding.
During training, the 2048-length vector is fed into a second neural network together with a caption for the image.
This second network contains an LSTM, which learns to generate a caption for the image.
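The last two steps can be illustrated in plain Python. The sketch below shows one common way to set this up, which may differ from what this repository does: during training each caption is split into (prefix, next word) pairs, and at inference time the caption is built one word at a time. The `<start>` and `<end>` markers, the function names, and the `predict_next` callable (a stand-in for the trained LSTM) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical marker tokens wrapped around every caption.
START, END = "<start>", "<end>"


def training_pairs(caption: str) -> List[Tuple[List[str], str]]:
    """Split a caption into (words so far, next word) training examples."""
    tokens = [START] + caption.lower().split() + [END]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def greedy_caption(
    features: Sequence[float],
    predict_next: Callable[[Sequence[float], List[str]], str],
    max_len: int = 20,
) -> str:
    """Generate a caption by repeatedly asking for the most likely next word.

    predict_next stands in for the trained LSTM: it receives the image
    features and the words generated so far and returns one word.
    """
    words = [START]
    for _ in range(max_len):
        nxt = predict_next(features, words)
        if nxt == END:
            break
        words.append(nxt)
    return " ".join(words[1:])  # drop the start marker
```

For the caption "a dog runs", `training_pairs` yields four examples, the last of which asks the network to predict `<end>` after seeing the full caption. Each example would be paired with the same 2048-length image vector when fed to the second network.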

Examples

Here are some captions generated by the network:

References

F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1800-1807, doi: 10.1109/CVPR.2017.195.
S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, pp. 1735-1780, 1997.
Y. LeCun, P. Haffner, and Y. Bengio, "Object Recognition with Gradient-Based Learning," 2000.
