IMAGE CAPTIONING

The problem of generating captions for a given image can be solved effectively with deep neural networks. In this program we use a CNN + RNN architecture.
Here the Convolutional Neural Network (CNN) extracts features from the image before they are passed through the rest of the pipeline.
We use a VGG16 model that is pre-trained for image classification, but instead of using the last classification layer,
we redirect the output of the previous layer. This gives us a vector with 4096 elements that summarizes the image contents.
We use this vector as the initial state of the Gated Recurrent Units (GRU). However, we need to map the 4096 elements down to a
vector with only 512 elements, as this is the internal state size of the GRU. To do this we need an intermediate fully-connected (dense) layer.
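
The wiring described above can be sketched in Keras roughly as follows. This is a minimal illustration, not the repository's actual code: the layer name `fc2` is the name Keras gives to the last 4096-unit layer of its VGG16, while `VOCAB_SIZE`, `EMBED_SIZE`, the `tanh` activation, and the helper names `build_encoder` and `build_decoder` are assumptions made for the example.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

STATE_SIZE = 512     # internal state size of the GRU (from the text)
VOCAB_SIZE = 10000   # assumption: vocabulary size is not given in the text
EMBED_SIZE = 128     # assumption: embedding size is not given in the text


def build_encoder(weights="imagenet"):
    """VGG16 cut off before the classification layer -> 4096-element vector."""
    vgg = VGG16(include_top=True, weights=weights)
    # "fc2" is the fully-connected layer just before the "predictions" layer.
    return Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)


def build_decoder():
    """Dense layer maps 4096 -> 512, which becomes the GRU's initial state."""
    transfer_values = layers.Input(shape=(4096,), name="transfer_values")
    # tanh is an assumption; it keeps the state in the GRU's usual range.
    initial_state = layers.Dense(STATE_SIZE, activation="tanh",
                                 name="state_map")(transfer_values)

    tokens = layers.Input(shape=(None,), dtype="int32", name="tokens")
    x = layers.Embedding(VOCAB_SIZE, EMBED_SIZE)(tokens)
    x = layers.GRU(STATE_SIZE, return_sequences=True)(
        x, initial_state=initial_state)
    # One score (logit) per vocabulary word at every time step.
    logits = layers.Dense(VOCAB_SIZE, name="word_logits")(x)
    return Model(inputs=[transfer_values, tokens], outputs=logits)
```

Calling `build_encoder()` downloads the ImageNet weights on first use. The decoder takes the 4096-element vector together with the caption tokens seen so far and returns a score for every word at each step.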

INPUT: RGB image of size (224, 224)
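
As a rough sketch of how an arbitrary image could be brought into this input format, assuming the standard Keras VGG16 preprocessing is used (the helper name `prepare_image` is hypothetical and not taken from the repository):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import preprocess_input

IMAGE_SIZE = (224, 224)  # input size stated above


def prepare_image(image):
    """Turn a uint8 RGB array (height, width, 3) into a (1, 224, 224, 3) batch."""
    # Bilinear resize; the result is a float32 array.
    resized = tf.image.resize(image, IMAGE_SIZE).numpy()
    batch = np.expand_dims(resized, axis=0)
    # Applies the channel reordering and mean subtraction VGG16 was trained with.
    return preprocess_input(batch)
```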

OUTPUT: complete captions describing the image

DATASET: We use the Flickr30k dataset to train the model.

LOSS FUNCTION: We use sparse cross-entropy, which compares the predicted word scores with integer word indices instead of one-hot vectors.

OPTIMIZER: We chose RMSprop over Adam, as Adam seems to diverge in some cases with Recurrent Neural Networks.
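
A minimal sketch of this loss and optimizer choice in Keras is shown below. It assumes the decoder outputs raw scores (logits), hence `from_logits=True`, and the learning rate of `1e-3` is only the Keras default, not a value taken from this project. The toy tensors exist purely to show the expected shapes.

```python
import math
import tensorflow as tf

# Sparse cross-entropy: targets are integer word ids, not one-hot vectors.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# RMSprop instead of Adam; the learning rate here is an assumption.
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3)

# Toy check: one caption with two steps over a 4-word vocabulary.
vocab_size = 4
targets = tf.constant([[1, 3]])            # shape (batch, steps)
logits = tf.zeros((1, 2, vocab_size))      # all words equally likely
loss = float(loss_fn(targets, logits))     # equals log(vocab_size) here
```

With uniform scores every word has probability 1/4, so the loss is log(4); training should push it below that baseline.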

Implemented using: TensorFlow, Keras

Model Summary:

The following is the summary of the VGG16 model

The following is the summary of the Recurrent layer

The processed Tensorboard graphs are as follows
