- This is a neural image caption generator based on the paper *Show and Tell: A Neural Image Caption Generator* by Vinyals et al.
- The model is trained on the Flickr8k dataset.
- The PyTorch implementation can be found in `encoder_decoder.py`.
- The encoder is an EfficientNet with weights pretrained on ImageNet.
- The final layer of the EfficientNet is removed, and all prior layers are frozen for the duration of training.
- The image embedding is passed through a linear layer that reduces the dimensionality of the feature vector to that of the joint embedding space.
- This linear layer is trained jointly with the decoder in order to learn the joint embedding space (see the encoder sketch below).
- The decoder is an LSTM which generates a caption for the image.
- At the start of the decoding process, the feature vector from the encoder is fed into the LSTM as its first input, so the hidden state is conditioned on the embedded representation of the image before any words are generated.
- A linear layer maps the hidden state outputs to the vocabulary space, producing a probability distribution over the next word in the caption (see the decoder sketch below).
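
As a rough illustration of the encoder described above, here is a minimal sketch using torchvision's `efficientnet_b0`. The class name `EncoderCNN`, the choice of the B0 variant, and the `embed_size` parameter are assumptions made for this sketch; the actual implementation is in `encoder_decoder.py`.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Frozen pretrained EfficientNet backbone plus a trainable projection (illustrative sketch)."""

    def __init__(self, embed_size):
        super().__init__()
        # EfficientNet pretrained on ImageNet; the B0 variant is an assumption here.
        backbone = models.efficientnet_b0(
            weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1
        )
        feature_dim = backbone.classifier[1].in_features  # 1280 for B0
        # Remove the final classification layer.
        backbone.classifier = nn.Identity()
        # Freeze all pretrained layers for the duration of training.
        for param in backbone.parameters():
            param.requires_grad = False
        self.backbone = backbone
        # Trainable linear layer projecting into the joint embedding space.
        self.embed = nn.Linear(feature_dim, embed_size)

    def forward(self, images):
        # The backbone is frozen, so gradients are not needed for it.
        with torch.no_grad():
            features = self.backbone(images)  # (batch, feature_dim)
        return self.embed(features)           # (batch, embed_size)
```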
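
And a matching sketch of the LSTM decoder; again, the names `DecoderRNN`, `hidden_size`, and `vocab_size` are illustrative assumptions rather than the repo's exact interface.

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """LSTM decoder conditioned on the image embedding (illustrative sketch)."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        # Maps hidden state outputs to the vocabulary space.
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_features, captions):
        # Prepend the image embedding as the first input step so the hidden
        # state sees the image before any caption words are generated.
        word_embeddings = self.word_embed(captions)  # (batch, T, embed_size)
        inputs = torch.cat([image_features.unsqueeze(1), word_embeddings], dim=1)
        hidden_states, _ = self.lstm(inputs)         # (batch, T+1, hidden_size)
        return self.fc(hidden_states)                # logits over the vocabulary
```

The final linear layer outputs logits; applying a softmax (typically folded into the cross-entropy loss during training) turns them into the probability distribution over the next word.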