/neural_image_caption

Implementing a ConvNet+LSTM caption net

Primary LanguagePython

A Neural Image Caption Generator

  1. Introduction
  2. Architecture

Introduction

Architecture

image

  • The pytorch implementation can be found in encoder_decoder.py.

Encoder

  • The encoder is an EfficientNet with weights pretrained on ImageNet.
  • The final layer of the EfficientNet is removed all prior layers are frozen for the duration of the training process.
  • The image embedding is passed through a linear layer to reduce the dimensionality of the feature vector to the dimensionality of the joint embedding space.
  • This final layer is jointly trained along with the decoder in order to learn the joint embedding space.

Decoder

  • The decoder is an LSTM which generates a caption for the image.
  • At the start of the decoding process, the feature vector from the encoder is passed through the LSTM to allow the hidden state to view the embedded representation of the image.
  • A linear layer is added in order to map the hidden state outputs to the vocabulary space, in order to generate a probability distribution over the next word in the caption.