Image Captioning with Azure ML

A deep learning model to generate automatic descriptive captions for Flickr images

Architecture

Architecture diagram

The workflow for this project uses Azure Blob Storage, Azure ML, and the Azure Computer Vision API.

Neural network architecture

Below is the general architecture for how we will build the deep learning model based on captions and images. We will utilize transfer learning and sequence models to generate captions.

Project Plan

Objectives:

Build a supervised deep learning model that can create alt-text captions for images
Train different models and select the one with the highest accuracy to compare against the caption generated by the Cognitive Services Computer Vision API

Output and success metrics:

Generate a short caption for an image randomly selected from the test dataset and compare it to the caption from the Computer Vision API output
High accuracy in predicting captions for images and a high BLEU score
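
To make the BLEU metric concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization, uniform weights over unigrams and bigrams, and no smoothing; the function names `ngrams` and `bleu` are illustrative and not part of this project. In practice a library implementation would likely be used instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=2):
    """Sentence-level BLEU: clipped n-gram precision times a brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed BLEU is zero when any precision is zero
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Made-up captions for illustration only.
score = bleu("a dog runs fast", ["a dog runs"])
```

Because each image has several reference captions, passing all of them in `references` lets a prediction match any of the human phrasings.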

About the data:

Flickr30k dataset (hosted on Kaggle) with roughly 30k images in JPEG format with over 158k captions. It has not been split into pre-defined training and test sets.
Each image has 5 different captions
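
A minimal sketch of grouping the captions by image is shown below. The pipe-delimited layout (`image_name| comment_number| comment`) and the helper name `load_captions` are assumptions for illustration; check the actual captions file before relying on this format. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def load_captions(lines):
    """Group captions by image name from pipe-delimited rows.

    Assumed row layout (not confirmed by the source):
    image_name| comment_number| comment
    """
    captions = defaultdict(list)
    reader = csv.reader(lines, delimiter="|")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed rows
        image_name = row[0].strip()
        # Rejoin in case the caption text itself contained a pipe character.
        caption = "|".join(row[2:]).strip()
        captions[image_name].append(caption)
    return dict(captions)


# Made-up sample rows standing in for the real captions file.
sample = io.StringIO(
    "image_name| comment_number| comment\n"
    "img_001.jpg| 0| A dog runs on the beach .\n"
    "img_001.jpg| 1| A brown dog is running near the water .\n"
    "img_002.jpg| 0| Two children play soccer .\n"
)
captions_by_image = load_captions(sample)
```

For the real file, an open file handle can be passed in place of the `io.StringIO` sample.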

Modeling techniques:

Transfer learning using Keras VGG16 or InceptionV3, plus an RNN model (LSTM or GRU) to sequence over natural-language image captions
Pre-trained word vectors through GloVe
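
As a rough illustration of how pre-trained GloVe vectors could feed the caption model, the sketch below parses the GloVe text format (a word followed by its vector components, space-separated) and builds an embedding matrix indexed by vocabulary position. The helper names, the tiny 3-dimensional vectors, and the `word_index` mapping are all hypothetical.

```python
import io


def load_glove(lines):
    """Parse GloVe text format: a word followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; unknown words stay zero."""
    # Row 0 is reserved for padding, so the matrix has len(word_index) + 1 rows.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[idx] = vec
    return matrix


# Tiny made-up 3-dimensional vectors standing in for a real GloVe file.
fake_glove = io.StringIO("dog 0.1 0.2 0.3\nruns 0.4 0.5 0.6\n")
vectors = load_glove(fake_glove)
word_index = {"dog": 1, "runs": 2, "zebra": 3}  # hypothetical vocabulary
embedding_matrix = build_embedding_matrix(word_index, vectors, dim=3)
```

A matrix like this could then be converted to a NumPy array and used to initialize the embedding layer of the RNN decoder, with words missing from GloVe left as zero vectors.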

Execution stages:

Prepare data
1. Download and store data in blob storage
2. Clean captions data
3. Build a list of images and corresponding captions (i.e., image-input and text-output)
4. Split data into training and validation sets
Create vocabulary from the training dataset
1. Preprocess captions
2. Get unique words from all image captions
3. Load pretrained word embeddings (GloVe)
4. Tokenize captions into TensorFlow records (insert end-of-sentence tokens, etc.)
Use images to train a model
1. Get data from blob storage
2. Extract features from photos using VGG model
3. Pass image feature vectors through the RNN decoder
Predict captions using trained model
Test model on validation data and measure accuracy
(if time): compare predicted image captions from the model to captions created by Cognitive Services API
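
The data-preparation and vocabulary stages above can be sketched in plain Python as follows. This is only an outline under stated assumptions: the `startseq`/`endseq` marker tokens, the 80/20 split, the helper names, and the sample captions are all hypothetical, and the split is done per image so that the captions of one image never land in both sets.

```python
import random
import string
from collections import Counter

START, END = "startseq", "endseq"  # hypothetical sentence marker tokens


def clean_caption(caption):
    """Lowercase, strip punctuation, and drop non-alphabetic tokens."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    return " ".join(w for w in words if w.isalpha())


def split_by_image(image_names, val_fraction=0.2, seed=42):
    """Split image names so all captions of an image stay in one set."""
    names = sorted(image_names)
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_fraction)
    return names[n_val:], names[:n_val]


def build_vocabulary(captions, min_count=1):
    """Map each sufficiently frequent word to an integer index starting at 1."""
    counts = Counter(w for c in captions for w in c.split())
    words = sorted(w for w, n in counts.items() if n >= min_count)
    return {w: i for i, w in enumerate(words, start=1)}


def encode(caption, word_index):
    """Map a caption to integer indices, dropping out-of-vocabulary words."""
    return [word_index[w] for w in caption.split() if w in word_index]


# Made-up captions for illustration only.
captions_by_image = {
    "img_001.jpg": ["A dog, running on the beach!"],
    "img_002.jpg": ["Two children play soccer."],
    "img_003.jpg": ["A man rides a bike."],
    "img_004.jpg": ["A woman reads a book."],
    "img_005.jpg": ["3 birds sit on a wire."],
}

cleaned = {img: [clean_caption(c) for c in caps]
           for img, caps in captions_by_image.items()}
train_names, val_names = split_by_image(cleaned)

# The vocabulary is built from training captions only, wrapped in marker tokens.
train_captions = [f"{START} {c} {END}" for n in train_names for c in cleaned[n]]
word_index = build_vocabulary(train_captions)
encoded = encode(train_captions[0], word_index)
```

Building the vocabulary from the training split alone keeps validation captions from leaking into the model's word list; validation words outside the vocabulary are simply dropped by `encode`.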

Software Frameworks:

Python
Keras, TensorFlow
scikit-learn and matplotlib
GloVe
Pandas

wongamanda/image-captioning
