A deep learning model to generate automatic descriptive captions for Flickr images
Architecture diagram
The workflow for this project consists of Azure Blob Storage, Azure ML, and Azure Computer Vision API.
Neural network architecture
Below is the general architecture for how we will build the deep learning model based on captions and images. We will utilize transfer learning and sequence models to generate captions.
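As a rough illustration of how transfer learning and a sequence model can fit together, the sketch below builds a small Keras decoder that takes pre-extracted image features and a partial caption, then predicts the next word. It is a minimal sketch under stated assumptions, not this project's final architecture: the function name `build_caption_model`, the 4096-dimensional feature size (the width of VGG16's fully connected layers), the layer widths, and the choice to seed the LSTM state with the image features are all illustrative.

```python
from tensorflow.keras import layers, Model


def build_caption_model(vocab_size, max_len, feature_dim=4096, units=256):
    """Next-word predictor: image features seed the LSTM that reads the caption."""
    # Image branch: features extracted beforehand by a pre-trained CNN (assumed).
    image_in = layers.Input(shape=(feature_dim,), name="image_features")
    img = layers.Dropout(0.5)(image_in)
    init_h = layers.Dense(units, activation="tanh")(img)  # initial hidden state
    init_c = layers.Dense(units, activation="tanh")(img)  # initial cell state

    # Caption branch: integer word ids, with 0 reserved for padding.
    caption_in = layers.Input(shape=(max_len,), dtype="int32", name="caption_tokens")
    emb = layers.Embedding(vocab_size, units, mask_zero=True)(caption_in)
    seq = layers.LSTM(units)(emb, initial_state=[init_h, init_c])

    # Probability distribution over the vocabulary for the next word.
    out = layers.Dense(vocab_size, activation="softmax")(seq)

    model = Model(inputs=[image_in, caption_in], outputs=out)
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    return model
```

Swapping `layers.LSTM` for `layers.GRU` would require passing a single initial state instead of two.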
Objectives:
- Build a supervised deep learning model that can create alt-text captions for images
- Train different models and select the one with the highest accuracy to compare against the caption generated by the Cognitive Services Computer Vision API
Output and success metrics:
- Generate a short caption for an image randomly selected from the test dataset and compare it to the caption from the Computer Vision API output
- High accuracy in predicting image captions, together with a high BLEU score
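To make the BLEU metric concrete, here is a simplified sentence-level BLEU in plain Python: clipped n-gram precision for n = 1 to 4, combined by geometric mean and scaled by a brevity penalty. It is a teaching sketch with no smoothing, so any caption with no matching 4-gram scores 0; the function names are illustrative. For real evaluation, an established implementation such as `nltk.translate.bleu_score.sentence_bleu` is probably the safer choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence BLEU (no smoothing) against one or more references."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # For each n-gram, the highest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(cand)), l))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

With five reference captions per image, all five can be passed as `references` for each predicted caption.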
About the data:
- Flickr30k dataset (hosted on Kaggle): roughly 30k images in JPEG format with over 158k captions. It has not been split into predefined training and test sets.
- Each image has 5 different captions
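Because each image carries several captions and the dataset ships without a predefined split, one reasonable approach is to split at the image level so that all captions for an image land on the same side. The sketch below assumes the captions have already been parsed into `(image_id, caption)` pairs; the function names, the 20% validation fraction, and the seed are illustrative choices.

```python
import random
from collections import defaultdict


def group_captions(pairs):
    """Group (image_id, caption) pairs into {image_id: [captions]}."""
    grouped = defaultdict(list)
    for image_id, caption in pairs:
        grouped[image_id].append(caption)
    return dict(grouped)


def split_by_image(grouped, val_fraction=0.2, seed=42):
    """Split image ids, not individual captions, into train and validation sets."""
    image_ids = sorted(grouped)
    random.Random(seed).shuffle(image_ids)  # deterministic shuffle
    n_val = int(len(image_ids) * val_fraction)
    val_ids = set(image_ids[:n_val])
    train = {i: caps for i, caps in grouped.items() if i not in val_ids}
    val = {i: caps for i, caps in grouped.items() if i in val_ids}
    return train, val
```

Splitting by caption instead would let the same photo appear in both sets and inflate validation scores.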
Modeling techniques:
- Transfer learning using Keras VGG16 or InceptionV3 to encode images, combined with an RNN model (LSTM or GRU) to model the word sequences of the natural-language captions
- Pre-trained word vectors through GloVe
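For the GloVe step, the pre-trained vectors are distributed as text files with one word per line followed by its vector components. The sketch below parses such lines and builds an embedding matrix aligned with a word-to-index vocabulary. It assumes indices run from 1 to the vocabulary size with 0 reserved for padding; the function names are illustrative, and words missing from GloVe simply keep a zero vector.

```python
import numpy as np


def load_glove(lines):
    """Parse GloVe text lines ('word v1 v2 ...') into {word: vector}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the GloVe vector for the word with index i (row 0 = padding)."""
    matrix = np.zeros((len(vocab) + 1, dim), dtype="float32")
    for word, idx in vocab.items():
        vec = vectors.get(word)
        if vec is not None:  # out-of-vocabulary words stay all-zero
            matrix[idx] = vec
    return matrix
```

In practice `lines` would be an open file handle on a downloaded GloVe file, and the resulting matrix could initialize the weights of a Keras `Embedding` layer.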
Execution stages:
- Prepare data
  - Download and store data in blob storage
  - Clean captions data
  - Build a list of images and corresponding captions (i.e., image input and text output)
  - Split data into training and validation sets
  - Create vocabulary from the training dataset
- Preprocess captions
  - Get unique words from all image captions
  - Load pretrained word embeddings (GloVe)
  - Tokenize captions into TensorFlow records (insert end-of-sentence tokens, etc.)
- Use images to train a model
  - Get data from blob storage
  - Extract features from photos using the VGG model
  - Pass the image feature vectors to the RNN decoder
- Predict captions using the trained model
  - Test the model on validation data and measure accuracy
  - (If time) Compare predicted image captions from the model to captions created by the Cognitive Services Computer Vision API
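The caption cleaning, vocabulary, and tokenization steps above can be sketched in plain Python. This is a minimal illustration under assumptions, not the project's actual pipeline: the marker strings `<start>`, `<end>`, and `<unk>`, the letters-only cleaning rule, and the function names are all hypothetical, and index 0 is left free for padding.

```python
import re
from collections import Counter

START, END, UNK = "<start>", "<end>", "<unk>"  # illustrative marker tokens


def clean_caption(text):
    """Lowercase, drop everything except letters, and collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


def build_vocab(captions, min_count=1):
    """Map words from cleaned training captions to integer ids (0 = padding)."""
    counts = Counter(word for cap in captions for word in cap.split())
    vocab = {UNK: 1, START: 2, END: 3}
    for word in sorted(w for w, n in counts.items() if n >= min_count):
        vocab[word] = len(vocab) + 1
    return vocab


def encode(caption, vocab):
    """Wrap a cleaned caption in start/end markers and convert it to ids."""
    tokens = [START] + caption.split() + [END]
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]
```

Building the vocabulary from the training captions only means that unseen validation words map to `<unk>`, which keeps the validation measurement honest.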
Software frameworks:
- Python
- Keras, TensorFlow
- scikit-learn and matplotlib
- GloVe
- Pandas