
"ImageSpeak.ipynb" implements an image caption generator that combines ResNet for image feature extraction with a Transformer decoder for generating textual captions. Below is an overview of the main components and functionality of the code:

Mounting Drive and Downloading Dataset from Kaggle

The notebook starts by mounting Google Drive in the Colab environment and downloading the Flickr8k dataset from Kaggle.

ImageSpeak Workflow

The ImageSpeak model is divided into two main steps:

1. Feature Extraction using ResNet:

  • Utilizes a pre-trained ResNet-18 model to extract image features.
  • Removes duplicate images from the training and validation datasets.

2. Transformer Decoder Model:

  • Implements a Transformer Decoder for generating image captions.
  • Preprocesses and tokenizes captions, removes single-character and non-alphabetic words, and pads sequences.
  • Creates a vocabulary and maps tokens to IDs.

Data Loading and Splitting

  • Reads captions from a txt file into a pandas DataFrame.
  • Preprocesses captions by removing single-character words and adding start, end, and padding tokens.
  • Splits the dataset into training and validation sets.
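
The preprocessing and splitting steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the token strings `<start>`, `<end>`, and `<pad>`, the helper names, and the choice to split by image ID (so no image lands in both sets) are all assumptions.

```python
import random

# Hypothetical special-token strings; the notebook may use different ones
START, END, PAD = "<start>", "<end>", "<pad>"

def clean_caption(caption):
    """Lowercase the caption and keep alphabetic words longer than one character."""
    words = caption.lower().split()
    return [w for w in words if len(w) > 1 and w.isalpha()]

def add_special_tokens(tokens, max_len):
    """Wrap tokens in start/end markers, then truncate or pad to max_len."""
    seq = [START] + tokens[: max_len - 2] + [END]
    return seq + [PAD] * (max_len - len(seq))

def split_by_image(image_ids, val_fraction=0.1, seed=0):
    """Split unique image IDs so that no image appears in both sets (an assumption)."""
    unique_ids = sorted(set(image_ids))
    random.Random(seed).shuffle(unique_ids)
    n_val = max(1, int(len(unique_ids) * val_fraction))
    return set(unique_ids[n_val:]), set(unique_ids[:n_val])

tokens = clean_caption("A dog runs through the grass .")
padded = add_special_tokens(tokens, max_len=8)
train_ids, val_ids = split_by_image(
    ["1.jpg", "1.jpg", "2.jpg", "3.jpg", "4.jpg"], val_fraction=0.25
)
```

Here the single-character word "A" and the lone "." are dropped, and the remaining five words are wrapped and padded to eight tokens.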

Vocabulary Creation

Counts occurrences of each word, creates a vocabulary, and generates index-to-word and word-to-index mappings.
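
A minimal sketch of this step using `collections.Counter` is shown below. The function name `build_vocab` and the ordering of the vocabulary by descending frequency are assumptions made for illustration; the notebook may order or filter words differently.

```python
from collections import Counter

def build_vocab(tokenized_captions):
    """Count tokens and build index-to-word and word-to-index mappings."""
    counts = Counter(tok for caption in tokenized_captions for tok in caption)
    # most_common() sorts by descending count; ties keep first-seen order
    index_to_word = [word for word, _ in counts.most_common()]
    word_to_index = {word: idx for idx, word in enumerate(index_to_word)}
    return counts, index_to_word, word_to_index

captions = [
    ["<start>", "dog", "runs", "<end>"],
    ["<start>", "dog", "sleeps", "<end>"],
]
counts, index_to_word, word_to_index = build_vocab(captions)

# Map a caption to IDs and back again
encoded = [word_to_index[tok] for tok in captions[0]]
decoded = [index_to_word[idx] for idx in encoded]
```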

ResNet Feature Extraction

  • Extracts ResNet-18 features for training and validation images.
  • Saves the feature vectors in pickle files.
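
The caching pattern behind these two steps might look like the sketch below. To keep the example self-contained, `fake_extractor` is a hypothetical stand-in for the ResNet-18 forward pass and returns placeholder values, not real features; the file name `train_features.pkl` is also an assumption.

```python
import os
import pickle
import tempfile

def extract_all(image_names, extract_fn):
    """Run extract_fn once per unique image and collect the results in a dict."""
    unique_names = list(dict.fromkeys(image_names))  # drop duplicates, keep order
    return {name: extract_fn(name) for name in unique_names}

def fake_extractor(image_name):
    """Hypothetical stand-in for a ResNet-18 forward pass."""
    return [float(len(image_name))] * 4  # placeholder vector, not a real feature

# Each image appears once per caption, so the raw column contains repeats
image_column = ["a.jpg", "a.jpg", "bb.jpg", "a.jpg", "bb.jpg"]
features = extract_all(image_column, fake_extractor)

# Save the feature dict to a pickle file, then load it back
path = os.path.join(tempfile.mkdtemp(), "train_features.pkl")
with open(path, "wb") as f:
    pickle.dump(features, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Deduplicating before extraction means each image goes through the network once, however many captions it has.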

Transformer Decoder Model

  • Defines a Transformer Decoder model with positional encoding for caption generation.
  • Implements a training loop with CrossEntropyLoss and the Adam optimizer.
  • Saves the best model based on validation loss.
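
For reference, one common form of positional encoding is the fixed sinusoidal scheme from the original Transformer paper, sketched below in plain Python. Whether the notebook uses this scheme or a learned embedding is not stated here, so treat the formula and the function name as assumptions.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Each even/odd column pair shares one wavelength
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table

pe = positional_encoding(max_len=4, d_model=6)
```

Row `pos` of the table is added to the embedding of the token at position `pos`, giving the decoder a notion of word order.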

Model Evaluation and Caption Generation

  • Loads the trained model and performs evaluation.
  • Uses beam search to generate captions for given images.
  • Demonstrates caption generation on random validation images.
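
Beam search can be illustrated with the minimal sketch below. A toy lookup table (`TOY_MODEL`, hypothetical) replaces the trained decoder, and the function signature is an assumption; the notebook's version would call the Transformer Decoder on the image features at each step and may also apply length normalization.

```python
import math

def beam_search(next_log_probs, start, end, beam_width=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(seq) must return a dict mapping each candidate
    next token to its log-probability.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every candidate token
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the top beam_width; set aside those that reached the end token
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])[0]

# Toy next-token distribution keyed by the last token (illustrative only)
TOY_MODEL = {
    "<start>": {"a": math.log(0.6), "the": math.log(0.4)},
    "a": {"dog": math.log(0.5), "cat": math.log(0.5)},
    "the": {"dog": math.log(0.9), "cat": math.log(0.1)},
    "dog": {"<end>": 0.0},
    "cat": {"<end>": 0.0},
}

def toy_next_log_probs(seq):
    return TOY_MODEL[seq[-1]]

greedy_caption = beam_search(toy_next_log_probs, "<start>", "<end>", beam_width=1)
beam_caption = beam_search(toy_next_log_probs, "<start>", "<end>", beam_width=2)
```

With a beam width of 1 the search is greedy and commits to "a" (probability 0.6), ending with a total probability of 0.30. With a width of 2 it keeps "the" alive and finds "the dog" at 0.36, which is why a wider beam can yield better captions.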