Zero-shot Image Captioning with VisualBERT and Show and Tell

This project demonstrates zero-shot image captioning with two well-known vision-and-language models: VisualBERT and Show and Tell.

Requirements

  • Python
  • PyTorch
  • transformers
  • torchvision
  • tensorflow

The repository contains two Jupyter notebooks: one for VisualBERT and one for Show and Tell.

VisualBERT Notebook

  • Load the VisualBERT Model
  • Load the COCO Dataset
  • Extract Image Features Using ResNet-18
  • Fine-Tune VisualBERT (this is where the error occurs)

Show and Tell Notebook

  • Load the COCO Dataset
  • Encode the captions
  • Preprocess the Images
  • Build the Model
  • Train the Model (This is where the errors occur)
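
The model-building step above can be sketched as a CNN encoder feeding an LSTM decoder, in the spirit of the original Show and Tell paper. All hyperparameters here are illustrative, and `weights=None` keeps the sketch download-free; in practice the InceptionV3 encoder would use pretrained ImageNet weights.

```python
import tensorflow as tf

# Illustrative hyperparameters, not the notebook's actual values
vocab_size, embed_dim, units, max_len = 5000, 256, 512, 20

# Encoder: InceptionV3 with its classification head removed, globally
# average-pooled, then projected into the caption embedding space
base = tf.keras.applications.InceptionV3(include_top=False, pooling="avg", weights=None)
image_in = tf.keras.Input(shape=(299, 299, 3))
img_feat = tf.keras.layers.Dense(embed_dim)(base(image_in))

# Decoder: word embeddings fed to an LSTM whose initial state is
# derived from the image feature
cap_in = tf.keras.Input(shape=(max_len,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(cap_in)
h0 = tf.keras.layers.Dense(units)(img_feat)
c0 = tf.keras.layers.Dense(units)(img_feat)
x = tf.keras.layers.LSTM(units, return_sequences=True)(x, initial_state=[h0, c0])
out = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)

model = tf.keras.Model([image_in, cap_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Training would then fit the model on (image, caption-prefix) pairs against next-word targets, which is where shape or masking errors typically surface.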