This project demonstrates zero-shot image captioning using two popular models from computer vision and natural language processing: VisualBERT and Show and Tell.
Requirements
- Python
- PyTorch
- transformers
- torchvision
- tensorflow
The repository contains two Jupyter notebooks, one for VisualBERT and the other for Show and Tell.

The VisualBERT notebook covers the following steps:
- Load the VisualBERT Model
- Load the COCO Dataset
- Extract Image Features Using ResNet18
- Fine-Tune VisualBERT (this is where the errors occur)
The Show and Tell notebook covers the following steps:
- Load the COCO Dataset
- Encode the captions
- Preprocess the Images
- Build the Model
- Train the Model (This is where the errors occur)
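A minimal Keras sketch of the Show and Tell pipeline above. The toy captions, feature dimension, and layer sizes are illustrative assumptions rather than the notebook's actual configuration; here the image embedding initialises the LSTM state, whereas the original Show and Tell paper feeds it to the decoder as the first input step:

```python
import tensorflow as tf

# hypothetical toy captions standing in for COCO annotations
captions = tf.constant(["<start> a dog runs <end>", "<start> a cat sleeps <end>"])

# encode the captions as fixed-length integer sequences
vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=6)
vectorizer.adapt(captions)
seqs = vectorizer(captions)                      # shape (2, 6)

vocab = vectorizer.vocabulary_size()
img_feats = tf.random.normal((2, 2048))          # stand-in CNN encoder features

# LSTM decoder conditioned on the image, predicting the next word at each step
embed = tf.keras.layers.Embedding(vocab, 64)
lstm = tf.keras.layers.LSTM(64, return_sequences=True)
proj = tf.keras.layers.Dense(vocab)

init = tf.keras.layers.Dense(64)(img_feats)      # project image into LSTM state size
logits = proj(lstm(embed(seqs), initial_state=[init, tf.zeros_like(init)]))
print(logits.shape)                              # (batch, sequence_length, vocab_size)
```

Training would minimise the cross-entropy between these logits and the caption tokens shifted by one position.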