/Pixels-to-Phrases

Primary LanguageJupyter NotebookMIT LicenseMIT

Pixels to Phrases

Encoder CNN is used with a language‑generating decoder RNN to generate a fitting natural‑language caption from the image. SpaCy tokenization is utilized for splitting strings into individual words and mapped to corresponding index values.

Requirements:

  • PyTorch
  • Torchvision
  • Spacy
  • Tensorboard
  • PIL
  • TQDM

Training

To train on the Flickr8k dataset, you will need to download Flickr8k and keep it inside the folder data/. The data directory tree will look like this:

data/
    flickr8k/
        images/
        captions.txt

Pre-Train Checkpoints

  • Pre-trained models are available in checkpoint. The checkpoint directory tree will look like this:
checkpoint/
    model_checkpoint.pth.tar

Commands

The pixels to Phrases pipeline is trained using

python train.py

The pixels-to-phrases pipeline is evaluated using

python test.py