Pixels to Phrases

Encoder CNN is used with a language‑generating decoder RNN to generate a fitting natural‑language caption from the image. SpaCy tokenization is utilized for splitting strings into individual words and mapped to corresponding index values.

Requirements:

PyTorch
Torchvision
Spacy
Tensorboard
PIL
TQDM

Training

To train on the Flickr8k dataset, you will need to download Flickr8k and keep it inside the folder data/. The data directory tree will look like this:

data/
    flickr8k/
        images/
        captions.txt

Pre-Train Checkpoints

Pre-trained models are available in checkpoint. The checkpoint directory tree will look like this:

checkpoint/
    model_checkpoint.pth.tar

Commands

The pixels to Phrases pipeline is trained using

python train.py

The pixels-to-phrases pipeline is evaluated using

python test.py

KaziRamisaRifa/Pixels-to-Phrases

Pixels to Phrases

Requirements:

Training

Pre-Train Checkpoints

Commands