An encoder CNN is paired with a language-generating decoder RNN to produce a fitting natural-language caption for an input image. spaCy tokenization splits each caption string into individual words, which are then mapped to corresponding index values.
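The word-to-index mapping described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `Vocabulary` class, its method names, the special tokens, and the `freq_threshold` parameter are all hypothetical, and a plain lowercase whitespace split stands in for the spaCy tokenizer so the sketch runs without extra dependencies.

```python
from collections import Counter


class Vocabulary:
    """Hypothetical helper that maps words to integer indices and back."""

    def __init__(self, freq_threshold=1):
        # Words seen fewer than freq_threshold times are treated as unknown.
        self.freq_threshold = freq_threshold
        # Special tokens are an assumption; the project may use different ones.
        self.itos = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
        self.stoi = {tok: idx for idx, tok in self.itos.items()}

    @staticmethod
    def tokenize(text):
        # Stand-in for spaCy tokenization: lowercase, then split on whitespace.
        return text.lower().split()

    def build(self, sentences):
        # Count every token, then assign the next free index to each
        # sufficiently frequent word in first-seen order.
        counts = Counter(tok for s in sentences for tok in self.tokenize(s))
        for tok, n in counts.items():
            if n >= self.freq_threshold and tok not in self.stoi:
                idx = len(self.stoi)
                self.stoi[tok] = idx
                self.itos[idx] = tok

    def numericalize(self, text):
        # Replace each token with its index, falling back to <UNK>.
        unk = self.stoi["<UNK>"]
        return [self.stoi.get(tok, unk) for tok in self.tokenize(text)]


vocab = Vocabulary()
vocab.build(["A dog runs", "a cat sleeps"])
print(vocab.numericalize("a dog sleeps"))  # [4, 5, 8]
print(vocab.numericalize("a bird"))        # [4, 3], "bird" maps to <UNK>
```

Reserving the low indices for padding, start, end, and unknown tokens is a common choice for caption models, since it lets the decoder mark sequence boundaries and handle words unseen during training.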
Dependencies:
- PyTorch
- Torchvision
- spaCy
- TensorBoard
- PIL (Pillow)
- tqdm
To train on the Flickr8k dataset, download Flickr8k and place it inside the data/ folder. The data directory tree will look like this:
    data/
        flickr8k/
            images/
            captions.txt
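Before training, it can help to confirm that the dataset is where the scripts expect it. The check below is a small sketch, assuming that images/ and captions.txt both sit directly inside data/flickr8k/; the function name `missing_dataset_paths` is hypothetical and not part of the project.

```python
from pathlib import Path


def missing_dataset_paths(root="data"):
    """Return the expected Flickr8k paths that are absent under root."""
    base = Path(root) / "flickr8k"
    missing = []
    # images/ is expected to be a directory of image files.
    if not (base / "images").is_dir():
        missing.append(str(base / "images"))
    # captions.txt is expected to be a regular file.
    if not (base / "captions.txt").is_file():
        missing.append(str(base / "captions.txt"))
    return missing


if __name__ == "__main__":
    problems = missing_dataset_paths()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("Dataset layout looks complete.")
```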
Pre-trained models are available in checkpoint/. The checkpoint directory tree will look like this:
    checkpoint/
        model_checkpoint.pth.tar
The pixels-to-phrases pipeline is trained using:
    python train.py
The pixels-to-phrases pipeline is evaluated using:
    python test.py