This is project #2 of the Udacity Computer Vision Nanodegree: generating image descriptions using an encoder-decoder deep learning architecture.
- The project is based on the arXiv paper Show and Tell: A Neural Image Caption Generator (2015);
- The encoder is a pretrained ResNet-50 deep CNN available in PyTorch;
- The caption generator (decoder) was trained on the MS-COCO 2014 dataset; a minimal sketch of the architecture follows below.
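A sketch of what such an encoder-decoder pair can look like in PyTorch (the class names, layer sizes, and freezing strategy here are illustrative assumptions; see the notebooks for the actual implementation):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Pretrained ResNet-50 with its classifier head replaced by an embedding layer."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        for param in resnet.parameters():
            param.requires_grad_(False)          # freeze the pretrained backbone
        modules = list(resnet.children())[:-1]   # drop the final fc layer
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        features = self.resnet(images)
        features = features.view(features.size(0), -1)
        return self.embed(features)

class DecoderRNN(nn.Module):
    """LSTM language model conditioned on the image embedding."""
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # the image embedding acts as the first "word" of the sequence
        embeddings = self.embed(captions[:, :-1])
        inputs = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.linear(hiddens)
```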
To use the pretrained model, follow these steps:
$ git clone https://github.com/alex-f1tor/Image-Caption.git
$ cd Image-Caption
$ mkdir models
$ cd models
$ wget https://drive.google.com/open?id=19mcr08t6gY0UcUiAKTkBPO8MP0_wsghV -O 'decoder-4.pkl' && wget https://drive.google.com/open?id=1xe4zTMQAnH8QxcwHF7-i2lnmoBecJPYT -O 'encoder-4.pkl'
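Assuming the downloaded checkpoints are PyTorch state dicts saved from classes like the ones sketched above, they could be loaded as follows (the hyperparameter values, including the vocabulary size, are placeholders and must match those used during training):

```python
import torch

embed_size, hidden_size, vocab_size = 256, 512, 8855  # placeholders; must match training

encoder = EncoderCNN(embed_size)   # classes from the architecture sketch above
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)

encoder.load_state_dict(torch.load('models/encoder-4.pkl', map_location='cpu'))
decoder.load_state_dict(torch.load('models/decoder-4.pkl', map_location='cpu'))
encoder.eval()
decoder.eval()
```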
You can find an example of how to use the caption generator in the Inference.ipynb notebook.
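For illustration, a greedy-decoding loop similar in spirit to the notebook's could look like this (the preprocessing, the maximum caption length, and the `vocab` object mapping token ids to words are assumptions taken from the usual MS-COCO training pipeline):

```python
import torch
from PIL import Image
from torchvision import transforms

# standard ImageNet preprocessing for the ResNet-50 encoder
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

image = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    features = encoder(image)                  # (1, embed_size)
    inputs, states, tokens = features.unsqueeze(1), None, []
    for _ in range(20):                        # hypothetical max caption length
        hiddens, states = decoder.lstm(inputs, states)
        scores = decoder.linear(hiddens.squeeze(1))
        word = scores.argmax(dim=1)            # greedy: pick the most likely token
        tokens.append(word.item())
        inputs = decoder.embed(word).unsqueeze(1)

# `vocab` is assumed to come from the training pipeline (e.g. built in Training.ipynb)
print(' '.join(vocab.idx2word[t] for t in tokens))
```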
A few examples of captions generated for images:
You can also:
- Train your own caption network on the MS-COCO dataset, following the pipeline in Training.ipynb
- Estimate model performance in cocoEvalCap.ipynb via different metrics, such as CIDEr and ROUGE-L (see the evaluation sketch below)
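That notebook builds on the COCO caption evaluation toolkit; a sketch of how such an evaluation typically looks with the `pycocoevalcap` package (the file paths here are hypothetical):

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO('annotations/captions_val2014.json')                  # ground-truth captions
coco_res = coco.loadRes('results/captions_val2014_results.json')  # generated captions

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params['image_id'] = coco_res.getImgIds()  # score only the captioned images
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():
    print(f'{metric}: {score:.3f}')  # BLEU, METEOR, ROUGE-L, CIDEr, ...
```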
Overall quality of the captions generated for the MS-COCO validation set, measured by the CIDEr metric: