Image Captioning: A Comparative Study

Image captioning, defined as the automatic description of the content of an image, is a fundamental problem in artificial intelligence that lies at the intersection of computer vision and natural language processing. It requires recognizing the important objects in an image, along with their attributes and relationships, and generating syntactically and semantically correct sentences that describe the image. Deep learning-based techniques, which are known to work well for a wide array of complex tasks, have been applied to image captioning as well. Here, we present a comparative study of the performance of two deep learning-based image captioning techniques on the benchmark MS COCO dataset. We compare the models using natural language processing metrics such as BLEU, GLEU, METEOR, and WER (Word Error Rate). Furthermore, we analyze the performance of various model combinations by varying the training parameters (vocabulary size, batch size, number of images, epochs) and the convolutional network used (ResNet50, Inception V3).
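
All four metrics are computed at the sentence level against the reference captions. As a point of reference, the following is a minimal sketch (not the project's evaluation code) of how BLEU, GLEU, METEOR, and WER can be computed with NLTK plus a small word-level edit-distance routine. The example captions are illustrative; recent NLTK versions expect pre-tokenized input and the WordNet corpus for METEOR.

```python
# Minimal metrics sketch. Assumes NLTK is installed and the 'wordnet' corpus
# has been downloaded (nltk.download('wordnet')). Example captions are made up.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu
from nltk.translate.meteor_score import meteor_score


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    r, h = reference, hypothesis
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)


reference = "a man riding a wave on a surfboard".split()
candidate = "a man is surfing on a wave".split()

print("BLEU  :", sentence_bleu([reference], candidate,
                               smoothing_function=SmoothingFunction().method1))
print("GLEU  :", sentence_gleu([reference], candidate))
print("METEOR:", meteor_score([reference], candidate))
print("WER   :", word_error_rate(reference, candidate))
```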

Dataset Used: MS COCO

ATTENTION MODEL + ResNet50

MODEL DESCRIPTION:
| Attention Mechanism | Vocabulary (Unique Words) | Number of Images | Training Epochs | Captions per Image | Total Datapoints | Training Batch Size |
| --- | --- | --- | --- | --- | --- | --- |
| Bahdanau Attention | 7,000 | 6,000 | 30 | 5 | 30,000 | 64 |
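
The full implementation is linked in the repository; below is a minimal sketch (not the authors' code) of how such a model is typically wired: ResNet50 without its classification head serves as the encoder, producing a 7×7×2048 feature grid, and a GRU decoder with Bahdanau (additive) attention weights those 49 locations at every decoding step. The sketch assumes TensorFlow 2.x; apart from the 7,000-word vocabulary from the table above, all layer sizes are illustrative.

```python
# Minimal sketch (not the authors' code): ResNet50 features + Bahdanau attention.
# Assumes TensorFlow 2.x; embedding and GRU sizes are illustrative.
import tensorflow as tf

# Encoder: ResNet50 without the classification head yields a 7x7x2048 feature
# grid for a 224x224 input, reshaped downstream to 49 "locations" of size 2048.
feature_extractor = tf.keras.applications.ResNet50(include_top=False,
                                                   weights="imagenet")


class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(h, f_i) = v^T tanh(W1 f_i + W2 h)."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 49, 2048); hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        weights = tf.nn.softmax(scores, axis=1)               # (batch, 49, 1)
        context = tf.reduce_sum(weights * features, axis=1)   # (batch, 2048)
        return context, weights


class Decoder(tf.keras.Model):
    """GRU decoder that attends over the CNN feature grid at each step."""

    def __init__(self, vocab_size=7000, embedding_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.attention = BahdanauAttention(units)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, features, hidden):
        # word_ids: (batch, 1) previous token; features: (batch, 49, 2048)
        context, weights = self.attention(features, hidden)
        x = self.embedding(word_ids)                          # (batch, 1, embed)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state, weights
```

In use, the 7×7×2048 map from `feature_extractor` would be reshaped to (49, 2048) per image before being passed to the decoder as `features`.
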
Detailed results and detailed per-epoch results are linked in the repository.
MODEL PERFORMANCE SUMMARY

EXAMPLE: GOOD PREDICTION RESULT

EXAMPLE: FAIR PREDICTION RESULT

EXAMPLE: BAD PREDICTION RESULT

ATTENTION MODEL + InceptionV3

MODEL PERFORMANCE SUMMARY

MODEL 1:

MODEL DESCRIPTION:
| Attention Mechanism | Vocabulary (Unique Words) | Number of Images | Training Epochs | Captions per Image | Total Datapoints | Training Batch Size |
| --- | --- | --- | --- | --- | --- | --- |
| Bahdanau Attention | 5,000 | 6,000 | 10 | 5 | 30,000 | 64 |
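
Only the encoder changes when the convolutional network is swapped. A minimal sketch, assuming TensorFlow 2.x as in the ResNet50 sketch above: InceptionV3 without its top takes 299×299 inputs and yields an 8×8×2048 feature grid, i.e. 64 attention locations instead of 49, while the attention decoder stays the same.

```python
# Minimal sketch (not the authors' code): replacing the encoder with InceptionV3.
# Assumes TensorFlow 2.x and the decoder from the ResNet50 sketch above.
import tensorflow as tf

feature_extractor = tf.keras.applications.InceptionV3(include_top=False,
                                                      weights="imagenet")

images = tf.random.uniform((1, 299, 299, 3))                 # placeholder batch
images = tf.keras.applications.inception_v3.preprocess_input(images * 255.0)

features = feature_extractor(images)                         # (1, 8, 8, 2048)
features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))
print(features.shape)                                        # (1, 64, 2048): 64 attention locations
```
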
PREDICTION EXAMPLE

Detailed epoch results, the folder of image results, and the full report are linked in the repository.

MODEL 2:

MODEL DESCRIPTION:
| Attention Mechanism | Vocabulary (Unique Words) | Number of Images | Training Epochs | Captions per Image | Total Datapoints | Training Batch Size |
| --- | --- | --- | --- | --- | --- | --- |
| Bahdanau Attention | 7,000 | 6,000 | 30 | 5 | 30,000 | 64 |
PREDICTION EXAMPLE

Detailed epoch results, the folder of image results, and the full report are linked in the repository.

MODEL 3:

MODEL DESCRIPTION:
| Attention Mechanism | Vocabulary (Unique Words) | Number of Images | Training Epochs | Captions per Image | Total Datapoints | Training Batch Size |
| --- | --- | --- | --- | --- | --- | --- |
| Bahdanau Attention | 7,000 | 10,000 | 50 | 5 | 50,000 | 64 |
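
Across all of the tables above, the Total Datapoints column is simply the number of images multiplied by the captions per image. Assuming every image-caption pair is used for training (the tables do not state a train/validation split), the number of optimizer steps per epoch follows directly; the snippet below is illustrative arithmetic, not project code.

```python
import math

# Hedged arithmetic for the configurations above; assumes every image-caption
# pair is used for training (no train/validation split is stated in the tables).
configs = {
    "ResNet50 / InceptionV3 Model 2": {"images": 6_000, "captions_per_image": 5,
                                       "epochs": 30, "batch_size": 64},
    "InceptionV3 Model 1":            {"images": 6_000, "captions_per_image": 5,
                                       "epochs": 10, "batch_size": 64},
    "InceptionV3 Model 3":            {"images": 10_000, "captions_per_image": 5,
                                       "epochs": 50, "batch_size": 64},
}

for name, c in configs.items():
    datapoints = c["images"] * c["captions_per_image"]         # e.g. 6,000 * 5 = 30,000
    steps_per_epoch = math.ceil(datapoints / c["batch_size"])  # e.g. ceil(30,000 / 64) = 469
    print(f"{name}: {datapoints} datapoints, "
          f"{steps_per_epoch} steps/epoch, "
          f"{steps_per_epoch * c['epochs']} total steps")
```
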
PREDICTION EXAMPLE

Detailed epoch results, the folder of image results, and the full report are linked in the repository.

Authors

  1. Hemanth Teja Y (New York University) hy1713@nyu.edu
  2. Rahul Chinthala (New York University) rc4080@nyu.edu

REFERENCES

[1] Shetty, Rakshith, et al. “Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training.” ArXiv:1703.10476 [Cs], Nov. 2017. arXiv.org, http://arxiv.org/abs/1703.10476

[2] Hossain, Md Zakir, et al. “A Comprehensive Survey of Deep Learning for Image Captioning.” ArXiv:1810.04020 [Cs, Stat], Oct. 2018. arXiv.org, http://arxiv.org/abs/1810.04020

[3] Tavakoli, Hamed R., et al. “Paying Attention to Descriptions Generated by Image Captioning Models.” ArXiv:1704.07434 [Cs], Aug. 2017. arXiv.org, http://arxiv.org/abs/1704.07434

[4] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.

[5] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.

[6] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.

[7] Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi. Collective generation of natural image descriptions. pages 359–368, 2012.

[8] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):664–676, Apr. 2017.

[9] Loss Functions, ML Cheatsheet. https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html