Image Captioning: A Comparative Study

Image captioning, defined as the automatic description of the content of an image, is a fundamental problem in artificial intelligence that lies at the intersection of computer vision and natural language processing. It requires recognizing the important objects in an image, along with their attributes and relationships, and generating syntactically and semantically correct sentences that describe the image. Deep learning-based techniques, which are known to work well for a wide array of complex tasks, have been applied to image captioning as well. Here, we present a comparative study of the performance of two deep learning-based image captioning techniques on the benchmark MS COCO dataset. We compare the models using natural language processing metrics such as BLEU, GLEU, METEOR, and WER (Word Error Rate). Furthermore, we analyze the performance of various model combinations by varying the training parameters (vocabulary size, batch size, number of images, epochs) and the convolutional network used (ResNet50, Inception V3).
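
All four metrics are computed at the sentence level against the reference captions. As a point of reference, the following is a minimal sketch (not the project's evaluation code) of how BLEU, GLEU, METEOR, and WER can be computed with NLTK plus a small word-level edit-distance routine. The example captions are illustrative; recent NLTK versions expect pre-tokenized input and the WordNet corpus for METEOR.

```python
# Minimal metrics sketch. Assumes NLTK is installed and the 'wordnet' corpus
# has been downloaded (nltk.download('wordnet')). Example captions are made up.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu
from nltk.translate.meteor_score import meteor_score


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    r, h = reference, hypothesis
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)


reference = "a man riding a wave on a surfboard".split()
candidate = "a man is surfing on a wave".split()

print("BLEU  :", sentence_bleu([reference], candidate,
                               smoothing_function=SmoothingFunction().method1))
print("GLEU  :", sentence_gleu([reference], candidate))
print("METEOR:", meteor_score([reference], candidate))
print("WER   :", word_error_rate(reference, candidate))
```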

Dataset Used: MS COCO

ATTENTION MODEL + ResNet50

MODEL DESCRIPTION:
| Attention Mechanism | Vocabulary (Unique Words) | Number of Images | Training Epochs | Captions per Image | Total Datapoints | Training Batch Size |
| --- | --- | --- | --- | --- | --- | --- |
| Bahdanau Attention | 7,000 | 6,000 | 30 | 5 | 30,000 | 64 |
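
The full implementation is linked in the repository; below is a minimal sketch (not the authors' code) of how such a model is typically wired: ResNet50 without its classification head serves as the encoder, producing a 7×7×2048 feature grid, and a GRU decoder with Bahdanau (additive) attention weights those 49 locations at every decoding step. The sketch assumes TensorFlow 2.x; apart from the 7,000-word vocabulary from the table above, all layer sizes are illustrative.

```python
# Minimal sketch (not the authors' code): ResNet50 features + Bahdanau attention.
# Assumes TensorFlow 2.x; embedding and GRU sizes are illustrative.
import tensorflow as tf

# Encoder: ResNet50 without the classification head yields a 7x7x2048 feature
# grid for a 224x224 input, reshaped downstream to 49 "locations" of size 2048.
feature_extractor = tf.keras.applications.ResNet50(include_top=False,
                                                   weights="imagenet")


class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(h, f_i) = v^T tanh(W1 f_i + W2 h)."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 49, 2048); hidden: (batch, units)
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        weights = tf.nn.softmax(scores, axis=1)               # (batch, 49, 1)
        context = tf.reduce_sum(weights * features, axis=1)   # (batch, 2048)
        return context, weights


class Decoder(tf.keras.Model):
    """GRU decoder that attends over the CNN feature grid at each step."""

    def __init__(self, vocab_size=7000, embedding_dim=256, units=512):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.attention = BahdanauAttention(units)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, features, hidden):
        # word_ids: (batch, 1) previous token; features: (batch, 49, 2048)
        context, weights = self.attention(features, hidden)
        x = self.embedding(word_ids)                          # (batch, 1, embed)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state, weights
```

In use, the 7×7×2048 map from `feature_extractor` would be reshaped to (49, 2048) per image before being passed to the decoder as `features`.
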
Detailed results and detailed per-epoch results are linked in the repository.
MODEL PERFORMANCE SUMMARY

EXAMPLE: GOOD PREDICTION RESULT

EXAMPLE: FAIR PREDICTION RESULT

EXAMPLE: BAD PREDICTION RESULT

ATTENTION MODEL + InceptionV3

MODEL PERFORMANCE SUMMARY

MODEL 1:

MODEL DESCRIPTION:
| Attention Mechanism | Vocabulary (Unique Words) | Number of Images | Training Epochs | Captions per Image | Total Datapoints | Training Batch Size |
| --- | --- | --- | --- | --- | --- | --- |
| Bahdanau Attention | 5,000 | 6,000 | 10 | 5 | 30,000 | 64 |
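
Only the encoder changes when the convolutional network is swapped. A minimal sketch, assuming TensorFlow 2.x as in the ResNet50 sketch above: InceptionV3 without its top takes 299×299 inputs and yields an 8×8×2048 feature grid, i.e. 64 attention locations instead of 49, while the attention decoder stays the same.

```python
# Minimal sketch (not the authors' code): replacing the encoder with InceptionV3.
# Assumes TensorFlow 2.x and the decoder from the ResNet50 sketch above.
import tensorflow as tf

feature_extractor = tf.keras.applications.InceptionV3(include_top=False,
                                                      weights="imagenet")

images = tf.random.uniform((1, 299, 299, 3))                 # placeholder batch
images = tf.keras.applications.inception_v3.preprocess_input(images * 255.0)

features = feature_extractor(images)                         # (1, 8, 8, 2048)
features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))
print(features.shape)                                        # (1, 64, 2048): 64 attention locations
```
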
PREDICTION EXAMPLE

Detailed epoch results, the folder of image results, and the full report are linked in the repository.

MODEL 2:

MODEL DESCRIPTION:
| Attention Mechanism | Vocabulary (Unique Words) | Number of Images | Training Epochs | Captions per Image | Total Datapoints | Training Batch Size |
| --- | --- | --- | --- | --- | --- | --- |
| Bahdanau Attention | 7,000 | 6,000 | 30 | 5 | 30,000 | 64 |
PREDICTION EXAMPLE

Detailed epoch results, the folder of image results, and the full report are linked in the repository.

MODEL 3:

MODEL DESCRIPTION:
| Attention Mechanism | Vocabulary (Unique Words) | Number of Images | Training Epochs | Captions per Image | Total Datapoints | Training Batch Size |
| --- | --- | --- | --- | --- | --- | --- |
| Bahdanau Attention | 7,000 | 10,000 | 50 | 5 | 50,000 | 64 |
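
Across all of the tables above, the Total Datapoints column is simply the number of images multiplied by the captions per image. Assuming every image-caption pair is used for training (the tables do not state a train/validation split), the number of optimizer steps per epoch follows directly; the snippet below is illustrative arithmetic, not project code.

```python
import math

# Hedged arithmetic for the configurations above; assumes every image-caption
# pair is used for training (no train/validation split is stated in the tables).
configs = {
    "ResNet50 / InceptionV3 Model 2": {"images": 6_000, "captions_per_image": 5,
                                       "epochs": 30, "batch_size": 64},
    "InceptionV3 Model 1":            {"images": 6_000, "captions_per_image": 5,
                                       "epochs": 10, "batch_size": 64},
    "InceptionV3 Model 3":            {"images": 10_000, "captions_per_image": 5,
                                       "epochs": 50, "batch_size": 64},
}

for name, c in configs.items():
    datapoints = c["images"] * c["captions_per_image"]         # e.g. 6,000 * 5 = 30,000
    steps_per_epoch = math.ceil(datapoints / c["batch_size"])  # e.g. ceil(30,000 / 64) = 469
    print(f"{name}: {datapoints} datapoints, "
          f"{steps_per_epoch} steps/epoch, "
          f"{steps_per_epoch * c['epochs']} total steps")
```
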
PREDICTION EXAMPLE

Detailed epoch results, the folder of image results, and the full report are linked in the repository.

Authors

  1. Hemanth Teja Y (New York University) hy1713@nyu.edu
  2. Rahul Chinthala (New York University) rc4080@nyu.edu

REFERENCES

[1] Shetty, Rakshith, et al. “Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training.” ArXiv:1703.10476 [Cs], Nov. 2017. arXiv.org, http://arxiv.org/abs/1703.10476

[2] Hossain, Md Zakir, et al. “A Comprehensive Survey of Deep Learning for Image Captioning.” ArXiv:1810.04020 [Cs, Stat], Oct. 2018. arXiv.org, http://arxiv.org/abs/1810.04020

[3] Tavakoli, Hamed R., et al. “Paying Attention to Descriptions Generated by Image Captioning Models.” ArXiv:1704.07434 [Cs], Aug. 2017. arXiv.org, http://arxiv.org/abs/1704.07434

[4] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.

[5] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.

[6] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.

[7] Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi. Collective generation of natural image descriptions. pages 359–368, 2012.

[8] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):664–676, Apr. 2017.

[9] Loss Functions, ML Cheatsheet. https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html