This is a curated list of Image Captioning papers, datasets and code.
Image captioning is the task of describing an image in natural language. To do that, a model must recognize the objects, scene, characters and their relationships in the image, and then generate a sentence that expresses the detected elements in natural language. Image captioning is a hard task, joining two different areas of Artificial Intelligence: Computer Vision and Natural Language Processing.
This repo is organized into surveys, datasets and metrics, and then by the strategies used for Image Captioning: starting from early proposals based on description retrieval and template filling, going all the way to the use of deep learning techniques, first combining CNNs with RNNs and later using Transformers to build global representations and generate language.
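To make the CNN-with-RNN pipeline mentioned above concrete, here is a minimal toy sketch of "Show and Tell"-style greedy decoding. This is an illustrative assumption, not code from any of the listed papers: the image feature vector (which a CNN encoder would normally produce) and all weights are random stand-ins, and the vocabulary is invented.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary; real systems use thousands of tokens.
VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]
V, H, F = len(VOCAB), 8, 12  # vocab, hidden, and image-feature sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

# Randomly initialized weights stand in for trained parameters.
W_xh = rand_matrix(H, V)  # token one-hot -> hidden
W_hh = rand_matrix(H, H)  # hidden -> hidden
W_fh = rand_matrix(H, F)  # image feature -> initial hidden state
W_hy = rand_matrix(V, H)  # hidden -> vocabulary scores

def greedy_caption(image_feature, max_len=10):
    """Condition the RNN state on the image, then emit the highest-scoring
    token at each step until <end> or max_len (greedy decoding)."""
    h = [math.tanh(x) for x in matvec(W_fh, image_feature)]  # image conditions the state
    token = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        x = [1.0 if i == token else 0.0 for i in range(V)]   # one-hot of last token
        h = [math.tanh(a) for a in vadd(matvec(W_xh, x), matvec(W_hh, h))]  # Elman RNN step
        scores = matvec(W_hy, h)
        token = scores.index(max(scores))                    # greedy argmax choice
        if VOCAB[token] == "<end>":
            break
        caption.append(VOCAB[token])
    return caption

# With random weights the output is gibberish; the point is the decoding loop.
print(greedy_caption([random.uniform(-1, 1) for _ in range(F)]))
```

Trained systems replace the random feature with a CNN encoding, learn the weights by maximizing caption likelihood, and usually decode with beam search rather than pure greedy argmax.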
First, we added some surveys to help explore the area:
- Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures
- Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, Barbara Plank
- From Show to Tell: A Survey on Deep Learning-based Image Captioning
- Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, Rita Cucchiara
- Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics
- Hodosh, Micah and Young, Peter and Hockenmaier, Julia
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
- Young, Peter and Lai, Alice and Hodosh, Micah and Hockenmaier, Julia
- Microsoft COCO: Common Objects in Context
- Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár
- VizWiz
- Automatic image captioning
- J.-Y. Pan, H.-J. Yang, P. Duygulu, and C. Faloutsos
- Every Picture Tells a Story: Generating Sentences from Images
- A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth
- Im2text: Describing images using 1 million captioned photographs
- V. Ordonez, G. Kulkarni, and T. Berg
- Deep fragment embeddings for bidirectional image sentence mapping
- Andrej Karpathy, Armand Joulin, Li Fei-Fei
- I2T: Image parsing to text description
- Benjamin Z. Yao, Xiong Yang, Liang Lin, Mun Wai Lee and Song-Chun Zhu
- Corpus-guided sentence generation of natural images
- Y. Yang, C. Teo, H. Daumé III, and Y. Aloimonos
- Show and Tell: A Neural Image Caption Generator - 2015
- Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention - 2015
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio
- Deep visual-semantic alignments for generating image descriptions - 2015
- Andrej Karpathy, Li Fei-Fei
- Review Networks for Caption Generation - 2016
- Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, William W. Cohen
- SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning - 2017
- L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.- S. Chua
- Rethinking the Form of Latent States in Image Captioning - 2018
- Bo Dai, Deming Ye, Dahua Lin
- Attention on Attention for Image Captioning - 2019
- Lun Huang, Wenmin Wang, Jie Chen, Xiao-Yong Wei
- X-Linear Attention Networks for Image Captioning
- Yingwei Pan, Ting Yao, Yehao Li, Tao Mei
- Meshed-Memory Transformer for Image Captioning - 2020
- Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara