Visual Semantic Embeddings & Text-Image Retrieval papers: an incomplete list

Conferences

DeViSE: A Deep Visual-Semantic Embedding Model.
Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov.
(NIPS 2013)
[paper]

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models.
Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel.
(NIPS 2014 Deep Learning Workshop)
[paper] [code](Theano)

Deep Visual-Semantic Alignments for Generating Image Descriptions.
Andrej Karpathy, Li Fei-Fei.
(CVPR 2015)
[paper]

Deep Correlation for Matching Images and Text.
Fei Yan, Krystian Mikolajczyk.
(CVPR 2015)
[paper]

Order-Embeddings of Images and Language.
Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun.
(ICLR 2016)
[paper]

Learning Deep Structure-Preserving Image-Text Embeddings.
Liwei Wang, Yin Li, Svetlana Lazebnik.
(CVPR 2016)
[paper]

Learning a Deep Embedding Model for Zero-Shot Learning.
Li Zhang, Tao Xiang, Shaogang Gong.
(CVPR 2017)
[paper] [code](TF)

Deep Visual-Semantic Quantization for Efficient Image Retrieval.
Yue Cao, Mingsheng Long, Jianmin Wang, Shichen Liu.
(CVPR 2017)
[paper]

Dual Attention Networks for Multimodal Reasoning and Matching.
Hyeonseob Nam, Jung-Woo Ha, Jeonghee Kim.
(CVPR 2017)
[paper]

Sampling Matters in Deep Embedding Learning.
Chao-Yuan Wu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl.
(ICCV 2017)
[paper] [zhihu discussion]

Learning Robust Visual-Semantic Embeddings.
Yao-Hung Hubert Tsai, Liang-Kang Huang, Ruslan Salakhutdinov.
(ICCV 2017)
[paper]

Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding.
Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, Gang Hua.
(ICCV 2017)
[paper]

Learning a Recurrent Residual Fusion Network for Multimodal Matching.
Yu Liu, Yanming Guo, Erwin M. Bakker, Michael S. Lew.
(ICCV 2017)
[paper]

VSE-ens: Visual-Semantic Embeddings with Efficient Negative Sampling.
Guibing Guo, Songlin Zhai, Fajie Yuan, Yuan Liu, Xingwei Wang.
(AAAI 2018)
[paper]

Incorporating GAN for Negative Sampling in Knowledge Representation Learning.
Peifeng Wang, Shuangyin Li, Rong Pan.
(AAAI 2018)
[paper]

Fast Self-Attentive Multimodal Retrieval.
Jônatas Wehrmann, Maurício Armani Lopes, Martin D. More, Rodrigo C. Barros.
(WACV 2018)
[paper] [code](PyTorch)

End-to-end Convolutional Semantic Embeddings.
Quanzeng You, Zhengyou Zhang, Jiebo Luo.
(CVPR 2018)
[paper]

Bidirectional Retrieval Made Simple.
Jonatas Wehrmann, Rodrigo C. Barros.
(CVPR 2018)
[paper] [code](PyTorch)

Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search.
Jamie Kiros, William Chan, Geoffrey Hinton.
(ACL 2018)
[paper]

Learning Visually-Grounded Semantics from Contrastive Adversarial Samples.
Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, Jian Sun.
(COLING 2018)
[paper] [code](PyTorch)

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives.
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler.
(BMVC 2018)
[paper] [code](PyTorch)
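
For orientation, the max-of-hinges objective that VSE++ popularizes (and that several of the sampling/loss papers in this list build on) fits in a few lines. The sketch below is an illustrative PyTorch version, not the authors' released code; the cosine-similarity setup, the margin value, and all names are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def max_hinge_loss(img_emb, txt_emb, margin=0.2):
    """VSE++-style triplet loss using the hardest in-batch negative.

    img_emb, txt_emb: (batch, dim) tensors; row i of each is a matched pair.
    """
    # Cosine similarity matrix: scores[i, j] = sim(image_i, caption_j).
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()

    pos = scores.diag().view(-1, 1)                       # matched-pair similarities
    cost_txt = (margin + scores - pos).clamp(min=0)       # image -> wrong captions
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # caption -> wrong images

    # Mask out the diagonal (the positives themselves).
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)

    # Keep only the hardest negative per image and per caption, then average.
    return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()
```

Replacing the two `max` reductions with sums recovers the earlier sum-of-hinges ranking objective used by works such as the Kiros et al. paper above.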

An Adversarial Approach to Hard Triplet Generation.
Yiru Zhao, Zhongming Jin, Guo-Jun Qi, Hongtao Lu, Xian-Sheng Hua.
(ECCV 2018)
[paper]

Conditional Image-Text Embedding Networks.
Bryan A. Plummer, Paige Kordas, M. Hadi Kiapour, Shuai Zheng, Robinson Piramuthu, Svetlana Lazebnik.
(ECCV 2018)
[paper]

Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach.
Angelo Carraggi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara.
(ECCV 2018)
[paper]

Stacked Cross Attention for Image-Text Matching.
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, Xiaodong He.
(ECCV 2018)
[paper]

CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images.
Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, Dinglong Huang.
(ECCV 2018)
[paper] [code](Caffe)

A Strong and Robust Baseline for Text-Image Matching.
Fangyu Liu, Rongtian Ye.
(ACL Student Research Workshop 2019)
[paper]

Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations.
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma.
(CVPR 2019)
[paper]

Engaging Image Captioning via Personality.
Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, Jason Weston.
(CVPR 2019)
[paper]

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval.
Yale Song, Mohammad Soleymani.
(CVPR 2019)
[paper]

Composing Text and Image for Image Retrieval - An Empirical Odyssey.
Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays.
(CVPR 2019)
[paper]

Annotation Efficient Cross-Modal Retrieval with Adversarial Attentive Alignment.
Po-Yao Huang, Guoliang Kang, Wenhe Liu, Xiaojun Chang, Alexander G. Hauptmann.
(ACM MM 2019)
[paper]

Visual Semantic Reasoning for Image-Text Matching.
Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu.
(ICCV 2019)
[paper]

Adversarial Representation Learning for Text-to-Image Matching.
Nikolaos Sarafianos, Xiang Xu, Ioannis A. Kakadiaris.
(ICCV 2019)
[paper]

CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval.
Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, Jing Shao.
(ICCV 2019)
[paper]

Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment.
Samyak Datta, Karan Sikka, Anirban Roy, Karuna Ahuja, Devi Parikh, Ajay Divakaran.
(ICCV 2019)
[paper]

Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations.
Po-Yao Huang, Xiaojun Chang, Alexander Hauptmann.
(EMNLP 2019)
[paper]

Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents.
Jack Hessel, Lillian Lee, David Mimno.
(EMNLP 2019)
[paper] [code]

HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs.
Fangyu Liu, Rongtian Ye, Xun Wang, Shuaipeng Li.
(AAAI 2020)
[paper] [code](PyTorch)

Ladder Loss for Coherent Visual-Semantic Embedding.
Mo Zhou, Zhenxing Niu, Le Wang, Zhanning Gao, Qilin Zhang, Gang Hua.
(AAAI 2020)
[paper]

Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching.
Tianlang Chen, Jiebo Luo.
(AAAI 2020)
[paper]

Adaptive Cross-modal Embeddings for Image-Text Alignment.
Jonatas Wehrmann, Camila Kolling, Rodrigo C Barros.
(AAAI 2020)
[paper] [code](PyTorch)

Graph Structured Network for Image-Text Matching.
Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang.
(CVPR 2020)
[paper]

IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval.
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han.
(CVPR 2020)
[paper]

Visual-Semantic Matching by Exploring High-Order Attention and Distraction.
Yongzhi Li, Duo Zhang, Yadong Mu.
(CVPR 2020)
[paper]

Multi-Modality Cross Attention Network for Image and Sentence Matching.
Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, Feng Wu.
(CVPR 2020)
[paper]

Context-Aware Attention Network for Image-Text Retrieval.
Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li.
(CVPR 2020)
[paper]

Universal Weighting Metric Learning for Cross-Modal Matching.
Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, Heng Tao Shen.
(CVPR 2020)
[paper]

Graph Optimal Transport for Cross-Domain Alignment.
Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, Jingjing Liu.
(ICML 2020)
[paper]

Adaptive Offline Quintuplet Loss for Image-Text Matching.
Tianlang Chen, Jiajun Deng, Jiebo Luo.
(ECCV 2020)
[paper] [code](PyTorch)

Learning Joint Visual Semantic Matching Embeddings for Language-guided Retrieval.
Yanbei Chen, Loris Bazzani.
(ECCV 2020)
[paper]

Consensus-Aware Visual-Semantic Embedding for Image-Text Matching.
Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, Lin Ma.
(ECCV 2020)
[paper] [code](PyTorch)

Contrastive Learning for Weakly Supervised Phrase Grounding.
Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem.
(ECCV 2020)
[paper] [code](PyTorch)

Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval.
Christopher Thomas, Adriana Kovashka.
(ECCV 2020)
[paper]

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case.
Adam Dahlgren Lindström, Suna Bensch, Johanna Björklund, Frank Drewes.
(COLING 2020)
[paper] [code](PyTorch)

Journals

Large scale image annotation: learning to rank with joint word-image embeddings.
Jason Weston, Samy Bengio, Nicolas Usunier.
(Machine Learning 2010)
[paper]

Grounded Compositional Semantics for Finding and Describing Images with Sentences.
Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng.
(TACL 2014)
[paper]

Learning Two-Branch Neural Networks for Image-Text Matching Tasks.
Liwei Wang, Yin Li, Jing Huang, Svetlana Lazebnik.
(TPAMI 2019)
[paper] [code](TF)

Upgrading the Newsroom: An Automated Image Selection System for News Articles.
Fangyu Liu, Rémi Lebret, Didier Orel, Philippe Sordet, Karl Aberer.
(ACM TOMM 2020)
[paper] [slides] [demo]