Reading List for Topics in Multimodal Machine Learning

By Paul Pu Liang (pliang@cs.cmu.edu), Machine Learning Department and Language Technologies Institute, CMU, with help from Yao Chong Lim (yaochonl@cs.cmu.edu) and other members of the MultiComp Lab at LTI, CMU. If there are any areas, papers, or datasets I missed, please let me know!

Research Papers

Survey Papers

Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2018

Representation Learning: A Review and New Perspectives, TPAMI 2013

Core Areas

Representation Learning

Deep Multimodal Representation Learning: A Survey, 2019

Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations, CVPR 2019

Multi-Task Learning of Hierarchical Vision-Language Representation, CVPR 2019

Learning Factorized Multimodal Representations, ICLR 2019 [code]

A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks, ICML 2018

Do Neural Network Cross-Modal Mappings Really Bridge Modalities?, ACL 2018

Deep Multimodal Representation Learning from Temporal Data, CVPR 2017

Multimodal Learning with Deep Boltzmann Machines, JMLR 2014

DeViSE: A Deep Visual-Semantic Embedding Model, NeurIPS 2013

Multimodal Deep Learning, ICML 2011

Multimodal Fusion

MFAS: Multimodal Fusion Architecture Search, CVPR 2019

The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision, ICLR 2019 [code]

Efficient Low-rank Multimodal Fusion with Modality-Specific Factors, ACL 2018 [code]

Multimodal Alignment

On Deep Multi-View Representation Learning, ICML 2015

Multimodal Alignment of Videos, MM 2014

Deep Canonical Correlation Analysis, ICML 2013 [code]

Knowledge Graphs and Knowledge Bases

MMKG: Multi-Modal Knowledge Graphs, ESWC 2019

Answering Visual-Relational Queries in Web-Extracted Knowledge Graphs, AKBC 2019

Embedding Multimodal Relational Data for Knowledge Base Completion, EMNLP 2018

A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning, *SEM 2018 [code]

Order-Embeddings of Images and Language, ICLR 2016 [code]

Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries, arXiv 2015

Interpretable Learning

Multimodal Explanations by Predicting Counterfactuality in Videos, CVPR 2019

Multimodal Explanations: Justifying Decisions and Pointing to the Evidence, CVPR 2018 [code]

Do Explanations make VQA Models more Predictable to a Human?, EMNLP 2018

Towards Transparent AI Systems: Interpreting Visual Question Answering Models, ICML Workshop on Visualization for Deep Learning 2016

Generative Learning

Multimodal Generative Models for Scalable Weakly-Supervised Learning, NeurIPS 2018 [code1] [code2]

Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, CVPR 2018

The Multi-Entity Variational Autoencoder, NeurIPS 2017

Semi-supervised Learning

Semi-supervised Vision-language Mapping via Variational Learning, ICRA 2017

Semi-supervised Multimodal Hashing, arXiv 2017

Semi-Supervised Multimodal Deep Learning for RGB-D Object Recognition, IJCAI 2016

Multimodal Semi-supervised Learning for Image Classification, CVPR 2010

Self-supervised Learning

Self-Supervised Learning from Web Data for Multimodal Retrieval, arXiv 2019

Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces, CVPR 2017

Multimodal Dynamics: Self-supervised Learning in Perceptual and Motor Systems, 2016

Language Models

Neural Language Modeling with Visual Features, arXiv 2019

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, ICML 2014 [code]

Applications

Language and Visual QA

MUREL: Multimodal Relational Reasoning for Visual Question Answering, CVPR 2019 [code]

Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence, CVPR 2019 [code]

Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering, ICML 2019 [code]

Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, NeurIPS 2018 [code]

RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes, EMNLP 2018 [code]

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, CVPR 2018 [code]

Stacked Latent Attention for Multimodal Reasoning, CVPR 2018

Learning to Reason: End-to-End Module Networks for Visual Question Answering, ICCV 2017 [code]

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017 [code] [dataset generation]

Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension, CVPR 2017 [code]

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP 2016 [code]

MovieQA: Understanding Stories in Movies through Question-Answering, CVPR 2016 [code]

VQA: Visual Question Answering, ICCV 2015 [code]

Language Grounding in Vision

Grounded Video Description, CVPR 2019

Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions, CVPR 2019

Visual Coreference Resolution in Visual Dialog using Neural Module Networks, ECCV 2018 [code]

Using Syntax to Ground Referring Expressions in Natural Images, AAAI 2018 [code]

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts, NeurIPS 2017

Localizing Moments in Video with Natural Language, ICCV 2017

What are you talking about? Text-to-Image Coreference, CVPR 2014

Language Grounding in Navigation

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation, ACL 2019

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation, CVPR 2019

The Regretful Navigation Agent for Vision-and-Language Navigation, CVPR 2019 [code]

Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation, CVPR 2019 [code]

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation, ICLR 2019 [code]

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout, NAACL 2019 [code]

Embodied Question Answering, CVPR 2018 [code]

Multimodal Machine Translation

Probing the Need for Visual Context in Multimodal Machine Translation, NAACL 2019

Multi30K: Multilingual English-German Image Descriptions, ACL Workshop on Language and Vision 2016

Does Multimodality Help Human and Machine for Translation and Image Captioning?, ACL WMT 2016

Multi-agent Communication

Emergence of Compositional Language with Deep Generational Transmission, ICML 2019

On the Pitfalls of Measuring Emergent Communication, AAMAS 2019 [code]

Emergent Translation in Multi-Agent Communication, ICLR 2018 [code]

Emergence of Linguistic Communication From Referential Games with Symbolic and Pixel Input, ICLR 2018

Emergent Communication through Negotiation, ICLR 2018 [code]

Emergence of Grounded Compositional Language in Multi-Agent Populations, AAAI 2018

Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols, NeurIPS 2017

Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog, EMNLP 2017 [code1] [code2]

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning, ICCV 2017 [code]

Multi-agent cooperation and the emergence of (natural) language, ICLR 2017

Learning to communicate with deep multi-agent reinforcement learning, NeurIPS 2016

Learning multiagent communication with backpropagation, NeurIPS 2016

The Emergence of Compositional Structures in Perceptually Grounded Language Games, AI 2005

Commonsense Reasoning

SocialIQA: Commonsense Reasoning about Social Interactions, arXiv 2019

From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019 [code]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, NAACL 2019

Multimodal Reinforcement Learning

Habitat: A Platform for Embodied AI Research, arXiv 2019 [code]

Embodied Multimodal Multitask Learning, arXiv 2019

Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog, SIGDIAL 2018

Multimodal Dialog

MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations, ACL 2019 [code]

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog, NAACL 2019 [code]

Dialog-based Interactive Image Retrieval, NeurIPS 2018 [code]

Towards Building Large Scale Multimodal Domain-Aware Conversation Systems, arXiv 2017 [code]

Visual Dialog, CVPR 2017 [code]

Language and Audio

Audio Caption: Listen and Tell, ICASSP 2019

Audio-Linguistic Embeddings for Spoken Sentences, ICASSP 2019

From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings, arXiv 2019

From Audio to Semantics: Approaches To End-to-end Spoken Language Understanding, arXiv 2018

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning, ICLR 2018

Deep Voice 2: Multi-Speaker Neural Text-to-Speech, NeurIPS 2017

Deep Voice: Real-time Neural Text-to-Speech, ICML 2017

Text-to-Speech Synthesis, 2009

Audio and Visual

Reconstructing Faces from Voices, arXiv 2019

Speech2Face: Learning the Face Behind a Voice, CVPR 2019 [code]

Disjoint Mapping Network for Cross-modal Matching of Voices and Faces, ICLR 2019

Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks, ICASSP 2019 [code]

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input, ECCV 2018 [code]

Seeing Voices and Hearing Faces: Cross-modal Biometric Matching, CVPR 2018 [code]

Unsupervised Learning of Spoken Language with Visual Context, NeurIPS 2016

SoundNet: Learning Sound Representations from Unlabeled Video, NeurIPS 2016 [code]

Media Description

Neural Baby Talk, CVPR 2018 [code]

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos, CVPR 2018 [code]

Neural Motifs: Scene Graph Parsing with Global Context, CVPR 2018 [code]

Generating Descriptions with Grounded and Co-Referenced People, CVPR 2017

Review Networks for Caption Generation, NeurIPS 2016 [code]

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, ECCV 2016 [code]

Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge, TPAMI 2016 [code]

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015 [code]

Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR 2015 [code]

Show and Tell: A Neural Image Caption Generator, CVPR 2015 [code]

A Dataset for Movie Description, CVPR 2015 [code]

What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision, NAACL 2015 [code]

Microsoft COCO: Common Objects in Context, ECCV 2014 [code]

Affect Recognition

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph, ACL 2018 [code]

AMHUSE - A Multimodal dataset for HUmor SEnsing, ICMI 2017 [code]

Collecting Large, Richly Annotated Facial-Expression Databases from Movies, IEEE Multimedia 2012 [code]

Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper), ACL 2019 [code]

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) Database [code]

Decoding Children’s Social Behavior, CVPR 2013 [code]

Healthcare

Unsupervised Multimodal Representation Learning across Medical Images and Reports, ML4H 2018

Multimodal Medical Image Retrieval based on Latent Topic Modeling, ML4H 2018

Improving Hospital Mortality Prediction with Medical Named Entities and Multimodal Learning, ML4H 2018

Knowledge-driven Generative Subspaces for Modeling Multi-view Dependencies in Medical Data, ML4H 2018

Multimodal Depression Detection: Fusion Analysis of Paralinguistic, Head Pose and Eye Gaze Behaviors, TAC 2018

Learning the Joint Representation of Heterogeneous Temporal Events for Clinical Endpoint Prediction, AAAI 2018

Understanding Coagulopathy using Multi-view Data in the Presence of Sub-Cohorts: A Hierarchical Subspace Approach, MLHC 2017

Machine Learning in Multimodal Medical Imaging, 2017

Cross-modal Recurrent Models for Weight Objective Prediction from Multimodal Time-series Data, ML4H 2017

SimSensei Kiosk: A Virtual Human Interviewer for Healthcare Decision Support, AAMAS 2014

Dyadic Behavior Analysis in Depression Severity Assessment Interviews, ICMI 2014

Audiovisual Behavior Descriptors for Depression Assessment, ICMI 2013

Robotics

Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks, ICRA 2019

Evolving Multimodal Robot Behavior via Many Stepping Stones with the Combinatorial Multi-Objective Evolutionary Algorithm, arXiv 2018

Multimodal Probabilistic Model-Based Planning for Human-Robot Interaction, arXiv 2017

Perching and Vertical Climbing: Design of a Multimodal Robot, ICRA 2014

Multi-Modal Scene Understanding for Robotic Grasping, 2011

Strategies for Multi-Modal Scene Exploration, IROS 2010

Workshops

Beyond Vision and Language: Integrating Real-World Knowledge, EMNLP 2019

The How2 Challenge: New Tasks for Vision & Language, ICML 2019

Visual Question Answering and Dialog, CVPR 2019, CVPR 2017

Multi-modal Learning from Videos, CVPR 2019

Multimodal Learning and Applications Workshop, CVPR 2019, ECCV 2018

Habitat: Embodied Agents Challenge and Workshop, CVPR 2019

Closing the Loop Between Vision and Language & LSMD Challenge, ICCV 2019

Multi-modal Video Analysis and Moments in Time Challenge, ICCV 2019

Cross-Modal Learning in Real World, ICCV 2019

Spatial Language Understanding and Grounded Communication for Robotics, NAACL 2019

YouTube-8M Large-Scale Video Understanding, ICCV 2019, ECCV 2018, CVPR 2017

Language and Vision Workshop, CVPR 2019, CVPR 2018, CVPR 2017, CVPR 2015

Sight and Sound, CVPR 2019, CVPR 2018

The Large Scale Movie Description Challenge (LSMDC), ICCV 2019, ICCV 2017

Visually Grounded Interaction and Language, NeurIPS 2018

Wordplay: Reinforcement and Language Learning in Text-based Games, NeurIPS 2018

Interpretability and Robustness in Audio, Speech, and Language, NeurIPS 2018

Multimodal Robot Perception, ICRA 2018

WMT18: Shared Task on Multimodal Machine Translation, EMNLP 2018

Shortcomings in Vision and Language, ECCV 2018

Grand Challenge and Workshop on Human Multimodal Language, ACL 2018

Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, EMNLP 2018, EMNLP 2017, NAACL-HLT 2016, EMNLP 2015, ACL 2014, NAACL-HLT 2013

Visual Understanding Across Modalities, CVPR 2017

International Workshop on Computer Vision for Audio-Visual Media, ICCV 2017

Language Grounding for Robotics, ACL 2017

Computer Vision for Audio-visual Media, ECCV 2016

Language and Vision, ACL 2016, EMNLP 2015

Tutorials

Connecting Language and Vision to Actions, ACL 2018

Machine Learning for Clinicians: Advances for Multi-Modal Health Data, MLHC 2018

Multimodal Machine Learning, ACL 2017, CVPR 2016, ICMI 2016

Vision and Language: Bridging Vision and Language with Deep Learning, ICIP 2017

Courses

CMU 11-777, Advanced Multimodal Machine Learning

CMU 16-785, Integrated Intelligence in Robotics: Vision, Language, and Planning

CMU 10-808, Language Grounding to Vision and Control

CMU 11-775, Large-Scale Multimedia Analysis

MIT 6.882, Embodied Intelligence

Georgia Tech CS 8803, Vision and Language

University of Virginia CS 6501-004, Vision & Language