DMER

A survey of deep multimodal emotion recognition.

Performance Comparison

Summary of latest papers



Update

2022.03.21 Add papers from ACM MM 2021

2022.05.04 Add the pages of performance comparison and the summary of latest papers.




Structure



Related Github Repositories



Datasets

Datasets Year Features Paper Used
MEmoR ACM MM 2020 Visual, Audio, Text transcripts Memor: A dataset for multimodal emotion reasoning in videos
EMOTIC Dataset 2019 TPAMI Face, Context Context based emotion recognition using emotic dataset
CMU-MOSEI ACL 2018 Visual, Audio, Language Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph
ASCERTAIN Dataset 2018 TAC Facial activity data, Physiological data ASCERTAIN: Emotion and Personality Recognition Using Commercial Sensors
RAVDESS 2018 PLoS ONE Visual, Audio The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English
CMU-MOSI 2016 IEEE Intelligent Systems Visual, Audio, Language Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages
Multimodal Spontaneous Emotion Database (BP4D+) CVPR 2016 Face, Thermal data, Physiological data Multimodal Spontaneous Emotion Corpus for Human Behavior Analysis
EmotiW Database Visual, Audio
LIRIS-ACCEDE Database 2015 TAC Visual, Audio LIRIS-ACCEDE: A Video Database for Affective Content Analysis
CREMA-D 2014 TAC Visual, Audio CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset
POM ICMI 2014 Acoustic Descriptors, Verbal and Para-Verbal Descriptors, Visual Descriptors Computational Analysis of Persuasiveness in Social Multimedia: A Novel Dataset and Multimodal Prediction Approach
SEMAINE Database 2012 TAC Visual, Audio The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent
MAHNOB-HCI 2011 TAC Visual, Eye gaze, Physiological data A Multimodal Database for Affect Recognition and Implicit Tagging
IEMOCAP Database 2008 LRE Visual, Audio, Text transcripts IEMOCAP: Interactive emotional dyadic motion capture database
eNTERFACE Dataset ICDEW 2006 Visual, Audio The eNTERFACE'05 audio-visual emotion database



Related Challenges



Related Projects



Related Reviews



Video-Audio Method

Index Model Paper Year Project Dataset Method
VA-0 VAANet An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos AAAI 2020 [coding] VideoEmotion-8, Ekman-6 Visual+Audio, Attention-Based Model
VA-1 Audio-Visual Emotion Forecasting: Characterizing and Predicting Future Emotion Using Deep Learning FG 2019 [coding] IEMOCAP Face + Speech, Emotion forecasting
VA-2 MMDDN Multimodal Deep Denoise Framework for Affective Video Content Analysis MM 2019 LIRIS-ACCEDE Visual (colors, facial expressions, human gestures) + Audio (pitch, tone, background music)
VA-3 DBN Deep learning for robust feature generation in audiovisual emotion recognition ICASSP 2013 IEMOCAP Visual + Audio
VA-4 MAFN Multi-Attention Fusion Network for Video-based Emotion Recognition ICMI 2019 AFEW Visual+Audio, Multiple attention fusion network
VA-5 MERML Metric Learning-Based Multimodal Audio-Visual Emotion Recognition 2019 TMM eNTERFACE, CREMA-D Visual+Audio, Video and Audio Feature Extraction and Aggregation, MERML
VA-6 Audio-Visual Emotion Recognition in Video Clips 2019 TAC SAVEE, eNTERFACE'05, RML Visual + Audio
VA-7 EmoBed EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings 2019 TAC RECOLA, OMG-Emotion Face + Audio
VA-8 Affective video content analysis based on multimodal data fusion in heterogeneous networks 2019 Information Fusion Visual + Audio
VA-9 Audio-visual emotion fusion (AVEF): A deep efficient weighted approach 2019 Information Fusion RML, eNTERFACE'05, BAUM-1s Visual + Audio
VA-10 Joint low rank embedded multiple features learning for audio–visual emotion recognition 2019 Neurocomputing Visual + Audio
VA-11 MLRF Multimodal Local-Global Ranking Fusion for Emotion Recognition ICMI 2018 AVEC16(RECOLA) Visual + Audio
VA-12 A Combined Rule-Based & Machine Learning Audio-Visual Emotion Recognition Approach 2018 TAC Cohn-Kanade, eNTERFACE'05 Visual + Audio
VA-13 MMDRBN A Multimodal Deep Regression Bayesian Network for Affective Video Content Analyses ICCV 2017 LIRIS-ACCEDE Video + Audio, Multimodal deep regression Bayesian network
VA-14 ASER Automatic speech emotion recognition using recurrent neural networks with local attention ICASSP 2017 [coding] IEMOCAP Speech
VA-15 Modeling Multimodal Cues in a Deep Learning-Based Framework for Emotion Recognition in the Wild ICMI 2017 AFEW, FER-2013, eNTERFACE dataset Visual + Audio
VA-16 Emotion recognition with multimodal features and temporal models ICMI 2017 EmotiW(2017) Visual+Audio, Temporal model(LSTM) for facial features
VA-17 Learning Affective Features With a Hybrid Deep Model for Audio–Visual Emotion Recognition 2017 T-CSVT [coding] RML, eNTERFACE'05, BAUM-1s Visual (3D-CNN) + Audio (CNN), DBN fusion
VA-18 EEMER End-to-End Multimodal Emotion Recognition Using Deep Neural Networks 2017 IEEE Journal of Selected Topics in Signal Processing [coding] RECOLA Visual + Audio
VA-19 MCEF Multi-cue fusion for emotion recognition in the wild ICMI 2016 Cohn-Kanade+, eNTERFACE'05 Visual + Audio
VA-20 DemoFBVP Multimodal emotion recognition using deep learning architectures WACV 2016 [emoFBVP] Visual + Audio
VA-21 Exploring Cross-Modality Affective Reactions for Audiovisual Emotion Recognition 2013 TAC IEMOCAP, SEMAINE Face + Audio
VA-22 Kernel Cross-Modal Factor Analysis for Information Fusion With Application to Bimodal Emotion Recognition 2012 TMM RML, eNTERFACE Visual + Audio, Kernel cross-modal factor analysis
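
Several of the attention-based entries above (e.g., VA-0 VAANet, VA-4 MAFN) learn weights over the audio and visual embeddings before classification. The following is a minimal PyTorch sketch of that general idea, not any paper's released code; it assumes upstream encoders already produce clip-level features, and the feature sizes, layer names, and 8-class output are placeholders.

```python
# Minimal sketch of attention-based audio-visual fusion (hypothetical dimensions).
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, hidden_dim=256, num_classes=8):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, 1)          # scores each modality embedding
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, visual_feat, audio_feat):
        # Stack the projected modality embeddings: (batch, 2, hidden_dim)
        h = torch.stack([torch.tanh(self.visual_proj(visual_feat)),
                         torch.tanh(self.audio_proj(audio_feat))], dim=1)
        w = torch.softmax(self.attn(h), dim=1)        # attention weights over modalities
        fused = (w * h).sum(dim=1)                    # weighted sum -> (batch, hidden_dim)
        return self.classifier(fused)

model = AttentionFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 128))  # toy batch of 4 clips
print(logits.shape)  # torch.Size([4, 8])
```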



Context-aware Method

Index Model Paper Year Project Dataset Method
CA-1 EmotiCon EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege’s Principle CVPR 2020 [video] [project] EMOTIC, [GroupWalk] Face+Gait+(Depth+Background), Multiplicative fusion, etc
CA-2 CAER-Net Context-Aware Emotion Recognition Networks ICCV 2019 [coding][project] EMOTIC, AffectNet, [CAER-S], AFEW, [CAER] Face + Context, Adaptive Fusion
CA-3 Context-aware affective graph reasoning for emotion recognition ICME 2019
CA-4 Context Based Emotion Recognition using EMOTIC Dataset 2019 TPAMI [coding] EMOTIC Face + Context
CA-5 Multimodal Framework for Analyzing the Affect of a Group of People 2018 TMM HAPPEI, GAFF Face+Upper body+Scene, Face-based Group-level Emotion Recognition
CA-6 Emotion Recognition in Context CVPR 2017 [project] [EMOTIC] Body feature+Image feature(Context)
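
The context-aware entries above typically run one stream over the cropped face and another over the surrounding scene, then fuse them adaptively (e.g., CA-2 CAER-Net). Below is a minimal sketch of such a two-stream gate, not the released CAER-Net or EmotiCon code; the tiny CNN encoders, input sizes, and 7-class output are placeholders.

```python
# Minimal sketch of face + context fusion with a learned per-sample gate (hypothetical sizes).
import torch
import torch.nn as nn

def small_encoder(out_dim=128):
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, out_dim))

class ContextAwareNet(nn.Module):
    def __init__(self, feat_dim=128, num_classes=7):
        super().__init__()
        self.face_stream = small_encoder(feat_dim)       # cropped face
        self.context_stream = small_encoder(feat_dim)    # e.g., face-masked frame
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, face_img, context_img):
        f = self.face_stream(face_img)
        c = self.context_stream(context_img)
        w = self.gate(torch.cat([f, c], dim=-1))         # adaptive weights per sample
        fused = w[:, :1] * f + w[:, 1:] * c
        return self.classifier(fused)

model = ContextAwareNet()
out = model(torch.randn(2, 3, 96, 96), torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 7])
```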



Video-Audio-Text Method

Index Model Paper Year Project Dataset Method
VAT-1 Self-MM Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis AAAI 2021 [coding] CMU-MOSI, CMU-MOSEI, SIMS Video+Speech+Text
VAT-2 CTNet CTNet: Conversational Transformer Network for Emotion Recognition 2021 TASLP - IEMOCAP, MELD Face+Speech+Text
VAT-3 M3ER M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues AAAI 2020 [video] [project] IEMOCAP, CMU-MOSEI Face+Speech+Text, Multiplicative Fusion, CCA, Modality Check
VAT-4 ARGF Modality to modality translation: An adversarial representation learning and graph fusion network for multimodal fusion AAAI 2020 IEMOCAP, CMU-MOSEI, CMU-MOSI Face+Speech+Text
VAT-5 ICCN Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis AAAI 2020 IEMOCAP, CMU-MOSI, CMU-MOSEI Video+Audio+Text
VAT-6 MISA MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis ACM MM 2020 [coding] MOSI and MOSEI Face+Speech+Text
VAT-7 CM-BERT CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis ACM MM 2020 [coding] MOSI and MOSEI Speech+Text
VAT-8 SWAFN SWAFN: Sentimental Words Aware Fusion Network for Multimodal Sentiment Analysis ACL 2020 [coding1] [coding2] CMU-MOSI, CMU-MOSEI and YouTube Video+Speech+Text
VAT-9 MAG- Integrating Multimodal Information in Large Pretrained Transformers ACL 2020 [coding] CMU-MOSI, CMU-MOSEI Video+Speech+Text
VAT-10 MTEE Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition AACL-IJCNLP 2020 [coding] IEMOCAP, CMU-MOSEI Visual + Text + Audio
VAT-11 Multimodal Routing Multimodal Routing: Improving Local and Global Interpretability of Multimodal Language Analysis EMNLP 2020 [coding] IEMOCAP, CMU-MOSEI Face+Speech+Text
VAT-12 SF-SSL Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition InterSpeech 2020 [coding] IEMOCAP, CMU-MOSI, CMU-MOSEI Video+Speech+Text
VAT-13 Multimodal Deep Learning Framework for Mental Disorder Recognition FG 2020 [coding] Bipolar Disorder Corpus (BDC), Extended Distress Analysis Interview Corpus (E-DAIC) Visual + Audio + Text
VAT-14 Deep-HOSeq Deep Higher Order Sequence Fusion for Multimodal Sentiment Analysis ICDM 2020 [coding] CMU-MOSEI and CMU-MOSI Visual + Audio + Text
VAT-15 DFF-ATMF Complementary fusion of multi-features and multi-modalities in sentiment analysis AAAI-WK 2020 [coding] CMU-MOSEI and CMU-MOSI, IEMOCAP Visual + Audio + Text
VAT-16 MSAF MSAF: Multimodal Split Attention Fusion 2020 [coding] RAVDESS, CMU-MOSEI, NTU RGB+D Visual + Audio + Text
VAT-17 DCVDN Visual-Texual Emotion Analysis With Deep Coupled Video and Danmu Neural Networks 2020 TMM ISEAR, Multi-Domain Sentiment Dataset, [Video-Danmu] Visual + Text
VAT-18 LMFN Locally Confined Modality Fusion Network With a Global Perspective for Multimodal Human Affective Computing 2020 TMM CMU-MOSEI, MELD, IEMOCAP Visual + Audio + Language
VAT-19 SSE-FT Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion 2020 IEEE ACCESS [coding] CMU-MOSEI, IEMOCAP Visual + Audio + Language
VAT-20 HPFN Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling NeurIPS 2019 [coding] CMU-MOSI, IEMOCAP Visual+Audio+Text
VAT-21 MFM Learning Factorized Multimodal Representations ICLR 2019 [coding] CMU-MOSI, IEMOCAP, POM, MOUD, ICT-MMMO, and YouTube Visual+Audio+Text
VAT-22 MCTN Found in translation: Learning robust joint representations by cyclic translations between modalities AAAI 2019 [coding] CMU-MOSI, ICT-MMMO, and YouTube Visual+Audio+Text
VAT-23 RAVEN Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors AAAI 2019 [coding] CMU-MOSI, IEMOCAP Visual+Audio+Text
VAT-24 HFFN Divide, Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing ACL 2019 [coding] CMU-MOSEI, IEMOCAP Visual + Audio + Language
VAT-25 DeepCU DeepCU: Integrating both Common and Unique Latent Information for Multimodal Sentiment Analysis IJCAI 2019 [coding] CMU-MOSI, POM Visual+Audio+Text
VAT-26 MMResLSTM Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition MM 2019 [coding] IEMOCAP, MELD Audio+Text, Multimodal residual LSTM
VAT-27 CIA Context-aware Interactive Attention for Multi-modal Sentiment and Emotion Analysis EMNLP 2019 MOUD, MOSI, YouTube, ICT-MMMO, and MOSEI Video+Audio+Text, Recurrent network with an auto-encoder to learn inter-modal interactions
VAT-28 CIM-MTL Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis NAACL 2019 MOSEI Video+Audio+Text
VAT-29 LAMER Learning Alignment for Multimodal Emotion Recognition from Speech InterSpeech 2019 [coding] IEMOCAP Speech + Text
VAT-30 MHA Speech emotion recognition using multi-hop attention mechanism ICASSP 2019 [coding] IEMOCAP Audio + Text
VAT-31 MSER Multimodal speech emotion recognition and ambiguity resolution 2019 [coding] IEMOCAP Audio + Text
VAT-32 EmoBed EmoBed: Strengthening Monomodal Emotion Recognition via Training with Crossmodal Emotion Embeddings 2019 TAC RECOLA, OMG-Emotion Visual + Text
VAT-33 MulT Multimodal Transformer for Unaligned Multimodal Language Sequences ACL 2019 [coding] CMU-MOSI & MOSEI, IEMOCAP Visual + Text + Speech
VAT-34 MFN Memory fusion network for multi-view sequential learning AAAI 2018 [coding] CMU-MOSI, ICT-MMMO, YouTube, MOUD, IEMOCAP and POM Visual + Text + Speech
VAT-35 MARN Multi-Attention Recurrent Network for Human Communication Comprehension AAAI 2018 [coding] CMU-MOSI, ICT-MMMO, YouTube, MOUD, IEMOCAP and POM Visual + Text + Speech
VAT-36 Graph-MFN Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph ACL 2018 [coding] [CMU-MOSEI] Visual + Audio + Text
VAT-37 LMF Efficient Low-rank Multimodal Fusion with Modality-Specific Factors ACL 2018 [coding] IEMOCAP, CMU-MOSEI, POM Visual + Audio + Text, Perform multimodal fusion using low-rank tensors
VAT-38 RMFN Multimodal Language Analysis with Recurrent Multistage Fusion EMNLP 2018 IEMOCAP, CMU-MOSI, POM Visual + Audio + Text
VAT-39 MRTN Multimodal Relational Tensor Network for Sentiment and Emotion Classification ACL-Challenge-HML 2018 [coding] CMU-MOSEI Visual + Audio + Text
VAT-40 Convolutional attention networks for multimodal emotion recognition from speech and text data ACL-Challenge-HML 2018 CMU-MOSEI Speech + Text
VAT-41 HFusion Multimodal sentiment analysis using hierarchical fusion with context modeling 2018 Knowledge-Based Systems [coding] MOSI, IEMOCAP Visual + Audio + Text
VAT-42 BC-LSTM Context-dependent sentiment analysis in user-generated videos ACL 2017 [coding] CMU-MOSI, MOUD, IEMOCAP Audio+Video+Text
VAT-43 TFN Tensor Fusion Network for multimodal sentiment analysis EMNLP 2017 [coding] CMU-MOSI Audio+Video+Text
VAT-44 MLMA Multi-level multiple attentions for contextual multimodal sentiment analysis ICDM 2017 [coding] CMU-MOSI Audio+Video+Text
VAT-45 MV-LSTM Extending long short-term memory for multi-view structured learning ECCV 2016 MMDB Audio+Video+Text
VAT-46 Fusing audio, visual and textual clues for sentiment analysis from multimodal content 2016 Neurocomputing YouTube dataset, SenticNet, EmoSenticNet Visual + Audio + Text
VAT-47 Temporal Bayesian Fusion for Affect Sensing: Combining Video, Audio, and Lexical Modalities 2015 TC SEMAINE Face + Audio + Lexical features, Temporal Bayesian fusion
VAT-48 Towards an intelligent framework for multimodal affective data analysis 2015 Neural Networks eNTERFACE Visual + Audio + Text
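
Several fusion strategies in this table are tensor-based: VAT-43 (TFN) pads each modality embedding with a constant 1 and takes the outer product of the three, so unimodal, bimodal, and trimodal interactions all appear as features, while VAT-37 (LMF) factorizes that tensor with low-rank weights to keep it tractable. The sketch below shows only the plain tensor-fusion step; the dimensions and class count are placeholders, not the settings of the original papers.

```python
# Minimal sketch of tensor fusion over three modality embeddings (hypothetical sizes).
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    def __init__(self, dims=(16, 8, 32), num_classes=7):
        super().__init__()
        dv, da, dt = dims
        fused_dim = (dv + 1) * (da + 1) * (dt + 1)      # each modality is padded with a 1
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, v, a, t):
        one = torch.ones(v.size(0), 1, device=v.device)
        v, a, t = torch.cat([v, one], -1), torch.cat([a, one], -1), torch.cat([t, one], -1)
        # Outer product across the three modalities via batched einsum
        fused = torch.einsum('bi,bj,bk->bijk', v, a, t).flatten(1)
        return self.classifier(fused)

model = TensorFusion()
logits = model(torch.randn(4, 16), torch.randn(4, 8), torch.randn(4, 32))
print(logits.shape)  # torch.Size([4, 7])
```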



Attribute-based

Index Model Paper Year Project Dataset Method
AB-1 MMDRBN Knowledge-Augmented Multimodal Deep Regression Bayesian Networks for Emotion Video Tagging 2019 TMM LIRIS-ACCEDE Visual + Audio + Attribute
AB-2 Recognizing Induced Emotions of Movie Audiences From Multimodal Information 2019 TAC LIRIS-ACCEDE Visual + Audio + Dialogue + Attribute
AB-3 Multimodal emotional state recognition using sequence-dependent deep hierarchical features 2015 Neural Networks FABO Face + Upper-body
AB-4 Context-Sensitive Learning for Enhanced Audiovisual Emotion Classification 2012 TAC IEMOCAP Visual + Audio + Utterance
AB-5 Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence-Arousal Space 2011 TAC SAL-DB Face + Shoulder gesture + Audio



Aspect-based Network

Index Model Paper Year Project Dataset Method
ABN-1 MIMN Multi-Interactive Memory Network for Aspect Based Multimodal Sentiment Analysis AAAI 2019 [coding] [Multi-ZOL] Text+Aspect+Images, Aspect based multimodal sentiment analysis
ABN-2 VistaNet VistaNet: Visual Aspect Attention Network for Multimodal Sentiment Analysis AAAI 2019 [coding] [Yelp-Food-Restaurants] Visual+Text
ABN-3 Cooperative Multimodal Approach to Depression Detection in Twitter AAAI 2019 Textual Depression Dataset, Multimodal Depression Dataset Visual+Text, GRU+VGG-Net+COMMA
ABN-4 TomBERT Adapting BERT for Target-Oriented Multimodal Sentiment Classification IJCAI 2019 [coding] Multimodal Twitter datasets Image+Text, BERT-based
ABN-5 Predicting Emotions in User-Generated Videos AAAI 2014 [Dataset] Visual+Audio+Attribute, Video content recognition
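
Entries such as ABN-1 (MIMN) and ABN-4 (TomBERT) condition the fusion on a target aspect: an aspect representation attends over the text tokens and the image regions, and the attended vectors drive the sentiment prediction. The following is a minimal sketch of that aspect-conditioned attention, not the code of either paper; all dimensions, the number of regions/tokens, and the 3-way output are placeholders.

```python
# Minimal sketch of aspect-conditioned attention over text and image features (hypothetical sizes).
import torch
import torch.nn as nn

class AspectAttention(nn.Module):
    def __init__(self, dim=128, num_classes=3):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, aspect, text_feats, image_feats):
        # aspect: (batch, 1, dim); text_feats: (batch, T, dim); image_feats: (batch, R, dim)
        t, _ = self.text_attn(aspect, text_feats, text_feats)    # aspect queries the text
        v, _ = self.image_attn(aspect, image_feats, image_feats) # aspect queries the image regions
        return self.classifier(torch.cat([t.squeeze(1), v.squeeze(1)], dim=-1))

model = AspectAttention()
logits = model(torch.randn(2, 1, 128), torch.randn(2, 20, 128), torch.randn(2, 49, 128))
print(logits.shape)  # torch.Size([2, 3])
```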



Physiological Signal

Index Model Paper Year Project Dataset Method
PS-1 Emotion Recognition From Multimodal Physiological Signals Using a Regularized Deep Fusion of Kernel Machine 2020 TC EEG + Other physiological signals
PS-2 MMResLSTM Emotion Recognition using Multimodal Residual LSTM Network MM 2019 DEAP EEG + PPS(EOG+EMG), Multimodal residual LSTM
PS-3 EmotionMeter: A Multimodal Framework for Recognizing Human Emotions 2019 TC EEG + Eye movements
PS-4 Personality-Aware Personalized Emotion Recognition from Physiological Signals IJCAI 2018 ASCERTAIN Personality+Physiological signals, Personalized emotion recognition
PS-5 Combining Facial Expression and Touch for Perceiving Emotional Valence 2018 TAC KDEF Face + Touch stimuli
PS-6 Multi-modality weakly labeled sentiment learning based on Explicit Emotion Signal for Chinese microblog 2018 Neurocomputing Visual + Text
PS-7 Analysis of EEG Signals and Facial Expressions for Continuous Emotion Detection 2016 TAC Face + EEG signals
PS-8 Multi-modal emotion analysis from facial expressions and electroencephalogram 2016 Computer Vision and Image Understanding Face + EEG
PS-9 Combining Eye Movements and EEG to Enhance Emotion Recognition IJCAI 2015 [dataset] Eye Movements+EEG signal
PS-10 Multimodal Emotion Recognition in Response to Videos 2012 TAC Eye gaze + EEG signals
PS-11 HetEmotionNet HetEmotionNet: Two-Stream Heterogeneous Graph Recurrent Neural Network for Multi-modal Emotion Recognition MM 2021 [coding] DEAP, MAHNOB-HCI EEG+PPS, Two-stream heterogeneous graph recurrent neural network with graph transformer network (GTN)
PS-12 Simplifying Multimodal Emotion Recognition with Single Eye Movement Modality MM 2021 SEED, SEED-IV, SEED-V EEG+Eye movement signal, A generative adversarial network-based framework
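
For the physiological entries, a common recipe is to run each signal group through its own recurrent stack and fuse late; PS-2 (MMResLSTM) additionally shares LSTM weights across modalities with residual connections. The sketch below only keeps per-modality residual LSTMs over EEG and peripheral physiological signals (PPS) with a late concatenation; the channel counts, sequence length, and 2-class output are placeholders, not the original configuration.

```python
# Minimal sketch of residual-LSTM streams for EEG + PPS with late fusion (hypothetical sizes).
import torch
import torch.nn as nn

class ResLSTMStream(nn.Module):
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, x):                 # x: (batch, time, in_dim)
        h = self.proj(x)
        o1, _ = self.lstm1(h)
        o2, _ = self.lstm2(o1)
        out = o1 + o2                     # residual connection between LSTM layers
        return out[:, -1]                 # last time step as the stream summary

class EEGPPSClassifier(nn.Module):
    def __init__(self, eeg_channels=32, pps_channels=8, num_classes=2):
        super().__init__()
        self.eeg = ResLSTMStream(eeg_channels)
        self.pps = ResLSTMStream(pps_channels)
        self.head = nn.Linear(2 * 64, num_classes)

    def forward(self, eeg, pps):
        return self.head(torch.cat([self.eeg(eeg), self.pps(pps)], dim=-1))

model = EEGPPSClassifier()
logits = model(torch.randn(4, 128, 32), torch.randn(4, 128, 8))  # 128 time steps per trial
print(logits.shape)  # torch.Size([4, 2])
```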