Multimodal Human Emotion Recognition

Emotion recognition models are increasingly used in intelligent systems to improve human-computer interaction: the system can adapt its responses and behaviour to the user's emotional state, making the interaction more natural. Human emotion recognition falls into two broad categories: single-modality (unimodal) and multimodal emotion recognition. Humans express emotions in many ways, including facial expressions, voice, gestures, body movements, and posture, which is why single-modality recognition often fails in close-to-real-world environments and the problem remains challenging. This project explores deep learning techniques for machine understanding of human affective behaviour, aiming to improve the performance of both unimodal and multimodal emotion recognition models. We study deep networks such as convolutional neural networks (CNNs) for this task, and we explore fusion techniques such as feature-level fusion to capture the latent correlations and complementary information between modalities, thereby improving overall recognition accuracy. The project also addresses the lack of sufficiently large annotated emotion datasets, since deep learning models easily overfit on small amounts of emotion data and generalize poorly under mismatched conditions. The accuracies achieved are 65% for facial emotion recognition (FER), 73% for speech emotion recognition (SER), and 82% for multimodal emotion recognition (MER).
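To make the feature-level fusion idea concrete, below is a minimal sketch of a two-branch model that concatenates per-modality embeddings before classification. The framework choice (tf.keras), layer sizes, input shapes (48x48 face crops, 40-dimensional MFCC vectors), and the 7-class output are illustrative assumptions, not the project's exact architecture.

```python
# Sketch of feature-level fusion: face CNN branch + audio MFCC branch,
# fused by concatenation before a shared classifier.
from tensorflow.keras import layers, Model

NUM_CLASSES = 7  # assumed number of emotion categories (e.g. FER-2013 labels)

# Visual branch: small CNN over 48x48 grayscale face crops (FER-2013 format).
face_in = layers.Input(shape=(48, 48, 1), name="face")
x = layers.Conv2D(32, 3, activation="relu", padding="same")(face_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D()(x)
face_feat = layers.Dense(128, activation="relu")(layers.Flatten()(x))

# Audio branch: dense layers over a fixed-length MFCC feature vector.
audio_in = layers.Input(shape=(40,), name="mfcc")
a = layers.Dense(128, activation="relu")(audio_in)
audio_feat = layers.Dense(128, activation="relu")(a)

# Feature-level fusion: concatenate the modality embeddings so the classifier
# can exploit correlated and complementary face/voice cues.
fused = layers.Concatenate()([face_feat, audio_feat])
fused = layers.Dense(128, activation="relu")(fused)
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = Model(inputs=[face_in, audio_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Fusing at the feature level (rather than averaging per-modality predictions) lets the shared classifier learn interactions between modalities, which is the property the paragraph above attributes the MER gain to.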

Datasets

Audio data: Livingstone SR, Russo FA (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391.
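A hedged sketch of how fixed-length MFCC features could be extracted from a RAVDESS clip with librosa; the file path, sampling rate, and 40-coefficient choice are illustrative assumptions rather than the project's actual settings.

```python
# Extract a fixed-length MFCC feature vector from one RAVDESS speech clip.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    """Load a clip and return the time-averaged MFCCs as a (n_mfcc,) vector."""
    signal, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)

# Hypothetical path following the RAVDESS Actor_xx/filename layout.
features = extract_mfcc("RAVDESS/Actor_01/03-01-05-01-01-01-01.wav")
```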

Visual data: Facial Expression Recognition 2013 (FER-2013) dataset. Available online: https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data
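For completeness, a minimal sketch of reading the FER-2013 CSV (fer2013.csv from the Kaggle challenge, with emotion/pixels/Usage columns) into 48x48 image arrays; the local path is an assumption.

```python
# Load FER-2013 face images and labels from the Kaggle CSV release.
import numpy as np
import pandas as pd

df = pd.read_csv("data/fer2013.csv")  # columns: emotion, pixels, Usage

# Each 'pixels' entry is a space-separated string of 48*48 grayscale values.
pixels = df["pixels"].apply(lambda s: np.array(s.split(), dtype=np.uint8))
images = np.stack(pixels.to_list()).reshape(-1, 48, 48, 1)
labels = df["emotion"].to_numpy()          # 7 emotion classes
train_mask = df["Usage"] == "Training"     # remaining rows are the test splits
```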