Multimodal-Emotion-Recognition-Challenges

Multimodal emotion recognition code implementation for the MER23 and MuSe challenges

Multimodal Emotion Recognition in PyTorch (MER23 & MuSe-Mimic Challenges)

This is the PyTorch implementation of the methods proposed in "Building Robust Multimodal Sentiment Analysis using a Simple yet Effective Multimodal Transformer" and "Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis".

Paper Title: "Building Robust Multimodal Sentiment Analysis using a Simple yet Effective Multimodal Transformer"

Accepted by: ACM MM 2023 Grand Challenge

Paper Title: "Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis"

Accepted by: ACM MM 2023 MRAC Workshop

Method Introduction

Pipeline of our VAT method, consisting of three submodules (a video encoder, an audio encoder, and a multimodal encoder):

  1. To facilitate fusion, we introduce a video-audio contrastive loss that aligns the unimodal representations of a video-audio pair.
  2. By leveraging in-batch hard negatives obtained through contrastive similarity, a video-audio matching loss is employed to capture multimodal interactions between video and audio.
  3. To enhance the model's robustness to noisy data, we incorporate pseudo-targets generated by a momentum model, which serve as an additional form of supervision during training (a rough sketch of these losses follows this list).

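The exact loss implementations are part of the training code; purely as a rough, self-contained sketch of the ideas above, the snippet below computes a video-audio contrastive loss over in-batch similarities and mixes in soft pseudo-targets produced by momentum encoders. The function name, the `temp` and `alpha` hyperparameters, and the momentum-feature arguments are illustrative placeholders, not this repository's exact API.

```python
# Minimal sketch of a video-audio contrastive loss with momentum pseudo-targets,
# in the spirit of the pipeline described above. Names and hyperparameters
# (temp, alpha) are illustrative placeholders, not this repo's exact code.
import torch
import torch.nn.functional as F


def video_audio_contrastive_loss(video_feat, audio_feat,
                                 video_feat_m, audio_feat_m,
                                 temp=0.07, alpha=0.4):
    """video_feat / audio_feat:     projected unimodal embeddings [B, D]
    video_feat_m / audio_feat_m: embeddings from the momentum encoders [B, D]
    """
    # Normalize so dot products become cosine similarities.
    video_feat = F.normalize(video_feat, dim=-1)
    audio_feat = F.normalize(audio_feat, dim=-1)
    video_feat_m = F.normalize(video_feat_m, dim=-1)
    audio_feat_m = F.normalize(audio_feat_m, dim=-1)

    # In-batch similarity logits for both directions (video->audio, audio->video).
    sim_v2a = video_feat @ audio_feat_m.t() / temp
    sim_a2v = audio_feat @ video_feat_m.t() / temp

    # Hard targets: the true video-audio pair sits on the diagonal.
    batch_size = video_feat.size(0)
    hard_targets = torch.eye(batch_size, device=video_feat.device)

    # Pseudo-targets from the momentum model, mixed with the hard targets
    # to make training more robust to noisy pairs.
    with torch.no_grad():
        sim_v2a_m = video_feat_m @ audio_feat_m.t() / temp
        sim_a2v_m = audio_feat_m @ video_feat_m.t() / temp
        targets_v2a = alpha * F.softmax(sim_v2a_m, dim=1) + (1 - alpha) * hard_targets
        targets_a2v = alpha * F.softmax(sim_a2v_m, dim=1) + (1 - alpha) * hard_targets

    loss_v2a = -(F.log_softmax(sim_v2a, dim=1) * targets_v2a).sum(dim=1).mean()
    loss_a2v = -(F.log_softmax(sim_a2v, dim=1) * targets_a2v).sum(dim=1).mean()
    return (loss_v2a + loss_a2v) / 2
```

The same in-batch similarity matrices can then be used to sample hard negatives for the video-audio matching loss computed on top of the multimodal encoder.
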
Main Dependencies

  • Ubuntu 16.04
  • CUDA Version: 11.1
  • PyTorch 1.8.1
  • torchvision 0.9.1
  • python 3.7.6

Usage

Data Preparation

Download the original datasets: CHEAVD2.0 and MuSe-Mimic.

Pre-processing

For the CHEAVD2.0 and MuSe-Mimic datasets, we provide code in the data/ directory to pre-process videos into RGB frames and audio WAV files.
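
The scripts in data/ are the reference; purely as an illustration of the kind of extraction involved, a minimal sketch using OpenCV for frames and ffmpeg for audio might look as follows. The paths, frame stride, and 16 kHz mono sample rate are placeholder choices, not necessarily those used by the provided scripts.

```python
# Illustrative sketch of video pre-processing: dump RGB frames with OpenCV
# and extract a mono WAV track with ffmpeg. Paths, frame stride, and sample
# rate are placeholders, not necessarily what the data/ scripts use.
import os
import subprocess

import cv2


def extract_frames_and_audio(video_path, out_dir, frame_stride=5, sample_rate=16000):
    os.makedirs(os.path.join(out_dir, "frames"), exist_ok=True)

    # 1) Save every `frame_stride`-th frame as a JPEG.
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_stride == 0:
            cv2.imwrite(os.path.join(out_dir, "frames", f"{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()

    # 2) Extract the audio track as a mono WAV file at the chosen sample rate.
    wav_path = os.path.join(out_dir, "audio.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn", "-ac", "1", "-ar", str(sample_rate), wav_path],
        check=True,
    )
    return saved, wav_path
```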

Training and Evaluation

COMING SOON.

Acknowledgement

This research was supported by SenseTime Research.

License

This project is released under the GNU General Public License v3.0.