This is the PyTorch implementation of the methods proposed in "Building Robust Multimodal Sentiment Analysis using a Simple yet Effective Multimodal Transformer" and "Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis".
Paper Title: "Building Robust Multimodal Sentiment Analysis using a Simple yet Effective Multimodal Transformer"
Accepted by: ACM MM 2023 Grand Challenge
Paper Title: "Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis"
Accepted by: ACM MM 2023 MRAC Workshop
Pipeline of our VAT method, which consists of three submodules (a video encoder, an audio encoder, and a multimodal encoder):
- To facilitate fusion, we introduce a video-audio contrastive loss that aligns the unimodal representations of a video-audio pair (see the contrastive-loss sketch after this list).
- By leveraging in-batch hard negatives mined through contrastive similarity, a video-audio matching loss is employed to capture multimodal interactions between the video and audio modalities (see the hard-negative sketch below).
- To enhance the model's robustness to noisy data, we incorporate pseudo-targets generated by the momentum model, which serve as an additional form of supervision during training (see the momentum-distillation sketch below).
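
Below is a minimal sketch of the video-audio contrastive objective. It is not the repository's actual code: the function name, embedding shapes, and the temperature value are assumptions, and the real implementation may additionally draw negatives from momentum-queue features.

```python
import torch
import torch.nn.functional as F

def video_audio_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """InfoNCE-style loss that pulls matched video-audio pairs together.

    video_emb, audio_emb: (batch, dim) unimodal embeddings. The argument
    names and the temperature value are illustrative assumptions.
    """
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: video-to-audio and audio-to-video retrieval.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2a + loss_a2v) / 2
```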
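The matching loss reuses the contrastive similarity matrix to mine hard negatives inside the batch. The sketch below models this on common practice (multinomial sampling over softmaxed similarities); `match_head`, `joint_pos`, and `joint_neg` are hypothetical placeholders for the multimodal encoder's match classifier and fused pair features.

```python
import torch
import torch.nn.functional as F

def sample_hard_negatives(sim_matrix):
    """Pick, for each video, a highly similar non-matching audio in the batch.

    sim_matrix: (batch, batch) video-to-audio similarities from the
    contrastive step; assumes batch size > 1. The sampling scheme here is
    an assumption; the repository may mine negatives differently.
    """
    weights = F.softmax(sim_matrix, dim=1)
    weights.fill_diagonal_(0)  # exclude the true (positive) pair
    return torch.multinomial(weights, 1).squeeze(1)

def video_audio_matching_loss(match_head, joint_pos, joint_neg):
    """Binary classification over matched vs. hard-negative pairs.

    match_head: a 2-way classifier over the multimodal encoder's output;
    joint_pos / joint_neg: fused features of positive and negative pairs.
    All three names are hypothetical placeholders.
    """
    logits = match_head(torch.cat([joint_pos, joint_neg], dim=0))
    labels = torch.cat([torch.ones(joint_pos.size(0)),
                        torch.zeros(joint_neg.size(0))]).long().to(logits.device)
    return F.cross_entropy(logits, labels)
```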
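For the pseudo-target supervision, a momentum copy of the model is maintained by an exponential moving average of the parameters, and its softened predictions are mixed into the contrastive targets. The coefficients `m=0.995` and `alpha=0.4` below are assumed values, not the papers' reported settings.

```python
import copy
import torch
import torch.nn.functional as F

def make_momentum_copy(model):
    """Create a frozen momentum copy of the model at initialization."""
    momentum_model = copy.deepcopy(model)
    for p in momentum_model.parameters():
        p.requires_grad_(False)
    return momentum_model

@torch.no_grad()
def ema_update(model, momentum_model, m=0.995):
    """Exponential moving average update for the momentum model."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def distillation_loss(logits, momentum_logits, alpha=0.4):
    """Blend hard contrastive targets with the momentum model's soft
    pseudo-targets; alpha is an assumed mixing weight."""
    targets = torch.arange(logits.size(0), device=logits.device)
    hard = F.cross_entropy(logits, targets)
    soft = -(F.softmax(momentum_logits, dim=1)
             * F.log_softmax(logits, dim=1)).sum(1).mean()
    return (1 - alpha) * hard + alpha * soft
```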
Requirements:
- Ubuntu 16.04
- CUDA Version: 11.1
- PyTorch 1.8.1
- torchvision 0.9.1
- python 3.7.6
Download the original datasets: CHEAVD 2.0 and MuSe-Mimic.
For the CHEAVD 2.0 and MuSe-Mimic datasets, we provide code in the data/ directory to pre-process videos into RGB frames and audio wav files (a rough stand-in sketch is given below).
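
As an illustration of what this pre-processing step does, the sketch below extracts frames and a mono wav track from a single video with ffmpeg; the frame rate, sample rate, and output layout are assumptions and will differ from the provided scripts.

```python
import subprocess
from pathlib import Path

def preprocess(video_path, out_dir, fps=8, sample_rate=16000):
    """Extract RGB frames and a mono wav track from one video via ffmpeg.

    A hedged stand-in for the scripts under data/; fps=8 and a 16 kHz
    sample rate are assumed values, not the papers' settings.
    """
    out = Path(out_dir)
    (out / "frames").mkdir(parents=True, exist_ok=True)
    # RGB frames sampled at a fixed rate.
    subprocess.run(["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}",
                    str(out / "frames" / "%05d.jpg")], check=True)
    # Mono wav resampled for the audio encoder.
    subprocess.run(["ffmpeg", "-i", str(video_path), "-vn",
                    "-ac", "1", "-ar", str(sample_rate),
                    str(out / "audio.wav")], check=True)
```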
COMING SOON.
This research was supported by SenseTime Research.
This project is released under the GNU General Public License v3.0.