This is the PyTorch implementation of the methods proposed in "Building Robust Multimodal Sentiment Analysis using a Simple yet Effective Multimodal Transformer" and "Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis".
Paper Title: "Building Robust Multimodal Sentiment Analysis using a Simple yet Effective Multimodal Transformer"
Accepted by: ACM MM 2023 Grand Challenge
Paper Title: "Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis"
Accepted by: ACM MM 2023 MRAC Workshop
Pipeline of our VAT method, which consists of three submodules (a video encoder, an audio encoder, and a multimodal encoder):
- To facilitate fusion, we introduce a video-audio contrastive loss that aligns the unimodal representations of a video-audio pair (see the contrastive-loss sketch after this list).
- By leveraging in-batch hard negatives mined through contrastive similarity, a video-audio matching loss is employed to capture multimodal interactions between the video and audio modalities (see the hard-negative sketch below).
- To enhance the model's robustness to noisy data, we incorporate pseudo-targets generated by the momentum model, which serve as an additional form of supervision during training (see the momentum-distillation sketch below).
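
Below is a minimal sketch of the video-audio contrastive objective. It is not the repository's actual code: the function name, embedding shapes, and the temperature value are assumptions, and the real implementation may additionally draw negatives from momentum-queue features.

```python
import torch
import torch.nn.functional as F

def video_audio_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """InfoNCE-style loss that pulls matched video-audio pairs together.

    video_emb, audio_emb: (batch, dim) unimodal embeddings. The argument
    names and the temperature value are illustrative assumptions.
    """
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: video-to-audio and audio-to-video retrieval.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2a + loss_a2v) / 2
```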
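The matching loss reuses the contrastive similarity matrix to mine hard negatives inside the batch. The sketch below models this on common practice (multinomial sampling over softmaxed similarities); `match_head`, `joint_pos`, and `joint_neg` are hypothetical placeholders for the multimodal encoder's match classifier and fused pair features.

```python
import torch
import torch.nn.functional as F

def sample_hard_negatives(sim_matrix):
    """Pick, for each video, a highly similar non-matching audio in the batch.

    sim_matrix: (batch, batch) video-to-audio similarities from the
    contrastive step; assumes batch size > 1. The sampling scheme here is
    an assumption; the repository may mine negatives differently.
    """
    weights = F.softmax(sim_matrix, dim=1)
    weights.fill_diagonal_(0)  # exclude the true (positive) pair
    return torch.multinomial(weights, 1).squeeze(1)

def video_audio_matching_loss(match_head, joint_pos, joint_neg):
    """Binary classification over matched vs. hard-negative pairs.

    match_head: a 2-way classifier over the multimodal encoder's output;
    joint_pos / joint_neg: fused features of positive and negative pairs.
    All three names are hypothetical placeholders.
    """
    logits = match_head(torch.cat([joint_pos, joint_neg], dim=0))
    labels = torch.cat([torch.ones(joint_pos.size(0)),
                        torch.zeros(joint_neg.size(0))]).long().to(logits.device)
    return F.cross_entropy(logits, labels)
```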
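For the pseudo-target supervision, a momentum copy of the model is maintained by an exponential moving average of the parameters, and its softened predictions are mixed into the contrastive targets. The coefficients `m=0.995` and `alpha=0.4` below are assumed values, not the papers' reported settings.

```python
import copy
import torch
import torch.nn.functional as F

def make_momentum_copy(model):
    """Create a frozen momentum copy of the model at initialization."""
    momentum_model = copy.deepcopy(model)
    for p in momentum_model.parameters():
        p.requires_grad_(False)
    return momentum_model

@torch.no_grad()
def ema_update(model, momentum_model, m=0.995):
    """Exponential moving average update for the momentum model."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

def distillation_loss(logits, momentum_logits, alpha=0.4):
    """Blend hard contrastive targets with the momentum model's soft
    pseudo-targets; alpha is an assumed mixing weight."""
    targets = torch.arange(logits.size(0), device=logits.device)
    hard = F.cross_entropy(logits, targets)
    soft = -(F.softmax(momentum_logits, dim=1)
             * F.log_softmax(logits, dim=1)).sum(1).mean()
    return (1 - alpha) * hard + alpha * soft
```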
Requirements:
- Ubuntu 16.04
- CUDA Version: 11.1
- PyTorch 1.8.1
- torchvision 0.9.1
- python 3.7.6
Download the original datasets: CHEAVD 2.0 and MuSe-Mimic.
For the CHEAVD 2.0 and MuSe-Mimic datasets, we provide code in the data/ directory to pre-process videos into RGB frames and audio wav files (a rough stand-in sketch is given below).
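
As an illustration of what this pre-processing step does, the sketch below extracts frames and a mono wav track from a single video with ffmpeg; the frame rate, sample rate, and output layout are assumptions and will differ from the provided scripts.

```python
import subprocess
from pathlib import Path

def preprocess(video_path, out_dir, fps=8, sample_rate=16000):
    """Extract RGB frames and a mono wav track from one video via ffmpeg.

    A hedged stand-in for the scripts under data/; fps=8 and a 16 kHz
    sample rate are assumed values, not the papers' settings.
    """
    out = Path(out_dir)
    (out / "frames").mkdir(parents=True, exist_ok=True)
    # RGB frames sampled at a fixed rate.
    subprocess.run(["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}",
                    str(out / "frames" / "%05d.jpg")], check=True)
    # Mono wav resampled for the audio encoder.
    subprocess.run(["ffmpeg", "-i", str(video_path), "-vn",
                    "-ac", "1", "-ar", str(sample_rate),
                    str(out / "audio.wav")], check=True)
```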
COMING SOON.
This research was supported by SenseTime Research.
This project is released under the GNU General Public License v3.0.