Multimodal Transformer With Learnable Frontend and Self Attention for Emotion Recognition

This repository contains the code for detecting emotion in the conversational IEMOCAP dataset, implementing the paper "Multimodal Transformer With Learnable Frontend and Self Attention for Emotion Recognition" submitted to ICASSP 2022. The code here uses Session 5 as the test set and Session 1 as the validation set.

Description of the code

  • The implementation has three stages: training the unimodal audio and text models, training the Bi-GRU models with self-attention, and training the multimodal transformer
  • With the wav files for audio and the csv files for text, the first step is to run audio_model.py for audio and the notebook sentiment_text.ipynb for text
  • The representations from the models trained in the step above are used to create pickle files for the entire dataset
  • With these representations, two Bi-GRU models with self-attention (refer to bigru_audio/text.ipynb) are trained; a minimal sketch of such a model is given after this list. The best models for both audio and text are already provided in the unimodal_models folder.
  • A multimodal transformer is then trained on both modalities to obtain the final accuracy results
  • Please note that use of IEMOCAP requires permission. Once permission is obtained, we can share the dataset files. To request access, please visit the IEMOCAP release page
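
The following is a minimal sketch (not the authors' exact architecture) of a Bi-GRU classifier with additive self-attention pooling. The feature dimension, hidden size, and four-class output are illustrative assumptions, and the random inputs stand in for the pickled unimodal representations.

```python
import torch
import torch.nn as nn


class BiGRUSelfAttention(nn.Module):
    """Bi-directional GRU followed by additive self-attention pooling (illustrative)."""

    def __init__(self, feat_dim=256, hidden_dim=128, num_classes=4):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        # Scores each time step; softmax turns the scores into attention weights.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):                        # x: (batch, seq_len, feat_dim)
        out, _ = self.gru(x)                     # (batch, seq_len, 2*hidden_dim)
        weights = torch.softmax(self.attn(out), dim=1)
        context = (weights * out).sum(dim=1)     # attention-weighted sum over time
        return self.classifier(context)          # (batch, num_classes)


if __name__ == "__main__":
    model = BiGRUSelfAttention()
    dummy = torch.randn(8, 20, 256)              # 8 sequences of 20 feature vectors
    print(model(dummy).shape)                    # torch.Size([8, 4])
```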

Running the code

  • Clone the repository https://github.com/iiscleap/multimodal_emotion_recognition.git
  • For the LEAF-CNN framework for audio sentiment classification, we use the PyTorch implementation of LEAF cloned below; a rough sketch of how the frontend might be used is given after the directory layout.
    • Run python3 -m venv .leaf_venv
    • Run source .leaf_venv/bin/activate
    • Run pip install -r requirements_leaf.txt
    • Clone the repository https://github.com/denfed/leaf-audio-pytorch.git
    • The files should be arranged as follows:
      Sess5
      └───leaf_wavs_train
        └───Ses01F_impro01_F000.wav
        └───Ses01F_impro01_F005.wav
        └───...
      └───leaf_wavs_test
        └───Ses05F_impro01_F000.wav
        └───Ses05F_impro01_F008.wav
        └───...
      label_dict.json
      leaf-audio-pytorch-main
      │
      └───__init__.py
      └───setup.py
      └───...
      └───audio_model.py
      └───leaf_audio_pytorch
        └───...
      
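Below is a rough sketch of how a learnable LEAF frontend might feed a small CNN classifier, in the spirit of audio_model.py. The import path leaf_audio_pytorch.frontend.Leaf, the waveform shape, and the CNN head are assumptions made for illustration; check the port's README for the exact interface.

```python
import torch
import torch.nn as nn
from leaf_audio_pytorch import frontend   # assumed import path of the PyTorch port


class LeafCNN(nn.Module):
    """Learnable LEAF frontend followed by a small CNN classifier (illustrative)."""

    def __init__(self, num_classes=4):
        super().__init__()
        self.leaf = frontend.Leaf()                  # assumed constructor of the port
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, num_classes),
        )

    def forward(self, wav):
        # The frontend is assumed to map a raw waveform batch to a
        # (batch, filters, frames) time-frequency representation.
        feats = self.leaf(wav)
        return self.cnn(feats.unsqueeze(1))          # add a channel axis for Conv2d
```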
  • Running sentiment_text.ipynb provides the text unimodal model
  • For running the two Bi-GRU models with self-attention, run bigru_audio.ipynb to get best_model_aud0.tar and bigru_text.ipynb to get best_model_text0.tar. These are to be placed in the folder unimodal_models.
  • To run the multimodal transformer, we create another environment; a simplified sketch of the fusion model is given after the file layout below
    • Run python3 -m venv .trans_venv
    • Run source .trans_venv/bin/activate
    • Run pip install -r requirements.txt
    • With the config file provided, run python3 main.py
    • The files at this stage should be arranged as follows:
      main.py
      config.py
      features
      └───<PICKLE_FILE>
      src
      └───model_lstm_tranformers.py
      └───read_data.py
      └───test_lstm_transformers.py
      └───...
      unimodal_models
      └───best_model_aud0.tar
      └───best_model_text0.tar
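
As a rough illustration of this final stage, the sketch below fuses audio and text representation sequences with a standard transformer encoder. The projection sizes, number of layers, and concatenation-along-time fusion are illustrative assumptions, not the settings provided in config.py.

```python
import torch
import torch.nn as nn


class MultimodalTransformer(nn.Module):
    """Transformer encoder over concatenated audio and text tokens (illustrative)."""

    def __init__(self, audio_dim=256, text_dim=256, d_model=256,
                 nhead=4, num_layers=2, num_classes=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, audio_seq, text_seq):
        # Project both modalities to a shared dimension and concatenate them
        # along the time axis so the encoder can attend across modalities.
        tokens = torch.cat([self.audio_proj(audio_seq),
                            self.text_proj(text_seq)], dim=1)
        encoded = self.encoder(tokens)               # (batch, seq_a + seq_t, d_model)
        return self.classifier(encoded.mean(dim=1))  # (batch, num_classes)


if __name__ == "__main__":
    model = MultimodalTransformer()
    aud = torch.randn(4, 20, 256)
    txt = torch.randn(4, 20, 256)
    print(model(aud, txt).shape)                     # torch.Size([4, 4])
```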