Jointly Fine-Tuning “BERT-like” Self Supervised Models to Improve Multimodal Speech Emotion Recognition
This repository contains the PyTorch code for Multimodal Emotion Recognition with pretrained RoBERTa and Speech-BERT.
- This code structure is built on top of the Fairseq interface.
- Fairseq is an open-source project by the Facebook AI team that brings together different SOTA architectures for sequential data processing.
- It also provides SOTA optimization mechanisms such as early stopping, warmup learning rates, and learning rate schedulers.
- We developed our own architecture to be compatible with the Fairseq interface.
- For more background, please read the paper published about the Fairseq interface.
This can be a bit tricky in the beginning. First, it is important to understand that Fairseq is built in such a way that all architectures can be accessed through terminal commands (args).
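A minimal sketch of that pattern (the model and flag names below are illustrative, not the ones used in this repository): a component registers itself under a name, and whatever it adds in `add_args()` becomes a regular terminal flag.

```python
# Illustrative sketch only: how a Fairseq component exposes itself to train.py.
from fairseq.models import BaseFairseqModel, register_model, register_model_architecture


@register_model("toy_emotion_model")  # hypothetical model name
class ToyEmotionModel(BaseFairseqModel):
    @staticmethod
    def add_args(parser):
        # every argument added here becomes a terminal flag, e.g. --pooler-dropout 0.1
        parser.add_argument("--pooler-dropout", type=float)

    @classmethod
    def build_model(cls, args, task):
        return cls()


@register_model_architecture("toy_emotion_model", "toy_emotion_base")
def toy_emotion_base(args):
    # the architecture name registered here is what the --arch flag selects;
    # this hook fills in defaults for any flags the user did not pass
    args.pooler_dropout = getattr(args, "pooler_dropout", 0.0)
```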
Since our architecture shares many properties with the Transformer architecture, we followed the tutorial that describes how to use RoBERTa for a custom classification task.
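A condensed version of that tutorial's pattern, using Fairseq's RoBERTa hub interface (the checkpoint path and head name here are placeholders):

```python
import torch
from fairseq.models.roberta import RobertaModel

# Load a pretrained RoBERTa checkpoint and attach a classification head to it.
roberta = RobertaModel.from_pretrained(
    "pretrained_ssl/roberta.large",   # assumed local checkpoint folder
    checkpoint_file="model.pt",
)
roberta.register_classification_head("emotion", num_classes=8)
roberta.eval()

tokens = roberta.encode("I am so happy to see you!")  # BPE + dictionary encoding
with torch.no_grad():
    log_probs = roberta.predict("emotion", tokens)     # shape: (1, num_classes)
print(log_probs.argmax(dim=-1))
```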
We built our architecture by adding new components to the following directories in the Fairseq interface:
- fairseq/data
- fairseq/models
- fairseq/modules
- fairseq/tasks
- fairseq/criterions
The custom dataloader for loading raw audio, face frames, and text is in fairseq/data/raw_audio_text_dataset.py.
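The sketch below shows the general shape of such a dataset (it is not the actual `raw_audio_text_dataset.py`; field names and padding are simplified assumptions):

```python
import numpy as np
import torch
from fairseq.data import FairseqDataset


class ToyRawAudioTextDataset(FairseqDataset):
    """Simplified stand-in: one raw-audio / text-token / label triple per example."""

    def __init__(self, audio_paths, text_token_ids, labels):
        self.audio_paths = audio_paths        # list of paths to saved audio features
        self.text_token_ids = text_token_ids  # list of 1-D LongTensors (RoBERTa tokens)
        self.labels = labels                  # list of emotion class ids

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        audio = torch.from_numpy(np.load(self.audio_paths[index])).float()
        return {"id": index, "audio": audio,
                "text": self.text_token_ids[index], "label": self.labels[index]}

    def collater(self, samples):
        # pad audio and text to the longest item in the mini-batch
        def pad(seqs, dtype):
            out = torch.zeros(len(seqs), max(s.size(0) for s in seqs),
                              *seqs[0].shape[1:], dtype=dtype)
            for i, s in enumerate(seqs):
                out[i, : s.size(0)] = s
            return out

        return {
            "id": torch.tensor([s["id"] for s in samples]),
            "net_input": {
                "audio": pad([s["audio"] for s in samples], torch.float),
                "text": pad([s["text"] for s in samples], torch.long),
            },
            "target": torch.tensor([s["label"] for s in samples]),
        }

    def num_tokens(self, index):
        return self.text_token_ids[index].size(0)

    def size(self, index):
        return self.text_token_ids[index].size(0)
```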
The emotion prediction task, defined in the same way as other Fairseq tasks such as translation, is in fairseq/tasks/emotion_prediction.py.
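An illustrative skeleton of how such a task plugs into Fairseq (the names and the dataset-building helper are assumptions, not copied from `emotion_prediction.py`):

```python
from fairseq.tasks import FairseqTask, register_task


@register_task("toy_emotion_prediction")  # hypothetical name; selected with --task
class ToyEmotionPredictionTask(FairseqTask):
    @staticmethod
    def add_args(parser):
        parser.add_argument("data", help="folder with filenames, sizes and labels")
        parser.add_argument("--num-classes", type=int, default=8)

    @classmethod
    def setup_task(cls, args, **kwargs):
        return cls(args)

    def load_dataset(self, split, **kwargs):
        # In the real task this builds the raw audio/text dataset for
        # 'train' / 'valid' / 'test'; build_split_dataset is a hypothetical helper.
        self.datasets[split] = build_split_dataset(self.args, split)
```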
The custom architecture of our model, built in the same style as RoBERTa and wav2vec, is in fairseq/models/mulT_emo.py.
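Conceptually, the model wraps the two pretrained SSL encoders and adds a classification head on top. The sketch below is only a rough approximation of that idea (dimensions, paths, and the mean pooling are assumptions), with registration following the same pattern shown earlier:

```python
import torch
import torch.nn as nn
from fairseq.models.roberta import RobertaModel


class ToyMulTEmo(nn.Module):
    """Rough approximation: pretrained text encoder + projected audio features + head."""

    def __init__(self, text_encoder, audio_dim=512, text_dim=1024, num_classes=8):
        super().__init__()
        self.text_encoder = text_encoder                  # pretrained RoBERTa, fine-tuned jointly
        self.audio_proj = nn.Linear(audio_dim, text_dim)  # project speech features to text width
        self.classifier = nn.Linear(2 * text_dim, num_classes)

    def forward(self, text_tokens, audio_feats):
        # text_feats: (B, T_text, C) from RoBERTa; audio_feats: (B, T_audio, audio_dim)
        text_feats, _ = self.text_encoder(text_tokens, features_only=True)
        audio_feats = self.audio_proj(audio_feats)
        pooled = torch.cat([text_feats.mean(dim=1), audio_feats.mean(dim=1)], dim=-1)
        return self.classifier(pooled)


# the underlying RobertaModel sits behind the hub interface's .model attribute
hub = RobertaModel.from_pretrained("pretrained_ssl/roberta.large", checkpoint_file="model.pt")
model = ToyMulTEmo(hub.model)
```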
Cross-attention was implemented by modifying the self-attention scripts in the original Fairseq repository. They can be found in fairseq/modules/transformer_multi_encoder.py and fairseq/modules/transformer_layer.py.
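The core idea of cross-attention is that queries come from one modality while keys and values come from the other. A minimal stand-alone sketch (using `torch.nn.MultiheadAttention` rather than the modified Fairseq modules):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 1024, 2
cross_attn = nn.MultiheadAttention(embed_dim, num_heads)

text_feats = torch.randn(50, 1, embed_dim)    # (T_text, batch, C)
audio_feats = torch.randn(120, 1, embed_dim)  # (T_audio, batch, C)

# text attends over audio: query = text, key = value = audio
fused, attn_weights = cross_attn(query=text_feats, key=audio_feats, value=audio_feats)
print(fused.shape)  # torch.Size([50, 1, 1024]): one audio-informed vector per text position
```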
Finally, the custom loss function and evaluation scripts can be found in fairseq/criterions/emotion_prediction_cri.py.
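An illustrative criterion registration (not the actual `emotion_prediction_cri.py`; the loss and logging keys here are simplified assumptions):

```python
import torch.nn.functional as F
from fairseq.criterions import FairseqCriterion, register_criterion


@register_criterion("toy_emotion_cri")  # hypothetical name; selected with --criterion
class ToyEmotionCriterion(FairseqCriterion):
    def forward(self, model, sample, reduce=True):
        logits = model(**sample["net_input"])
        targets = sample["target"]
        loss = F.cross_entropy(logits, targets, reduction="sum" if reduce else "none")
        sample_size = targets.numel()
        logging_output = {
            "loss": loss.data,
            "ntokens": sample_size,
            "nsentences": sample_size,
            "sample_size": sample_size,
        }
        return loss, sample_size, logging_output
```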
Please use the following links to download the pretrained SSL models and save them in a separate folder named pretrained_ssl.
- For speech features - VQ-wav2vec
- For sentence (text) features - RoBERTa
- For text data, we first tokenize each example with the RoBERTa tokenizer and save it into a separate text file (see the text-tokenization sketch after this list).
- To preprocess the speech data, please refer to the script convert_aud_to_token.py (a sketch of the idea also follows this list).
- The preprocessed datasets and their labels can be found in this Google Drive.
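Text-tokenization sketch (paths and file layout are assumptions): encode each utterance with the pretrained RoBERTa BPE/dictionary and write one token sequence per file.

```python
from pathlib import Path
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained("pretrained_ssl/roberta.large", checkpoint_file="model.pt")

out_dir = Path("iemocap_data/text_tokens")   # assumed output folder
out_dir.mkdir(parents=True, exist_ok=True)

utterances = ["I am so happy to see you", "Leave me alone"]  # toy examples
for i, utt in enumerate(utterances):
    tokens = roberta.encode(utt)  # LongTensor of RoBERTa token ids, including <s> and </s>
    (out_dir / f"{i}.txt").write_text(" ".join(map(str, tokens.tolist())))
```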
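Speech-tokenization sketch of the idea behind `convert_aud_to_token.py` (the checkpoint path is an assumption): run VQ-wav2vec and keep the discrete codebook indices it produces for each frame.

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

cp = torch.load("pretrained_ssl/vq-wav2vec.pt")  # downloaded checkpoint
model = Wav2VecModel.build_model(cp["args"], task=None)
model.load_state_dict(cp["model"])
model.eval()

wav = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
with torch.no_grad():
    z = model.feature_extractor(wav)
    _, idxs = model.vector_quantizer.forward_idx(z)
print(idxs.shape)  # (1, T, 2): two codebook group indices per time step
```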
We followed the Fairseq terminal commands to train and validate our models.
- --data - folder that contains filenames, sizes and labels of your raw data (please refer to the T_data folder).
- --data-raw - Path of your raw data folder that contains tokenized speech and text.
- --binary-target-iemocap - train the model with IEMOCAP data for binary accuracy.
- --regression-target-mos - train the model with CMU-MOSEI/CMU-MOSI data for sentiment scores.
- For dataset-specific training commands, please refer to emotion_prediction.py.
CUDA_VISIBLE_DEVICES=8,7 python train.py --data ./T_data/iemocap --restore-file None --task emotion_prediction --reset-optimizer --reset-dataloader --reset-meters --init-token 0 --separator-token 2 --arch robertEMO_large --criterion emotion_prediction_cri --num-classes 8 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 1e-05 --total-num-update 2760 --warmup-updates 165 --max-epoch 10 --best-checkpoint-metric loss --encoder-attention-heads 2 --batch-size 1 --encoder-layers-cross 1 --no-epoch-checkpoints --update-freq 8 --find-unused-parameters --ddp-backend=no_c10d --binary-target-iemocap --a-only --t-only --pooler-dropout 0.1 --log-interval 1 --data-raw ./iemocap_data/
CUDA_VISIBLE_DEVICES=1 python validate.py --data ./T_data/iemocap --path './checkpoints/checkpoint_best.pt' --task emotion_prediction --valid-subset test --batch-size 4
If you want to pre-process the data again, please refer to this repository.