Multimodal Speech Emotion Recognition using Audio and Text, IEEE SLT-18, [paper]
tensorflow==1.4 (tested on cuda-8.0, cudnn-6.0)
python==2.7
scikit-learn==0.20.0
nltk==3.3
- IEMOCAP [link] [paper]
- Download the IEMOCAP data from its original webpage (a license agreement is required).
- For preprocessing, refer to the code in "./preprocessing".
- If you want to download the "preprocessed corpus" from us directly, please send us an email after obtaining the license from the IEMOCAP team.
- We cannot publish the ASR-processed transcriptions due to license issues (commercial API); however, it should be reasonably easy to extract ASR transcripts from the audio signal yourself (we used the Google Cloud Speech API; see the sketch after the example list below).
- Examples (a loading sketch follows this list)
MFCC : MFCC features of the audio signal (ex. train_audio_mfcc.npy)
MFCC-SEQN : valid length of the audio signal sequence (ex. train_seqN.npy)
PROSODY : prosody features of the audio signal (ex. train_audio_prosody.npy)
LABEL : target label of the audio signal (ex. train_label.npy)
TRANS : indexed transcription sequences of the data (ex. train_nlp_trans.npy)
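A minimal loading sketch for the preprocessed files listed above; the file names follow the examples, while the "./data/" location and variable names are assumptions for illustration only:

```python
import numpy as np

data_dir = "./data/"  # hypothetical location of the preprocessed corpus

train_mfcc    = np.load(data_dir + "train_audio_mfcc.npy")     # MFCC features of the audio signal
train_seq_len = np.load(data_dir + "train_seqN.npy")           # valid lengths of the audio sequences
train_prosody = np.load(data_dir + "train_audio_prosody.npy")  # prosody features of the audio signal
train_label   = np.load(data_dir + "train_label.npy")          # target emotion labels
train_trans   = np.load(data_dir + "train_nlp_trans.npy")      # indexed transcription sequences
```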
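For the ASR transcripts mentioned above, here is a hedged sketch using the current google-cloud-speech Python client; it is an illustration only (our original pipeline is not published) and assumes 16 kHz mono WAV input and valid Google Cloud credentials:

```python
import io
from google.cloud import speech

def transcribe_wav(wav_path):
    """Transcribe a single WAV file with Google Cloud Speech-to-Text."""
    client = speech.SpeechClient()
    with io.open(wav_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,   # assumed 16 kHz mono audio; check your data
        language_code="en-US",
    )
    response = client.recognize(config=config, audio=audio)
    # Concatenate the top hypothesis of each recognized segment.
    return " ".join(result.alternatives[0].transcript for result in response.results)
```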
- The repository contains code for the following models (a conceptual sketch of the multimodal fusion follows this list):
Audio Recurrent Encoder (ARE)
Text Recurrent Encoder (TRE)
Multimodal Dual Recurrent Encoder (MDRE)
Multimodal Dual Recurrent Encoder with Attention (MDREA)
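As a rough, non-authoritative illustration of the MDRE idea (two recurrent encoders whose final states are concatenated before classification), here is a minimal TensorFlow 1.x sketch; the dimensions, cell type, and variable names are assumptions for illustration, not the repository's actual implementation:

```python
import tensorflow as tf

AUDIO_DIM, VOCAB_SIZE, EMBED_DIM = 39, 10000, 100  # hypothetical sizes
HIDDEN, NUM_CLASSES = 128, 4                       # 4 emotion classes as in the paper

# Inputs: audio feature sequences and indexed transcriptions, with their valid lengths.
audio = tf.placeholder(tf.float32, [None, None, AUDIO_DIM])
audio_len = tf.placeholder(tf.int32, [None])
text = tf.placeholder(tf.int32, [None, None])
text_len = tf.placeholder(tf.int32, [None])

with tf.variable_scope("audio_encoder"):   # ARE-style audio encoder
    cell_a = tf.nn.rnn_cell.GRUCell(HIDDEN)
    _, state_a = tf.nn.dynamic_rnn(cell_a, audio, sequence_length=audio_len, dtype=tf.float32)

with tf.variable_scope("text_encoder"):    # TRE-style text encoder
    embedding = tf.get_variable("embedding", [VOCAB_SIZE, EMBED_DIM])
    cell_t = tf.nn.rnn_cell.GRUCell(HIDDEN)
    _, state_t = tf.nn.dynamic_rnn(cell_t, tf.nn.embedding_lookup(embedding, text),
                                   sequence_length=text_len, dtype=tf.float32)

# MDRE-style fusion: concatenate the two encoder states and classify.
joint = tf.concat([state_a, state_t], axis=-1)
logits = tf.layers.dense(joint, NUM_CLASSES)
```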
- Refer to "reference_script.sh".
- The final result will be stored in "./TEST_run_result.txt".
- Please cite our paper when you use our code, models, or dataset:
@inproceedings{yoon2018multimodal,
  title={Multimodal Speech Emotion Recognition Using Audio and Text},
  author={Yoon, Seunghyun and Byun, Seokhyun and Jung, Kyomin},
  booktitle={2018 IEEE Spoken Language Technology Workshop (SLT)},
  pages={112--118},
  year={2018},
  organization={IEEE}
}