Speech-Emotion-Recognition-ROS

This model is trained on Korean datasets for Korean speech emotion recognition. There are not enough Korean emotion datasets available, so I made this repo.


SPEECH-EMOTION-RECOGNITION

This repo illustrates how to use a speech emotion recognition module with ROS.

We only use Korean speech audio, not text.

We need ROS, audio_common, and PyTorch.

In audio_common_msgs, we must add Audio_result.msg and command.msg.

The requirements must be installed, and ROS setup is also required. I use ros-kinetic.
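As a quick sanity check that the two custom messages were added and the workspace was rebuilt, the generated Python classes should be importable. The class names follow the .msg file names; their field layouts are not shown in this README.

# AudioData ships with audio_common_msgs; Audio_result and command are the
# custom messages this repo asks you to add before rebuilding the workspace.
from audio_common_msgs.msg import AudioData
from audio_common_msgs.msg import Audio_result, command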


Datasets

  • KESDy18

    • We use the KESDy18 Korean emotion dataset.
    • It includes 2,880 wav files, and we use only 4 emotions (0 = angry, 1 = neutral, 2 = sad, 3 = happy).
    • You can download the data files here after submitting the License Agreement.
  • AIHUB

    • We use the AIHub Korean emotion dataset.
    • It includes about 50,000 wav files with text, age, and other metadata.
    • We use only about 2,200 samples whose emotion is clearly labeled.
    • You can download the data files here.
  • CUSTOM DATA

    • I recorded this data myself.
    • 11 sentences, 2 levels, and 4 emotions, so 88 data files in total.

Feature Extraction & Model

  • Feature Extraction

    • For feature extraction we use the librosa library (a short sketch appears at the end of this section):

      1. MFCCs: cut the audio file to a 2.5-second duration and build a 32-MFCC tensor to train DenseNet121.
      2. Mel-spectrogram: convert the audio file into a spectrogram image, save it, and load the images to train DenseNet (pretrained=True).
    • Model: we use DenseNet121. We chose DenseNet because the model has to be light enough to run on CPU.

      IMAGE SAMPLE

      • It's hard to see the difference.

      [Sample mel-spectrogram images for ANGRY, SAD, NEUTRAL, and HAPPY]
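Below is a minimal sketch of the two feature-extraction paths described above, using librosa. The 2.5-second duration and 32 MFCCs follow the description above; the sample rate, file paths, and image size are assumptions, not values taken from the trainer code.

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

DURATION = 2.5        # seconds, as described above
SAMPLE_RATE = 22050   # librosa's default; the rate actually used is an assumption

def extract_mfcc(wav_path, n_mfcc=32):
    """Load a 2.5 s clip and return an (n_mfcc, time) MFCC matrix for DenseNet121 input."""
    y, sr = librosa.load(wav_path, sr=SAMPLE_RATE, duration=DURATION)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def save_mel_spectrogram(wav_path, out_png):
    """Convert a wav file into a mel-spectrogram image and save it to disk."""
    y, sr = librosa.load(wav_path, sr=SAMPLE_RATE, duration=DURATION)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    plt.figure(figsize=(2.24, 2.24))                 # image size is an assumption
    librosa.display.specshow(mel_db, sr=sr)
    plt.axis("off")
    plt.savefig(out_png, bbox_inches="tight", pad_inches=0)
    plt.close()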


Train Result

  • Result without augmentation (DenseNet121)

    | Data       | Pretrained | Feature Extraction | Accuracy / (custom data) |
    |------------|------------|--------------------|--------------------------|
    | ETRI       | False      | MFCCs              | 70% / 25%                |
    | ETRI       | False      | mel-spectrogram    | 73% / 29%                |
    | AIHUB      | True       | mel-spectrogram    | 69% / 40%                |
    | AIHUB      | False      | mel-spectrogram    | 60% / 35%                |
    | ETRI+AIHUB | True       | mel-spectrogram    | 68% / 33%                |
    | ETRI+AIHUB | False      | mel-spectrogram    | 63% / 28%                |
    • Using MFCCs on the ETRI data causes overfitting to the training data and gives poor accuracy, so we decided to use mel-spectrograms.
    • The ETRI (KESDy18) dataset is also quite artificial, so it does not fit the custom data well.
    1. Result confusion matrix (accuracy = 73%)

    result_matrix_img_etri

    2. Result confusion matrix for custom data (accuracy = 40%)

    result_matrix_img_sw_aihub_pretrained

  • Result with data augmentation

    | Data           | Pretrained | Feature Extraction | Accuracy / (custom data) |
    |----------------|------------|--------------------|--------------------------|
    | AIHUB + CUSTOM | True       | mel-spectrogram    | 86.57% / 83%             |
    1. Finally, we use data augmentation. Result confusion matrix (AIHUB + custom data with augmentation) (accuracy = 83%). A sketch of typical audio augmentations follows below.

    result_matrix_img_aihub+custom+augmentaion
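This README does not say which augmentations were applied. As a hedged illustration only, common waveform-level augmentations for this kind of data are noise injection, time stretching, and pitch shifting; the functions and parameters below are assumptions, not the repo's exact method.

import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    """Inject light Gaussian noise into the waveform."""
    return y + noise_factor * np.random.randn(len(y))

def time_stretch(y, rate=1.1):
    """Speed the clip up or down without changing pitch."""
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_shift(y, sr, n_steps=2):
    """Shift the pitch by n_steps semitones."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# Example: create three augmented copies of one training clip
# ("./data/sample.wav" is a hypothetical path).
y, sr = librosa.load("./data/sample.wav", sr=None)
augmented = [add_noise(y), time_stretch(y), pitch_shift(y, sr)]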


How to use

  • How to train (PyTorch)

    • First, clone this repo.
    • The training code was written in Jupyter notebooks (Python 3.8.12).
    • The trainer code is located in './trainer'.
    • Place wav files in './data' and preprocess them into a csv file or list.
    • Select a model from torchvision.models (DenseNet in this code) and change the classifier input size (in_features) to fit the chosen model; a fuller setup sketch follows below.
      model.classifier = nn.Linear(in_features=1024, out_features=4)
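A minimal sketch of that model setup, assuming torchvision's DenseNet121 and the 4 emotion classes listed in the Datasets section:

import torch.nn as nn
from torchvision import models

# DenseNet121 backbone; the result tables above include both pretrained and
# non-pretrained runs. (Newer torchvision versions use the `weights=` argument
# instead of `pretrained=True`.)
model = models.densenet121(pretrained=True)

# DenseNet121's classifier takes 1024 input features; replace it with a
# 4-way head for (angry, neutral, sad, happy).
model.classifier = nn.Linear(in_features=1024, out_features=4)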
      
  • How to record

    • Run record_4sec.py; the .wav file will be saved in './predict_audio'. A minimal recorder sketch follows below.
      python record_4sec.py
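record_4sec.py itself is not reproduced here. A minimal equivalent that records 4 seconds from the default microphone could look like this sketch; the sounddevice package, sample rate, and output file name are assumptions.

import sounddevice as sd
from scipy.io.wavfile import write

SAMPLE_RATE = 16000          # assumed; use whatever rate the model expects
SECONDS = 4

# Record 4 seconds of mono audio from the default input device.
audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype="int16")
sd.wait()

# Save into the directory that the prediction step reads from.
write("./predict_audio/recorded.wav", SAMPLE_RATE, audio)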
      
  • How to predict

    • Place the wav file in './predict_audio'.
      You must place only one file in this directory, or modify the code. A rough prediction sketch follows below.
      python predict_torch_img.py
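A rough sketch of what the prediction step does under the pipeline described above (mel-spectrogram image -> DenseNet121 -> 4-class label). The file names, input size, and checkpoint name are assumptions; see predict_torch_img.py for the actual code.

import glob
import torch
from PIL import Image
from torchvision import transforms

EMOTIONS = {0: "angry", 1: "neutral", 2: "sad", 3: "happy"}

# Assumes the wav in ./predict_audio has already been converted to a
# mel-spectrogram image (see the feature extraction sketch above).
image_path = glob.glob("./predict_audio/*.png")[0]   # only one file is expected

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),    # typical DenseNet input size; an assumption here
    transforms.ToTensor(),
])
x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)

model = torch.load("model.pth", map_location="cpu")  # hypothetical checkpoint name
model.eval()
with torch.no_grad():
    pred = model(x).argmax(dim=1).item()
print(EMOTIONS[pred])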
      
      
  • How to use in ROS

    1. Run audio_capture.launch.
    2.  rosrun speech-emotion-ros predict.py
    3.  rosrun speech-emotion-ros command.py
    4. Press the 'start' button to start recording, and the 'end' button to stop recording and run prediction. A minimal node sketch follows below.
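For orientation, here is a minimal sketch of how a node like predict.py could tie the pieces together: subscribe to the microphone audio published by audio_capture, listen for the start/end command, and publish the result on the custom Audio_result message. The topic names and the message fields used below are assumptions; this README only states that Audio_result.msg and command.msg are added to audio_common_msgs.

#!/usr/bin/env python
import rospy
from audio_common_msgs.msg import AudioData
# Generated from the custom Audio_result.msg / command.msg files; their actual
# field layouts are not documented here, so the attributes below are assumptions.
from audio_common_msgs.msg import Audio_result, command

buffer = []
recording = False

def on_audio(msg):
    # audio_capture publishes raw audio bytes in AudioData.data.
    if recording:
        buffer.append(bytes(msg.data))

def on_command(msg):
    global recording, buffer
    if msg.command == "start":          # hypothetical field name
        buffer, recording = [], True
    elif msg.command == "end":
        recording = False
        result = Audio_result()
        result.emotion = run_model(b"".join(buffer))   # hypothetical field name
        result_pub.publish(result)

def run_model(raw_bytes):
    # Placeholder: decode the audio, build the mel-spectrogram image, run
    # DenseNet121, and return one of angry/neutral/sad/happy (see sketches above).
    return "neutral"

rospy.init_node("speech_emotion_predict")
result_pub = rospy.Publisher("audio_result", Audio_result, queue_size=1)
rospy.Subscriber("audio", AudioData, on_audio)
rospy.Subscriber("command", command, on_command)
rospy.spin()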

Reference