
Inspired by the Microsoft Research project on speech emotion recognition (SER) using an extreme learning machine (ELM)

Speech Emotion Recognition on Utterance Level

Original Aim

Emotions are fundamental to humans, shaping perception and everyday activities such as communication, learning, and decision-making. Speech emotion recognition (SER) has recently been drawing increasing attention. It remains a very challenging task: machines have no inherent understanding of human emotional states, and how to extract effective emotional features from speech is still an open question.

In this project, we explore several contributions in this area and try to identify which algorithms perform well at detecting emotion from speech. We are particularly interested in comparing human-engineered features with raw representations of human speech.


The data used for this project is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which comes from the Signal Analysis and Interpretation Laboratory at the University of Southern California. It contains 12 hours of audiovisual data, including video, speech, facial motion capture, and text transcriptions [3]. In the recordings, professional actors improvise and perform scripted versions of a series of semantically neutral utterances spanning ten distinct emotional categories. There are 5 female speakers and 5 male speakers. The count and ratio of utterances belonging to each emotion category are shown in the table below.

Emotion   Ang     Hap     Exc     Neu     Sad
Counts    1103    595     1041    1708    1084
Ratio     19.9%   10.8%   18.8%   30.9%   19.6%
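
As a quick sanity check, the ratios in the table can be recomputed from the counts. This is only a small sketch; the variable names are illustrative and not part of the project code.

```python
# Utterance counts per emotion category, copied from the table above.
counts = {"Ang": 1103, "Hap": 595, "Exc": 1041, "Neu": 1708, "Sad": 1084}

total = sum(counts.values())

# Share of each category, as a percentage rounded to one decimal place.
ratios = {label: round(100.0 * n / total, 1) for label, n in counts.items()}

for label, pct in ratios.items():
    print(f"{label}: {counts[label]} utterances, {pct}%")
```

The five categories sum to 5531 utterances, and the recomputed percentages match the table. Neutral is the largest class and happy the smallest, so the classes are noticeably imbalanced.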


- Support Vector Machine

- K Nearest Neighbors

- Deep Neural Networks

- Extreme Learning Machine


Low-Level Descriptors:            MFCC, Mel-filterbank, formant, HNR, jitter, shimmer, etc.
High-Level Statistical Functions: mean, variance, max, min, median, etc.


Segment-level Feature Extraction

  • MFCC

    Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up a mel-frequency cepstrum (MFC). They are derived from a cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). (Source: Wikipedia)

  • Harmonic to Noise Ratio

  • Pitch Period
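
To make the "spectrum-of-a-spectrum" idea behind MFCCs concrete, here is a minimal pure-Python sketch for a single frame: power spectrum, triangular mel filterbank, log, then a DCT. It is not the project's extraction code. The function names, the Hamming window, the mel formula 2595·log10(1 + f/700), the 20 filters, and the 12 coefficients are all assumptions chosen for illustration, and the naive DFT is only suitable for short frames.

```python
import cmath
import math

def hz_to_mel(f):
    # A common mel-scale convention (assumed here).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-windowed power spectrum via a naive DFT (bins 0..n/2)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * sample_rate / n_fft
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def mfcc_frame(frame, sample_rate, n_filters=20, n_coeffs=12):
    """First n_coeffs DCT-II coefficients (including the 0th) of the log mel energies."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Floor the energies so that an empty filter does not produce log(0).
    log_energies = [math.log(max(sum(w * p for w, p in zip(row, spectrum)), 1e-10))
                    for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for c in range(n_coeffs)]

# Usage: one 256-sample frame of a 440 Hz tone sampled at 8 kHz.
frame = [math.sin(2.0 * math.pi * 440.0 * i / 8000.0) for i in range(256)]
coeffs = mfcc_frame(frame, 8000)
print(len(coeffs))
```

In practice an audio library with an FFT would replace the naive DFT; the sketch only shows the order of the steps.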

Utterance-level Feature Extraction

  • Maximal

  • Minimal

  • Average

  • Percentage above a certain threshold
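
The four statistics above can be sketched as a small function that collapses a sequence of per-segment values into one utterance-level feature vector. What the per-segment values are (for example, one emotion score per segment) and the threshold value are assumptions here, and `utterance_features` is a hypothetical name.

```python
def utterance_features(segment_values, threshold):
    """Summarize per-segment values of one utterance with four statistics."""
    if not segment_values:
        raise ValueError("an utterance needs at least one segment")
    n = len(segment_values)
    return {
        "max": max(segment_values),
        "min": min(segment_values),
        "mean": sum(segment_values) / n,
        # Percentage of segments whose value is strictly above the threshold.
        "pct_above": 100.0 * sum(v > threshold for v in segment_values) / n,
    }

# Usage: four segments of one utterance, with a hypothetical threshold of 0.6.
features = utterance_features([0.1, 0.9, 0.5, 0.7], threshold=0.6)
print(features)
```

Applying the function once per emotion category and concatenating the results would give a fixed-length vector regardless of how many segments the utterance contains.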

Introduction to the Extreme Learning Machine

  • Network architectures: a homogeneous hierarchical learning machine for partially or fully connected multi-layer or single-layer (artificial or biological) networks with almost any type of practical (artificial) hidden nodes (or biological neurons).

  • Learning theories: learning can be done without iteratively tuning the (artificial) hidden nodes (or biological neurons).

  • Learning algorithms: general, unifying, and universal (optimization-based) learning frameworks for compression, feature learning, clustering, regression, and classification. Basic steps:

    1. Learning is done layer-wise (as a white box)

    2. Randomly generate (any nonlinear piecewise) hidden neurons, or inherit hidden neurons from ancestors

    3. Learn the output weights in each hidden layer (with application-based optimization constraints)
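
The basic steps can be illustrated with a minimal single-hidden-layer sketch in pure Python. It is not the project's implementation: the tanh activation, the uniform random initialization, the ridge-regularized least-squares solution for the output weights, and all function names are assumptions made for illustration. The hidden nodes are generated randomly and never tuned; only the output weights are learned, in closed form.

```python
import math
import random

def solve(a, b):
    """Solve a @ x = b (square a, matrix b) by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def hidden_output(x, weights, biases):
    """Activations of the random hidden layer for one input vector."""
    return [math.tanh(sum(w * xi for w, xi in zip(w_row, x)) + b)
            for w_row, b in zip(weights, biases)]

def train_elm(xs, targets, n_hidden=20, ridge=1e-3, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(xs[0]), len(targets[0])
    # Randomly generate the hidden nodes; they are never tuned afterwards.
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    h = [hidden_output(x, weights, biases) for x in xs]
    # Learn the output weights: beta = (H^T H + ridge * I)^-1 H^T T.
    hth = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    htt = [[sum(row[i] * t[k] for row, t in zip(h, targets))
            for k in range(n_out)] for i in range(n_hidden)]
    beta = solve(hth, htt)
    return weights, biases, beta

def predict_elm(model, x):
    """Return the index of the output with the highest score."""
    weights, biases, beta = model
    h = hidden_output(x, weights, biases)
    scores = [sum(hi * beta[i][k] for i, hi in enumerate(h))
              for k in range(len(beta[0]))]
    return scores.index(max(scores))

# Usage: a toy two-class problem with one-hot targets (made-up data).
xs = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
targets = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
model = train_elm(xs, targets)
print([predict_elm(model, x) for x in xs])
```

Because training reduces to one linear solve, there is no iterative tuning loop. The ridge term is one example of an optimization constraint on the output weights; a plain pseudo-inverse solution is another common choice.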

Other Models


