Learning representations for emotions in speech
Note: We use the voice encoder from the Resemblyzer project: https://github.com/resemble-ai/Resemblyzer
Each speech file encodes the following information in its filename as a sequence of two-digit fields separated by '-' (a parsing sketch follows the list):
- Modality: (01 = full-AV, 02 = video-only, 03 = audio-only).
- Vocal channel: (01 = speech, 02 = song).
- Emotion: (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
- Emotional intensity: (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
- Statement: (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
- Repetition: (01 = 1st repetition, 02 = 2nd repetition).
- Actor: (01 to 24. Odd-numbered actors are male, even-numbered actors are female).
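A minimal sketch of how such a filename can be split into its fields; the function name and example filename below are illustrative, not part of the repository:

```python
from pathlib import Path

# Order of the two-digit fields in a RAVDESS filename, per the list above.
FIELDS = ('modality', 'vocal_channel', 'emotion',
          'intensity', 'statement', 'repetition', 'actor')

def parse_ravdess_filename(path):
    """Map each '-' separated field of the filename to its meaning."""
    return dict(zip(FIELDS, Path(path).stem.split('-')))

print(parse_ravdess_filename('03-01-05-02-01-02-12.wav'))
# {'modality': '03', 'vocal_channel': '01', 'emotion': '05', 'intensity': '02',
#  'statement': '01', 'repetition': '02', 'actor': '12'}
```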
- If a folder named 'data' is missing from the base folder, create one in the main folder (speech_emotions) with 'raw', 'interim', and 'processed' subfolders inside it, as in the sketch below.
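A quick way to create these folders (paths assumed relative to the speech_emotions folder):

```python
from pathlib import Path

# Create data/raw, data/interim, and data/processed; run from the speech_emotions folder.
for sub in ('raw', 'interim', 'processed'):
    Path('data', sub).mkdir(parents=True, exist_ok=True)
```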
Replace the 'skip' argument with 'run' in the steps you want to run, as shown below (a minimal sketch of such a gate follows the examples):
- To skip : with skip_run('skip', 'download_RAVDESS_data') as check, check():
- To run : with skip_run('run', 'download_RAVDESS_data') as check, check():
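The actual skip_run helper lives in this repository's utilities; the sketch below only illustrates how such a skip/run gate can be built, and its names and messages may differ from the real implementation:

```python
from contextlib import contextmanager

class SkipWith(Exception):
    """Raised by the inner gate to abort the with-block body."""

@contextmanager
def skip_run(flag, name):
    """Illustrative skip/run gate; the repository's own version may differ."""
    @contextmanager
    def check():
        if flag == 'skip':
            # Raising during __enter__ aborts the body; skip_run suppresses it below.
            raise SkipWith()
        print(f"Running: {name}")
        yield

    try:
        yield check
    except SkipWith:
        print(f"Skipping: {name}")

# Only the block whose flag is 'run' executes.
with skip_run('skip', 'download_RAVDESS_data') as check, check():
    print("downloading...")          # skipped

with skip_run('run', 'extract_features') as check, check():
    print("extracting features...")  # runs
```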
All files are stored as .h5 files in the data folder, e.g. "data/interim/filename.h5".
Structure of the dictionaries:
- speech1 ("Kids are talking by the door") and speech2 ("Dogs are sitting by the door") are stored as separate dictionaries.
- The structure of each dictionary is data['Actor_#']['emotion_#']['intensity_#']['repete_#'], where # are the numbers given in the file information above (a loading sketch follows).
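A minimal loading sketch, assuming the nested dictionaries were saved with deepdish (dd.io.save); the file name below is illustrative:

```python
import deepdish as dd

# Load the dictionary for statement 1 ("Kids are talking by the door").
speech1 = dd.io.load('data/interim/speech1.h5')  # illustrative file name

# Keys follow the numbering described in the filename information above.
sample = speech1['Actor_01']['emotion_03']['intensity_01']['repete_01']
print(type(sample), getattr(sample, 'shape', None))
```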
Some important resources:
- Discovering Neural Wirings: https://mitchellnw.github.io/blog/2019/dnw/
- The Super Duper NLP Repo: https://notebooks.quantumstat.com/
- Variational autoencoders: https://www.jeremyjordan.me/variational-autoencoders/
- Building an end-to-end Speech Recognition model in PyTorch
Code Resources:
- Beta-VAE: https://github.com/1Konny/Beta-VAE
- PyTorch-VAE: https://github.com/AntixK/PyTorch-VAE
- Semi-Supervised PyTorch: https://github.com/wnhsu/semi-supervised-pytorch
- Factorized Hierarchical Variational Autoencoders: https://github.com/wnhsu/FactorizedHierarchicalVAE
- Predictive Speech VAE
Papers: 1) Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
STFT-based audio-to-mel-spectrogram conversion performs better than librosa's direct audio-to-mel conversion. Use the STFT mel spectrogram with WaveGlow for better audio -> mel -> audio conversions.
- Pipeline: audio -> melspectrogram -> power_to_db -> used in our models -> db_to_power -> inverse.mel_to_audio -> audio (see the sketch after this list).
- 80-mel spectrograms have higher frequency resolution than 40-mel spectrograms, but the pattern is captured in both.
- Requires 80 mels with n_fft = 1024, hop_length = 256, win_length = 1024.
- Good audio generation at a sampling rate of 22050 Hz.
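A round-trip sketch of the pipeline above using librosa with the parameters listed; the input path is illustrative, and librosa's Griffin-Lim inversion stands in for the step a neural vocoder such as WaveGlow would replace:

```python
import librosa
import soundfile as sf

sr, n_fft, hop_length, win_length, n_mels = 22050, 1024, 256, 1024, 80

# Load a RAVDESS clip (illustrative path) at 22050 Hz.
y, _ = librosa.load('data/raw/Actor_01/03-01-05-02-01-02-01.wav', sr=sr)

# audio -> mel spectrogram (power) -> dB; the dB spectrogram is what the models see.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                     hop_length=hop_length,
                                     win_length=win_length, n_mels=n_mels)
mel_db = librosa.power_to_db(mel)

# dB -> power -> audio, inverting the mel spectrogram with Griffin-Lim.
mel_power = librosa.db_to_power(mel_db)
y_rec = librosa.feature.inverse.mel_to_audio(mel_power, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length,
                                             win_length=win_length)
sf.write('reconstructed.wav', y_rec, sr)
```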
Care must be taken if np.random.randint is used inside __getitem__ of a torch.utils.data.Dataset when data is loaded in parallel with a DataLoader using num_workers > 0: forked workers inherit the same NumPy seed and will produce identical "random" values. For more information, see the post "Using PyTorch + NumPy? You're making a mistake." A workaround sketch follows.
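A sketch of the common workaround: reseed NumPy in each worker via worker_init_fn (the dataset here is a toy stand-in, not part of the repository):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyRandomDataset(Dataset):
    """Toy dataset that draws a 'random' number in __getitem__."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Without reseeding, forked workers inherit the same NumPy state and
        # return identical values here.
        return np.random.randint(0, 1000)

def worker_init_fn(worker_id):
    # Derive a distinct NumPy seed for each worker from PyTorch's per-worker seed.
    np.random.seed(torch.initial_seed() % 2**32)

if __name__ == '__main__':
    loader = DataLoader(ToyRandomDataset(), batch_size=2, num_workers=2,
                        worker_init_fn=worker_init_fn)
    for batch in loader:
        print(batch)
```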