This repo contains a "simplified" implementation of SERAB, which includes:
- BYOL-A training and utility functions (Original repo: https://github.com/nttcslab/byol-a)
- BYOL-A and transformer-inspired models
- Kudos to Phil Wang for his implementation of CvT (https://github.com/lucidrains/vit-pytorch)
- Benchmark tests for SERAB
- TFDS scripts to load SERAB data
Update: BYOL-S was one of the strongest submissions of the HEAR NeurIPS 2021 Challenge! Leaderboard results: https://neuralaudio.ai/hear2021-results.html
Libraries to reproduce the environment are detailed in `serab.yml`.
To reproduce the environment, run:

```bash
conda env create -f serab.yml
```
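To work inside the environment afterwards, activate it. The name below is an assumption; use whatever is set in the `name:` field of `serab.yml`:

```bash
conda activate serab  # assumes serab.yml names the environment "serab"
```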
To install the external source files from patches, run the following after cloning the repo:
```bash
cd SERAB/
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/config.yaml
patch --ignore-whitespace < config.diff
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/train.py
patch < train.diff
cd byol_a/
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/byol_a/augmentations.py
patch < augmentations.diff
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/byol_a/common.py
patch < common.diff
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/byol_a/dataset.py
patch < dataset.diff
curl -O https://raw.githubusercontent.com/nttcslab/byol-a/f2451c366d02be031a31967f494afdf3485a85ff/byol_a/models.py
mv models.py models/audio_ntt.py
```
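Based on the commands above, the patched files should then sit in the following locations:

```
SERAB/
├── config.yaml
├── train.py
└── byol_a/
    ├── augmentations.py
    ├── common.py
    ├── dataset.py
    └── models/
        └── audio_ntt.py
```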
In this simplified version, only PyTorch models can be used.
Before running the evaluation, make sure that the config file `config.yaml` is correctly set up for your model.
To run a pre-existing model, run:

```bash
python clf_benchmark.py --model_name {MODEL_NAME} --dataset_name {DATASET_NAME}
```
By default, grid-search-based classifier hyperparameter optimization is performed. To run a pre-existing model with the "default" classifiers instead, add the `--model_selection none` flag:

```bash
python clf_benchmark.py --model_name {MODEL_NAME} --dataset_name {DATASET_NAME} --model_selection none
```
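For intuition, grid-search-based optimization trains a shallow classifier on top of frozen embeddings. Below is a minimal scikit-learn sketch of that idea, not the repo's actual code; the embedding matrix `X`, labels `y`, classifier choice, and parameter grid are all placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: X would be frozen-model embeddings, y the emotion labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2048))  # (n_utterances, embedding_dim)
y = rng.integers(0, 6, size=120)  # e.g. six emotion classes

# Grid-search a shallow classifier on top of the embeddings.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```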
To run a model on all the SERAB datasets, DVC can be used. Make the appropriate modifications in `dvc.yaml` and run:

```bash
dvc repro
```
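As a rough illustration, one `dvc.yaml` stage per dataset might look like the sketch below. The stage name and dataset name are hypothetical; the `dvc.yaml` shipped in the repo defines the authoritative stages:

```yaml
stages:
  benchmark_emodb:  # hypothetical stage name
    cmd: python clf_benchmark.py --model_name {MODEL_NAME} --dataset_name emodb
    deps:
      - clf_benchmark.py
```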
Models can be pre-trained on a subsample of AudioSet that only contains speech.
You may need to make changes in `train.py` and `config.yaml` before starting training.
To train a model, run:

```bash
python train.py {MODEL_NAME}  # or: dvc repro
```
As training is usually long (10-20 h, depending on the model), we recommend using tmux so you can detach from and reattach to the training session.
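For example (the session name is arbitrary):

```bash
tmux new -s serab_train        # start a named session
python train.py {MODEL_NAME}   # launch training inside it
# Detach with Ctrl-b d; reattach later with:
tmux attach -t serab_train
```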
While CREMA-D and SAVEE are already integrated into TFDS, the other datasets were added as custom TFDS datasets.
The code to load these datasets can be found in `tensorflow_datasets`.
Here are the steps to download and load the SERAB datasets:
- In the `tensorflow_datasets` folder, create the folders `download/manual`
- Download the compressed datasets (.zip files) under `tensorflow_datasets/download/manual/`
Links to the SERAB datasets:
- AESDD: http://m3c.web.auth.gr/research/aesdd-speech-emotion-recognition/
- CaFE: https://zenodo.org/record/1478765
- EmoDB: http://emodb.bilderbar.info/download/
- EMOVO: http://voice.fub.it/activities/corpora/emovo/index.html
- IEM4 (restricted access): https://sail.usc.edu/iemocap/
- RAVDESS: https://smartlaboratory.org/ravdess/
- SAVEE (restricted access): http://kahlan.eps.surrey.ac.uk/savee/Download.html
- ShEMO: https://github.com/mansourehk/ShEMO
- SUBESCO: https://zenodo.org/record/4526477#.YcyUeGjMJPY
- Ensure that all samples in a given dataset are either all mono or all stereo! You can use `stereo_to_mono.py` in `serab.utils` to convert all stereo audio files to mono (a sketch of the idea follows these steps).
- Build each dataset using the TFDS CLI:

```bash
cd tensorflow_datasets/{DATASET_NAME}
tfds build  # Download and prepare the dataset to `~/tensorflow_datasets/`
```
The datasets are now ready to use!
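For reference, stereo-to-mono conversion amounts to averaging the channels. Here is a minimal sketch of the idea (not the repo's actual `stereo_to_mono.py`), assuming the `soundfile` library and a hypothetical file path:

```python
import soundfile as sf

# Hypothetical path; point this at one of the downloaded audio files.
path = "download/manual/example.wav"

data, sr = sf.read(path)  # (n_frames,) if mono, (n_frames, n_channels) if stereo
if data.ndim == 2:
    data = data.mean(axis=1)  # average the channels -> mono
sf.write(path, data, sr)  # overwrite with the mono version
```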
If you are using this code, please cite the paper:
```bibtex
@article{scheidwasser2021serab,
  title={SERAB: A multi-lingual benchmark for speech emotion recognition},
  author={Scheidwasser-Clow, Neil and Kegler, Mikolaj and Beckmann, Pierre and Cernak, Milos},
  journal={arXiv preprint arXiv:2110.03414},
  year={2021}
}
```