voice_datasets

A comprehensive list of open source voice and music datasets. I released this for the talk at the VOICE Summit 2019.

Audio datasets

There are two main types of audio datasets: speech datasets and audio event/music datasets.

Speech datasets

AESDD - around 500 utterances by a diverse group of actors (over 5 actors) simulating various emotions.
ANAD - 1384 recordings by multiple speakers; 3 emotions: angry, happy, surprised.
Arabic Speech Corpus - The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech aligned with recorded speech on the phoneme level. The annotations include word stress marks on the individual phonemes.
ASR datasets - A list of publicly available audio data that anyone can download for ASR or other speech activities.
AudioMNIST - The dataset consists of 30000 audio samples of spoken digits (0-9) of 60 different speakers
Awesome_Diarization - A curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.
BAVED - 1935 recordings by 61 speakers (45 male and 16 female).
CaFE - 6 different sentences by 12 speakers (6 females + 6 males).
Common Voice - Common Voice is Mozilla's initiative to help teach machines how real people speak. 12GB in size; spoken text based on text from a number of public domain sources like user-submitted blog posts, old books, movies, and other public speech corpora.
CHIME - This is a noisy speech recognition challenge dataset (~4GB in size). The dataset contains real, simulated, and clean voice recordings. Real recordings are actual recordings of 4 speakers in nearly 9000 recordings over 4 noisy locations, simulated recordings are generated by combining multiple environments over speech utterances, and clean recordings are non-noisy.
Coswara - A database that contains respiratory sounds, namely, cough, breath, and speech of healthy and COVID-19 positive individuals.
CMU-MOSEI - 65 hours of annotated video from more than 1000 speakers and 250 topics; 6 emotions (happiness, sadness, anger, fear, disgust, surprise) + Likert scale.
CMU-MOSI - 2199 opinion utterances with annotated sentiment; sentiment annotated from very negative to very positive in seven Likert steps.
CMU Wilderness - (noncommercial) - not available, but a great speech dataset with many accents reciting passages from the Bible.
CREMA-D - CREMA-D is a data set of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).
DAPS Dataset - DAPS consists of 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books (which provides about 14 minutes of data per speaker).
Deep Clustering Dataset - Training deep discriminative embeddings to solve the cocktail party problem.
DEMoS - 9365 emotional and 332 neutral samples produced by 68 native speakers (23 females, 45 males); 7/6 emotions: anger, sadness, happiness, fear, surprise, disgust, and the secondary emotion guilt.
DES - 4 speakers (2 males and 2 females); 5 emotions: neutral, surprise, happiness, sadness and anger.
DIPCO - Dinner Party Corpus - The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes.
EEKK - 26 text passages read by 10 speakers; 4 main emotions: joy, sadness, anger and neutral.
Emo-DB - 800 recordings spoken by 10 actors (5 males and 5 females); 7 emotions: anger, neutral, fear, boredom, happiness, sadness, disgust.
EmoFilm - 1115 audio instances of sentences extracted from various films.
EmoSynth - 144 audio files labelled by 40 listeners; emotion (no speech) defined in terms of valence and arousal.
Emotional Voices Database - various emotions with 5 voice actors (amused, angry, disgusted, neutral, sleepy).
Emotional Voice dataset - Nature - 2,519 speech samples produced by 100 actors from 5 cultures. With large-scale statistical inference methods, we find that prosody can communicate at least 12 distinct kinds of emotion that are preserved across the 2 cultures.
EmotionTTS - Recordings and their associated transcriptions by a diverse group of speakers - 4 emotions: general, joy, anger, and sadness.
Emov-DB - Recordings for 4 speakers (2 males and 2 females); the emotional styles are neutral, sleepiness, anger, disgust and amused.
EMOVO - 6 actors who played 14 sentences; 6 emotions: disgust, fear, anger, joy, surprise, sadness.
eNTERFACE05 - Videos by 42 subjects, coming from 14 different nationalities; 6 emotions: anger, fear, surprise, happiness, sadness and disgust.
Free Spoken Digit Dataset - 4 speakers, 2,000 recordings (50 of each digit per speaker), English pronunciations.
Flickr Audio Caption - 40,000 spoken captions of 8,000 natural images, 4.2 GB in size.
GEMEP corpus - 10 actors portraying 10 states; 12 emotions: amusement, anxiety, cold anger (irritation), despair, hot anger (rage), fear (panic), interest, joy (elation), pleasure (sensory), pride, relief, and sadness. Plus, 5 additional emotions: admiration, contempt, disgust, surprise, and tenderness.
IEMOCAP - 12 hours of audiovisual data by 10 actors; 5 emotions: happiness, anger, sadness, frustration and neutral.
ISOLET Data Set - This 38.7 GB dataset helps predict which letter-name was spoken — a simple classification task.
JL corpus - 2400 recordings of 240 sentences by 4 actors (2 males and 2 females); 5 primary emotions: angry, sad, neutral, happy, excited. 5 secondary emotions: anxious, apologetic, pensive, worried, enthusiastic.
Keio-ESD - A set of human speech with vocal emotion spoken by a Japanese male speaker; 47 emotions including angry, joyful, disgusting, downgrading, funny, worried, gentle, relief, indignation, shameful, etc.
LEGO Corpus - 347 dialogs with 9,083 system-user exchanges; emotions classified as garbage, non-angry, slightly angry and very angry.
Libriadapt - It is primarily designed to facilitate domain adaptation research for ASR models, and contains three types of domain shifts in the data.
Libri-CSS - derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones.
LibriMix - LibriMix is an open source dataset for source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it. It will also enable cross-dataset experiments.
Librispeech - LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech derived from audiobooks from the LibriVox project.
LJ Speech - This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
Microsoft Scalable Noisy Speech Dataset - The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) is a noisy speech dataset that can scale to arbitrary sizes depending on the number of speakers, noise types, and Speech to Noise Ratio (SNR) levels desired.
MSP-IMPROV - 20 sentences by 12 actors; 4 emotions: angry, sad, happy, neutral, other, without agreement
MSP Podcast Corpus - 100 hours by over 100 speakers - annotated with emotional labels using attribute-based descriptors (activation, dominance and valence) and categorical labels (anger, happiness, sadness, disgust, surprised, fear, contempt, neutral and other).
Multimodal EmotionLines Dataset (MELD) - Multimodal EmotionLines Dataset (MELD) has been created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but it also encompasses audio and visual modalities along with text. MELD has more than 1400 dialogues and 13000 utterances from the Friends TV series. Each utterance in a dialogue has been labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, and Fear.
MuSe-CAR - 40 hours, 6,000+ recordings of 25,000+ sentences by 70+ English speakers (15 GB).
NISQA Speech Quality Corpus - includes 14k speech samples with simulated (codecs, packet-loss, background noise) and live (mobile phone, Zoom, Skype, WhatsApp) voice call degradation conditions. Each file is labelled with subjective ratings of the overall quality and the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness.
Noisy Dataset - Clean and noisy parallel speech database. The database was designed to train and test speech enhancement methods that operate at 48 kHz. Also known as VBD, Voice Bank + DEMAND. Speech samples are from the VCTK dataset.
OGVC - 9114 spontaneous utterances and 2656 acted utterances by 4 professional actors (two male and two female); 9 emotional states: fear, surprise, sadness, disgust, anger, anticipation, joy, acceptance and the neutral state.
OpenSLR - Many audio datasets (>109) published for speech recognition purposes.
Parkinson's speech dataset - The training data belongs to 20 Parkinson’s Disease (PD) patients and 20 healthy subjects. From all subjects, multiple types of sound recordings (26) are taken for this 20 MB set.
Persian Consonant Vowel Combination (PCVC) Speech Dataset - The Persian Consonant Vowel Combination (PCVC) Speech Dataset is a Modern Persian speech corpus for speech recognition and also speaker recognition. This dataset contains 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of vowels and consonants (138 samples for each speaker) with a length of 30000 data samples.
RECOLA - 3.8 hours of recordings by 46 participants; negative and positive sentiment (valence and arousal).
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) - The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions.
sample_voice_data - 52 audio files per class (males and females) for testing purposes.
SAVEE Dataset - 4 male actors in 7 different emotions, 480 British English utterances in total.
SEMAINE - 95 dyadic conversations from 21 subjects. Each subject converses with another playing one of four characters with emotions; 5 FeelTrace annotations: activation, valence, dominance, power, intensity.
SER Datasets - A collection of datasets for the purpose of emotion recognition/detection in speech.
SEWA - more than 2000 minutes of audio-visual data of 398 people (201 male and 197 female) coming from 6 cultures; emotions are characterized using valence and arousal.
ShEMO - 3000 semi-natural utterances, equivalent to 3 hours and 25 minutes of speech data from online radio plays by 87 native-Persian speakers; 6 emotions: anger, fear, happiness, sadness, neutral and surprise.
SparseLibriMix - An open source dataset for source separation in noisy environments and with variable overlap-ratio. Due to insufficient noise material this is a test-set-only version.
Speech Accent Archive - For various accent detection tasks.
Speech Commands Dataset - The dataset (1.4 GB) has 65,000 one-second long utterances of 30 short words, by thousands of different people, contributed by members of the public through the AIY website.
Spoken Commands dataset - A large database of free audio samples (10M words), a test bed for voice activity detection algorithms and for recognition of syllables (single-word commands). 3 speakers, 1,500 recordings (50 of each digit per speaker), English pronunciations. This is a really small set, about 10 MB in size.
Spoken Wikipedia Corpora - 38 GB in size, available both with and without audio.
Tatoeba - Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.
Ted-LIUM - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website (noncommercial).
TESS - 2800 recordings by 2 actresses; 7 emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral.
Thorsten dataset - German language dataset, 22,668 recorded phrases, 23 hours of audio, phrase length 52 characters on average.
TIMIT dataset - TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. It includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance (have to pay).
URDU-Dataset - 400 utterances by 38 speakers (27 male and 11 female); 4 emotions: angry, happy, neutral, and sad.
VCTK dataset - 110 English speakers with various accents; each speaker reads out about 400 sentences. Samples are mostly 2–6 s long, at 48 kHz 16 bits, for a total dataset size of ~10 GiB.
VCTK-2Mix - VCTK-2Mix is an open source dataset for source separation in noisy environments. It is derived from VCTK signals and WHAM noise. It is meant as a test set. It will also enable cross-dataset experiments.
VIVAE - non-speech, 1085 audio files by ~12 speakers; 6 emotions: achievement, anger, fear, pain, pleasure, and surprise with 4 emotional intensities (low, moderate, strong, peak).
Voice Gender Detection - GitHub repo for Voice gender detection using the VoxCeleb dataset (7000+ unique speakers and utterances, 3683 males / 2312 females).
VOiCES Dataset - The Voices Obscured in Complex Environmental Settings (VOiCES) corpus is a creative commons speech dataset targeting acoustically challenging and reverberant environments with robust labels and truth data for transcription, denoising, and speaker identification.
VoxCeleb - VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. The data is mostly gender balanced (males comprise 55%). The celebrities span a diverse range of accents, professions, and ages. There is no overlap between the development and test sets. It’s an intriguing use case for isolating and identifying which superstar the voice belongs to.
VoxForge - VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines.
VoxPopuli - VoxPopuli provides 100K hours of unlabelled speech data for 23 languages, 1.8K hours of transcribed speech data for 16 languages, and 17.3K hours of speech-to-speech interpretation data for 16x15 directions.
WHAM! and WHAMR! - The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background scene. WHAMR! is an extension to WHAM! that adds artificial reverberation to the speech signals in addition to the background noise. The noise audio was collected at various urban locations throughout the San Francisco Bay Area in late 2018. The environments primarily consist of restaurants, cafes, bars, and parks. Size of WHAM! dataset: 17.65 GB unzipping to 35 GB.
Zero Resource Speech Challenge - The ultimate goal of the Zero Resource Speech Challenge is to construct a system that learns an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only information available to a language learning infant. “Zero resource” refers to zero linguistic expertise (e.g., orthographic/linguistic transcriptions), not zero information besides audio (visual, limited human feedback, etc). The fact that 4-year-olds spontaneously learn a language without supervision from language experts shows that this goal is theoretically reachable.
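
Several entries above quote a sample rate, bit depth, and total duration (for example TIMIT at 16-bit, 16 kHz and VCTK at 48 kHz, 16 bits). As a rough planning aid before downloading, the sketch below estimates uncompressed PCM size from those numbers and the mean clip length from a clip count. The function names are hypothetical, and the mono and 16-bit defaults are assumptions wherever an entry does not state them; compressed downloads (FLAC, MP3) will be smaller than these estimates.

```python
def pcm_size_bytes(hours, sample_rate_hz, bit_depth=16, channels=1):
    """Estimate the size of uncompressed PCM audio in bytes."""
    seconds = hours * 3600
    bytes_per_sample = bit_depth // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


def average_clip_seconds(total_hours, n_clips):
    """Mean clip length in seconds for a corpus of n_clips clips."""
    return total_hours * 3600 / n_clips


if __name__ == "__main__":
    # LibriSpeech: ~1000 hours at 16 kHz (16-bit mono is an assumption).
    size = pcm_size_bytes(1000, 16_000)
    print(f"LibriSpeech raw PCM: ~{size / 1e9:.1f} GB")  # ~115.2 GB

    # LJ Speech: 13,100 clips, ~24 hours in total.
    mean_len = average_clip_seconds(24, 13_100)
    print(f"LJ Speech mean clip: ~{mean_len:.1f} s")  # ~6.6 s
```

The LJ Speech result is consistent with the stated clip range of 1 to 10 seconds.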

Audio events and music

AudioSet - An expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. Easily download AudioSet here.
Bird audio detection challenge - This challenge contained new datasets (5.4 GB) collected in real live bio-acoustics monitoring projects, and an objective, standardized evaluation framework.
Environmental audio dataset - Audio data collection and manual data annotation are both tedious processes, and the lack of proper development datasets limits fast development in environmental audio research.
Free Music Archive - FMA is a dataset for music analysis. 1000 GB in size.
Freesound dataset - 678,511 candidate annotations that express the potential presence of sound sources in audio clips. See https://annotator.freesound.org/ and https://annotator.freesound.org/fsd/explore/ for more information.
Karoldvl-ESC - The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.
Million Song Dataset - The Million Song Dataset is a freely-available collection of audio features and meta-data for a million contemporary popular music tracks. 280 GB in size.
MUSDB18 - Multi-track music dataset for music source separation. 150 tracks (22 GB).
Public domain sounds - Good for wake word detection; a wide array of sounds that can be used for object detection research (524 MB - 635 SOUNDS - Open for public use).
RSC Sounds - RSC sounds from RuneScape Classic (8-bit, u-law encoded, 8000 Hz pcm samples).
Urban Sound Dataset - two datasets and a taxonomy for urban sound research.
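
The RSC Sounds entry describes 8-bit, u-law encoded samples at 8000 Hz. Below is a minimal sketch of expanding such bytes to 16-bit linear PCM and saving them as a WAV file. It assumes the data is a headerless stream of standard G.711 mu-law bytes, which has not been verified against that repository; the function names and output path are hypothetical. It uses only the standard library and avoids the `audioop` module, which was removed in Python 3.13.

```python
import struct
import wave

BIAS = 0x84  # G.711 mu-law bias (132)


def ulaw_byte_to_pcm16(u):
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80               # set means a negative sample
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def ulaw_to_wav(ulaw_bytes, wav_path, sample_rate_hz=8000):
    """Decode raw mu-law bytes and write them as a mono 16-bit WAV file."""
    samples = [ulaw_byte_to_pcm16(b) for b in ulaw_bytes]
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)       # 2 bytes = 16-bit output
        wav.setframerate(sample_rate_hz)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```

A call such as `ulaw_to_wav(open("sound.pcm", "rb").read(), "sound.wav")` would convert one file, where `sound.pcm` is a placeholder name.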

Learn more

Any feedback on this repository is greatly appreciated.

Suggest a new dataset to add using this link.
If you want to learn more about voice computing, check out the Voice Computing in Python book.
If you are looking for a framework to start building machine learning models in voice computing, check out Allie.
If you want to talk to me directly or be mentored, please send me an email at js@neurolex.co.
