voice_datasets

A comprehensive list of open source voice and music datasets. I released this for the talk @ the VOICE Summit 2019. If you are looking to engineer your own voice dataset, check out https://surveylex.com/research

Audio datasets

There are two main types of audio datasets: speech datasets and audio event/music datasets.

Speech datasets

Arabic Speech Corpus - The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech aligned with recorded speech on the phoneme level. The annotations include word stress marks on the individual phonemes.
ASR datasets - A list of publically available audio data that anyone can download for ASR or other speech activities
AudioMNIST - The dataset consists of 30000 audio samples of spoken digits (0-9) of 60 different speakers
Awesome_Diarization - A curated list of awesome Speaker Diarization papers, libraries, datasets, and other resources.
Common Voice - Common Voice is Mozilla's initiative to help teach machines how real people speak. 12GB in size; spoken text based on text from a number of public domain sources like user-submitted blog posts, old books, movies, and other public speech corpora.
CHIME - This is a noisy speech recognition challenge dataset (~4GB in size). The dataset contains real simulated and clean voice recordings. Real being actual recordings of 4 speakers in nearly 9000 recordings over 4 noisy locations, simulated is generated by combining multiple environments over speech utterances and clean being non-noisy recordings.
CMU Wilderness - (noncommercial) - not available but a great speech dataset many accents reciting passages from the Bible.
CREMA-D - CREMA-D is a data set of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities (African America, Asian, Caucasian, Hispanic, and Unspecified).
DAPS Dataset - DAPS consists of 20 speakers (10 female and 10 male) reading 5 excerpts each from public domain books (which provides about 14 minutes of data per speaker).
Deep Clustering Dataset - Training deep discriminative embeddings to solve the cocktail party problem.
DIPCO - Dinner Party Corpus - The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes.
Emotional Voices Database - various emotions with 5 voice actors (amused, angry, disgusted, neutral, sleepy).
Emotional Voice dataset - Nature - 2,519 speech samples produced by 100 actors from 5 cultures. With large-scale statistical inference methods, we find that prosody can communicate at least 12 distinct kinds of emotion that are preserved across the 2 cultures.
Free Spoken Digit Dataset -4 speakers, 2,000 recordings (50 of each digit per speaker), English pronunciations.
Flickr Audio Caption - 40,000 spoken captions of 8,000 natural images, 4.2 GB in size.
ISOLET Data Set - This 38.7 GB dataset helps predict which letter-name was spoken — a simple classification task.
Libri-CSS - derived from LibriSpeech by concatenating the corpus utterances to simulate a conversation and capturing the audio replays with far-field microphones.
LibriMix - LibriMix is an open source dataset for source separation in noisy environments. It is derived from LibriSpeech signals (clean subset) and WHAM noise. It offers a free alternative to the WHAM dataset and complements it. It will also enable cross-dataset experiments.
Librispeech - LibriSpeech is a corpus of approximately 1000 hours of 16Khz read English speech derived from read audiobooks from the LibriVox project.
LJ Speech - This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
Microsoft Scalable Noisy Speech Dataset - The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) is a noisy speech dataset that can scale to arbitrary sizes depending on the number of speakers, noise types, and Speech to Noise Ratio (SNR) levels desired.
Multimodal EmotionLines Dataset (MELD) - Multimodal EmotionLines Dataset (MELD) has been created by enhancing and extending EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but it also encompasses audio and visual modality along with text. MELD has more than 1400 dialogues and 13000 utterances from Friends TV series. Each utterance in a dialogue has been labeled with— Anger, Disgust, Sadness, Joy, Neutral, Surprise and Fear.
Noisy Dataset- Clean and noisy parallel speech database. The database was designed to train and test speech enhancement methods that operate at 48kHz. Also known as VBD, Voice Bank + DEMAND. Speech samples from VCTK dataset.
Parkinson's speech dataset - The training data belongs to 20 Parkinson’s Disease (PD) patients and 20 healthy subjects. From all subjects, multiple types of sound recordings (26) are taken for this 20 MB set.
Persian Consonant Vowel Combination (PCVC) Speech Dataset - The Persian Consonant Vowel Combination (PCVC) Speech Dataset is a Modern Persian speech corpus for speech recognition and also speaker recognition. This dataset contains 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of vowels and consonants (138 samples for each speaker) with a length of 30000 data samples.
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) - The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions.
sample_voice_data - 52 audio files per class (males and females) for testing purposes.
SAVEE Dataset - 4 male actors in 7 different emotions, 480 British English utterances in total.
SparseLibriMix - An open source dataset for source separation in noisy environments and with variable overlap-ratio. Due to insufficient noise material this is a test-set-only version.
Speech Accent Archive - For various accent detection tasks.
Speech Commands Dataset - The dataset (1.4 GB) has 65,000 one-second long utterances of 30 short words, by thousands of different people, contributed by members of the public through the AIY website.
Spoken Commands dataset - A large database of free audio samples (10M words), a test bed for voice activity detection algorithms and for recognition of syllables (single-word commands). 3 speakers, 1,500 recordings (50 of each digit per speaker), English pronunciations. This is a really small set- about 10 MB in size.
Spoken Wikipeida Corpora - 38 GB in size available in both audio and without audio format.
Tatoeba - Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.
Ted-LIUM - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website (noncommercial).
Thorsten dataset - German language dataset, 22,668 recorded phrases, 23 hours of audio, phrase length 52 characters on average.
TIMIT dataset - TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. It includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance (have to pay).
VCTK dataset - 110 English speakers with various accents; each speaker reads out about 400 sentences. Samples are mostly 2–6 s long, at 48 kHz 16 bits, for a total dataset size of ~10 GiB.
VCTK-2Mix - VCTK-2Mix is an open source dataset for source separation in noisy environments. It is derived from VCTK signals and WHAM noise. It is meant as a test set. It will also enable cross-dataset experiments.
Voice Gender Detection - GitHub repo for Voice gender detection using the VoxCeleb dataset (7000+ unique speakers and utterances, 3683 males / 2312 females).
VoxCeleb - VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances by 1,251 celebrities, extracted from You Tube videos. The data is mostly gender balanced (males comprise of 55%). The celebrities span a diverse range of accents, professions, and age. There is no overlap between the development and test sets. It’s an intriguing use case for isolating and identifying which superstar the voice belongs to.
VoxForge - VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines.
VoxPopuli - VoxPopuli provides 100K hours of unlabelled speech data for 23 languages, 1.8K hours of transcribed speech data for 16 languages, and 17.3K hours of speech-to-speech interpretation data for 16x15 directions.
WHAM! and WHAMR! - The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset pairs each two-speaker mixture in the wsj0-2mix dataset with a unique noise background scene. WHAMR! is an extension to WHAM! that adds artificial reverberation to the speech signals in addition to the background noise. The noise audio was collected at various urban locations throughout the San Francisco Bay Area in late 2018. The environments primarily consist of restaurants, cafes, bars, and parks. Size of WHAM! dataset: 17.65 GB unzipping to 35 GB.
Zero Resource Speech Challenge - The ultimate goal of the Zero Resource Speech Challenge is to construct a system that learns an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only information available to a language learning infant. “Zero resource” refers to zero linguistic expertise (e.g., orthographic/linguistic transcriptions), not zero information besides audio (visual, limited human feedback, etc). The fact that 4-year-olds spontaneously learn a language without supervision from language experts show that this goal is theoretically reachable.

Audio events and music

AudioSet - An expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. Easily download AudioSet here.
Bird audio detection challenge - This challenge contained new datasets (5.4 GB) collected in real live bio-acoustics monitoring projects, and an objective, standardized evaluation framework.
Environmental audio dataset - Audio data collection and manual data annotation both are tedious processes, and lack of proper development dataset limits fast development in the environmental audio research.
Free Music Archive - FMA is a dataset for music analysis. 1000 GB in size.
Freesound dataset - 678,511 candidate annotations that express the potential presence of sound sources in audio clips. See https://annotator.freesound.org/ and https://annotator.freesound.org/fsd/explore/ for more information.
Karoldvl-ESC - The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.
Million Song Dataset - The Million Song Dataset is a freely-available collection of audio features and meta-data for a million contemporary popular music tracks. 280 GB in size.
MUSDB18 - Multi-track music dataset for music source separation. 150 tracks (22 Gb).
Public domain sounds - Good for wake word detection; a wide array of sounds that can be used for object detection research (524 MB - 635 SOUNDS - Open for public use).
RSC Sounds - RSC sounds from RuneScape Classic (8-bit, u-law encoded, 8000 Hz pcm samples).
Urban Sound Dataset - two datasets and a taxonomy for urban sound research.

Learn more

Any feedback this repository is greatly appreciated.

Suggest a new dataset to add in the using this link.
If you want to learn more about voice computing, check out Voice Computing in Python book.
If you are looking for a framework to start building machine learning models in voice computing, check out Allie.
If you'd like to be mentored by someone on our team, check out the Innovation Fellows Program.
If you want to talk to me directly, please send me an email @ js@neurolex.co.

kuielab/voice_datasets

voice_datasets

Audio datasets

Speech datasets

Audio events and music

Learn more