voice_datasets

A comprehensive list of open source voice and music datasets. I released this for the talk @ the VOICE Summit 2019. If you are looking to engineer your own voice dataset, check out https://surveylex.com/research

Audio datasets

There are two main types of audio datasets: speech datasets and audio event/music datasets.

Speech datasets

2000 HUB5 English - The Hub5 evaluation series focused on conversational speech over the telephone with the particular task of transcribing conversational speech into text. Its goals were to explore promising new areas in the recognition of conversational speech, to develop advanced technology incorporating those ideas and to measure the performance of new technology.
Arabic Speech Corpus - The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech aligned with recorded speech on the phoneme level. The annotations include word stress marks on the individual phonemes.
ASR datasets - A list of publicly available audio data that anyone can download for ASR or other speech activities.
AudioMNIST - The dataset consists of 30,000 audio samples of spoken digits (0-9) from 60 different speakers.
Common Voice - Common Voice is Mozilla's initiative to help teach machines how real people speak. 12 GB in size; spoken text based on text from a number of public domain sources like user-submitted blog posts, old books, movies, and other public speech corpora.
CHIME - This is a noisy speech recognition challenge dataset (~4 GB in size). The dataset contains real, simulated, and clean voice recordings. The real recordings are nearly 9000 actual recordings of 4 speakers over 4 noisy locations, the simulated recordings are generated by combining multiple environments with speech utterances, and the clean recordings are non-noisy.
CMU Wilderness - (noncommercial) - not available, but a great speech dataset with many accents reciting passages from the Bible.
Emotional Voices Database - various emotions with 5 voice actors (amused, angry, disgusted, neutral, sleepy).
Emotional Voice dataset - Nature - 2,519 speech samples produced by 100 actors from 5 cultures. With large-scale statistical inference methods, we find that prosody can communicate at least 12 distinct kinds of emotion that are preserved across the 2 cultures.
Free Spoken Digit Dataset - 4 speakers, 2,000 recordings (50 of each digit per speaker), English pronunciations.
Flickr Audio Caption - 40,000 spoken captions of 8,000 natural images, 4.2 GB in size.
ISOLET Data Set - This 38.7 GB dataset helps predict which letter-name was spoken, a simple classification task.
Librispeech - LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech derived from audiobooks from the LibriVox project.
LJ Speech - This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
Multimodal EmotionLines Dataset (MELD) - MELD was created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but it also encompasses audio and visual modalities along with text. MELD has more than 1400 dialogues and 13000 utterances from the Friends TV series. Each utterance in a dialogue has been labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise, or Fear.
Noisy Dataset - Clean and noisy parallel speech database. The database was designed to train and test speech enhancement methods that operate at 48 kHz.
Parkinson's speech dataset - The training data belongs to 20 Parkinson’s Disease (PD) patients and 20 healthy subjects. From all subjects, multiple types of sound recordings (26) are taken for this 20 MB set.
Persian Consonant Vowel Combination (PCVC) Speech Dataset - The Persian Consonant Vowel Combination (PCVC) Speech Dataset is a Modern Persian speech corpus for speech recognition and also speaker recognition. This dataset contains 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of vowels and consonants (138 samples for each speaker) with a length of 30000 data samples.
Speech Accent Archive - For various accent detection tasks.
Speech Commands Dataset - The dataset (1.4 GB) has 65,000 one-second long utterances of 30 short words, by thousands of different people, contributed by members of the public through the AIY website.
Spoken Commands dataset - A large database of free audio samples (10M words), a test bed for voice activity detection algorithms and for recognition of syllables (single-word commands). 3 speakers, 1,500 recordings (50 of each digit per speaker), English pronunciations. This is a really small set, about 10 MB in size.
Spoken Wikipedia Corpora - 38 GB in size, available both with and without audio.
Tatoeba - Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.
Ted-LIUM - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website (noncommercial).
TIMIT dataset - TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. It includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance (you have to pay for access).
VoxCeleb - VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. The data is mostly gender balanced (males comprise 55%). The celebrities span a diverse range of accents, professions, and ages. There is no overlap between the development and test sets. It’s an intriguing use case for isolating and identifying which superstar the voice belongs to.
VoxForge - VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines.
Zero Resource Speech Challenge - The ultimate goal of the Zero Resource Speech Challenge is to construct a system that learns an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only information available to a language learning infant. “Zero resource” refers to zero linguistic expertise (e.g., orthographic/linguistic transcriptions), not zero information besides audio (visual, limited human feedback, etc). The fact that 4-year-olds spontaneously learn a language without supervision from language experts shows that this goal is theoretically reachable.
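
Several entries above state an audio format (for example, TIMIT's 16-bit, 16 kHz waveform files). A minimal sketch of how one might check a downloaded WAV file against such a description, using only Python's standard `wave` module, follows. The helper names `describe_wav` and `make_test_tone` are hypothetical and not part of any dataset's tooling; the example builds a synthetic tone in memory so it runs without downloading anything. Files in other formats would need converting to WAV first.

```python
import io
import math
import struct
import wave


def describe_wav(source):
    """Return (sample_rate_hz, bits_per_sample, channels, duration_s) for a WAV file.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, bits, channels, duration


def make_test_tone(rate=16000, seconds=1.0, freq=440.0):
    """Build an in-memory 16-bit mono WAV (hypothetical stand-in for a dataset clip)."""
    buffer = io.BytesIO()
    n_frames = int(rate * seconds)
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * i / rate)))
            for i in range(n_frames)
        )
        wav.writeframes(frames)
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


info = describe_wav(make_test_tone())
print(info)  # (16000, 16, 1, 1.0)
```

To check a real clip, pass its path instead of the synthetic buffer, e.g. `describe_wav("clip.wav")`, where `clip.wav` is a placeholder file name.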

Audio events and music

AudioSet - An expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.
Bird audio detection challenge - This challenge contained new datasets (5.4 GB) collected in real live bio-acoustics monitoring projects, and an objective, standardized evaluation framework.
Environmental audio dataset - Audio data collection and manual data annotation are both tedious processes, and the lack of a proper development dataset limits fast development in environmental audio research.
Free Music Archive - FMA is a dataset for music analysis. 1000 GB in size.
Freesound dataset - many different sound events. https://annotator.freesound.org/ and https://annotator.freesound.org/fsd/explore/ - The AudioSet Ontology is a hierarchical collection of over 600 sound classes and we have filled them with 297,159 audio samples from Freesound. This process generated 678,511 candidate annotations that express the potential presence of sound sources in audio clips.
Karoldvl-ESC - The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.
Million Song Dataset - The Million Song Dataset is a freely-available collection of audio features and meta-data for a million contemporary popular music tracks. 280 GB in size.
Urban Sound Dataset - two datasets and a taxonomy for urban sound research.
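
Clip counts can be hard to compare with the hour-based figures quoted for the speech corpora. As a rough worked example, the AudioSet numbers above (2,084,320 clips of 10 seconds each) can be converted to hours. This is only an upper-bound estimate that assumes every clip runs the full 10 seconds; the function name `total_hours` is made up for illustration.

```python
def total_hours(num_clips, seconds_per_clip):
    """Convert a clip count and a fixed clip length into total hours of audio."""
    return num_clips * seconds_per_clip / 3600


# Figures taken from the AudioSet entry above.
audioset_hours = total_hours(2_084_320, 10)
print(f"AudioSet: about {audioset_hours:,.0f} hours")  # AudioSet: about 5,790 hours
```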

Learn more

Any feedback on this repository is greatly appreciated.

If you want to learn more about voice computing, check out the Voice Computing in Python book.
If you'd like to be mentored by someone on our team, check out the Innovation Fellows Program.
If you want to talk to me directly, please send me an email at js@neurolex.co.
