A list of accessible speech corpora for ASR.

Open ASR Corpora

A list of open corpora for Automatic Speech Recognition research and development.

This list favors corpora that are free (i.e. no monetary cost) and truly open (i.e. released under some kind of Creative Commons license). Not every corpus listed meets both criteria, but all of them are accessible and usable for research and/or commercial use.
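
Whether a given corpus is usable commercially depends on the license column in the tables below. For the Creative Commons entries, the restriction codes in the license name carry that information: NC means non-commercial only, ND means no derivatives, and SA means share-alike. The following is a minimal sketch of reading those codes from a license string as it is written in this list. The names `LicenseTerms` and `cc_terms` are hypothetical helpers introduced for illustration and are not part of any library. Non-CC licenses in the list (GNU-GPL, Apache, MIT, and the custom data licenses) are deliberately rejected here and need to be read individually.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    """Hypothetical summary of what a Creative Commons license name permits."""
    commercial_use: bool
    derivatives: bool
    share_alike: bool


def cc_terms(license_name: str) -> LicenseTerms:
    """Read the NC / ND / SA codes out of a Creative Commons license name."""
    # Normalize "CC BY-NC 4.0" and "CC-BY-NC 4.0" to the same token list.
    tokens = license_name.upper().replace("-", " ").split()
    if not tokens or tokens[0] != "CC":
        raise ValueError(f"not a Creative Commons license name: {license_name!r}")
    return LicenseTerms(
        commercial_use="NC" not in tokens,  # NC = non-commercial only
        derivatives="ND" not in tokens,     # ND = no derivatives
        share_alike="SA" in tokens,         # SA = share-alike
    )


if __name__ == "__main__":
    for name in ("CC-0", "CC-BY 4.0", "CC-BY-SA 4.0", "CC-BY-NC-ND 3.0"):
        print(name, cc_terms(name))
```

This only interprets the license name. It is not legal advice, and the full license text linked from each corpus remains the authority on what is allowed.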

Feel free to propose additions to the list!

CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
CommonVoice English English 582 hours (validated); 803 hours (total) 33,541 speakers (reported: 10% female / 41% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice German German 140 hours (validated); 146 hours (total) 2,249 speakers (reported: 5% female / 76% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice French French 74 hours (validated); 79 hours (total) 1,697 speakers (reported: 7% female / 72% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Welsh Welsh 21 hours (validated); 22 hours (total) 365 speakers (reported: 26% female / 43% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Breton Breton 2 hours (validated); 7 hours (total) 82 speakers (reported: 2% female / 43% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Chuvash Chuvash <1 hour (validated); 2 hours (total) 33 speakers (reported: 0% female / 46% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Turkish Turkish 5 hours (validated); 6 hours (total) 203 speakers (reported: 7% female / 75% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Tatar Tatar 20 hours (validated); 20 hours (total) 117 speakers (reported: 2% female / 80% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Kyrgyz Kyrgyz 5 hours (validated); 6 hours (total) 63 speakers (reported: 6% female / 80% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Irish Irish 1 hour (validated); 1 hour (total) 30 speakers (reported: 22% female / 57% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Kabyle Kabyle 92 hours (validated); 98 hours (total) 382 speakers (reported: 17% female / 53% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Catalan Catalan 92 hours (validated); 98 hours (total) 1,639 speakers (reported: 44% female / 38% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Chinese (Taiwan) Mandarin (Taiwan) 19 hours (validated); 28 hours (total) 695 speakers (reported: 35% female / 38% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Slovenian Slovenian 1 hour (validated); 3 hours (total) 18 speakers (reported: 17% female / 82% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Italian Italian 15 hours (validated); 19 hours (total) 313 speakers (reported: 7% female / 67% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Dutch Dutch 12 hours (validated); 13 hours (total) 373 speakers (reported: 2% female / 74% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Hakha Chin Hakha Chin 2 hours (validated); 4 hours (total) 253 speakers (reported: 22% female / 26% male) https://voice.mozilla.org/en/datasets CC-0
CommonVoice Esperanto Esperanto 4 hours (validated); 6 hours (total) 53 speakers (reported: 10% female / 21% male) https://voice.mozilla.org/en/datasets CC-0
Yesno Hebrew 6 mins one male http://www.openslr.org/1/ CC-0
LJ Speech Corpus English ~24 hours one female https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 CC-0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Althingi Parliamentary Speech Corpus Icelandic 542 hours and 25 minutes 196 speakers http://www.malfong.is/index.php?dlid=73&lang=en CC-BY 4.0
Alþingisumræður Parliamentary Speech Corpus Icelandic ~21 hours http://www.malfong.is/index.php?dlid=8&lang=en CC-BY 3.0
Hjal Corpus Icelandic ~41,000 recordings 883 speakers http://www.malfong.is/index.php?dlid=5&lang=en CC-BY 3.0
The Malromur Corpus Icelandic 152 hours 563 speakers http://www.malfong.is/index.php?dlid=65&lang=en CC-BY 4.0
Telecooperation German Corpus for Kinect German ~35 hours ~180 speakers http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz CC-BY 2.0
African Speech Technology English-English Speech Corpus English ~21 hours https://repo.sadilar.org/handle/20.500.12185/283 CC-BY 2.5 South Africa
African Speech Technology isiXhosa Speech Corpus isiXhosa ~26 hours https://repo.sadilar.org/handle/20.500.12185/305 CC-BY 2.5 South Africa
NCHLT Afrikaans Afrikaans 56 hours 210 speakers (98 female / 112 male) https://repo.sadilar.org/handle/20.500.12185/280 CC-BY 3.0
NCHLT English English 56 hours 210 speakers (100 female / 110 male) https://repo.sadilar.org/handle/20.500.12185/274 CC-BY 3.0
NCHLT isiNdebele isiNdebele 56 hours 148 speakers (78 female / 70 male) https://repo.sadilar.org/handle/20.500.12185/272 CC-BY 3.0
NCHLT isiXhosa isiXhosa 56 hours 209 speakers (106 female / 103 male) https://repo.sadilar.org/handle/20.500.12185/279 CC-BY 3.0
NCHLT isiZulu isiZulu 56 hours 210 speakers (98 female / 112 male) https://repo.sadilar.org/handle/20.500.12185/275 CC-BY 3.0
NCHLT Sepedi Sepedi 56 hours 210 speakers (100 female / 110 male) https://repo.sadilar.org/handle/20.500.12185/270 CC-BY 3.0
NCHLT Sesotho Sesotho 56 hours 210 speakers (113 female / 97 male) https://repo.sadilar.org/handle/20.500.12185/278 CC-BY 3.0
NCHLT Setswana Setswana 56 hours 210 speakers (109 female / 101 male) https://repo.sadilar.org/handle/20.500.12185/281 CC-BY 3.0
NCHLT Siswati Siswati 56 hours 197 speakers (96 female / 101 male) https://repo.sadilar.org/handle/20.500.12185/271 CC-BY 3.0
NCHLT Tshivenda Tshivenda 56 hours 208 speakers (83 female / 125 male) https://repo.sadilar.org/handle/20.500.12185/276 CC-BY 3.0
NCHLT Xitsonga Xitsonga 56 hours 198 speakers (95 female / 103 male) https://repo.sadilar.org/handle/20.500.12185/277 CC-BY 3.0
Lwazi II Cross-lingual Proper Name Corpus Afrikaans; English; isiZulu; Sesotho 2 hours 5 mins 20 speakers https://repo.sadilar.org/handle/20.500.12185/445 CC-BY 3.0
Lwazi II Proper Name Call Routing Telephone Corpus English 2 hours 7 mins https://repo.sadilar.org/handle/20.500.12185/448 CC-BY 3.0
Lwazi II Afrikaans Trajectory Tracking Corpus Afrikaans 4 hours one male https://repo.sadilar.org/handle/20.500.12185/442 CC-BY 3.0
LibriSpeech English ~1000 hours 2484 speakers (1201 female / 1283 male) http://www.openslr.org/12/ CC-BY 4.0
Zeroth-Korean Korean 52.8 hours 115 speakers http://www.openslr.org/40/ CC-BY 4.0
Speech Commands English 17.8 hours >1,000 speakers https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html CC-BY 4.0
ParlamentParla Catalan 320 hours https://www.openslr.org/59/ CC-BY 4.0
SIWIS French ~10 hours one female http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip CC-BY 4.0
VCTK English 44 hours 109 speakers http://datashare.is.ed.ac.uk/download/DS_10283_2651.zip CC-BY 4.0
LibriTTS English 586 hours 2,456 speakers (1,185 female / 1,271 male) http://www.openslr.org/60/ CC-BY 4.0
Augmented LibriSpeech Audio (English); Text (English, French) 236 hours https://persyval-platform.univ-grenoble-alpes.fr/DS91/detaildataset CC-BY 4.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Iban Iban 8 hours http://www.openslr.org/24/ https://github.com/sarahjuan/iban CC-BY-SA 2.0
Vystadial English; Czech 41 hours; 15 hours http://www.openslr.org/6/ CC-BY-SA 3.0 US
Free Spoken Digit Dataset English 2,000 isolated digits 4 speakers https://github.com/Jakobovski/free-spoken-digit-dataset CC-BY-SA 4.0
Google Javanese Javanese 296 hours 1019 speakers http://www.openslr.org/35/ CC-BY-SA 4.0
Google Nepali Nepali 165 hours 527 speakers http://www.openslr.org/54/ CC-BY-SA 4.0
Google Bengali Bengali 229 hours 508 speakers http://www.openslr.org/53/ CC-BY-SA 4.0
Google Sinhala Sinhala 224 hours 478 speakers http://www.openslr.org/52/ CC-BY-SA 4.0
Google Sundanese Sundanese 333 hours 542 speakers http://www.openslr.org/36/ CC-BY-SA 4.0
Spoken Wikipedia Corpus (SWC-2017) English; German; Dutch 182 hours; 249 hours; 79 hours 395 speakers; 339 speakers; 145 speakers https://nats.gitlab.io/swc/ CC-BY-SA 4.0
Chuvash TTS Chuvash 4 hours 1 speaker https://github.com/ftyers/Turkic_TTS CC-BY-SA 4.0
Forschergeist German 2 hours 2 speakers (1 female; 1 male) female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz CC-BY-SA 4.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
IBM Recorded Debates v1 English 5 hours 10 speakers https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis CC-BY-ND
IBM Recorded Debates v2 English ~14 hours 14 speakers https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis CC-BY-ND
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
TV3Parla Catalan 240 hours http://laklak.eu/share/tv3_0.3.tar.gz CC-BY-NC 4.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
CHiME-Home English 6.8 hours https://archive.org/details/chime-home CC-BY-NC-SA 3.0
Cameroon Pidgin English Corpus Cameroon Pidgin English ~17 hours http://ota.ox.ac.uk/text/2563.zip CC-BY-NC-SA 3.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
Tatoeba-Eng English ~250 hours (rough estimate) 6 speakers https://voice.mozilla.org/en/datasets CC BY-NC 4.0 (some audio) / CC BY-NC-ND 3.0 (most audio) / CC BY 2.0 (all text)
TED-LIUM English 118 hours 685 speakers (36h female / 81h male) http://www.openslr.org/7/ CC-BY-NC-ND 3.0
TED-LIUM-2 English 207 hours 1242 speakers (66h female / 141h male) http://www.openslr.org/19/ CC-BY-NC-ND 3.0
TED-LIUM-3 English 452 hours 2028 speakers (134h female / 316h male) http://www.openslr.org/51/ CC-BY-NC-ND 3.0
Pansori TEDxKR Korean 3 hours 41 speakers http://www.openslr.org/58/ CC-BY-NC-ND 4.0
Primewords Mandarin Mandarin 100 hours 296 speakers http://www.openslr.org/47/ CC-BY-NC-ND 4.0
MuST-C v1.0 Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair https://ict.fbk.eu/must-c-release-v1-0/ CC-BY-NC-ND 4.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
VoxForge English ~120 hours ~2966 speakers http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets GNU-GPL 3.0
VoxForge Russian http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/ GNU-GPL 3.0
VoxForge German http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ GNU-GPL 3.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
AISHELL-1 Mandarin 170 hours 400 speakers http://www.openslr.org/33/ Apache 2.0
Tunisian_MSA Modern Standard Arabic (Tunisia) 11.2 hours 118 speakers http://www.openslr.org/46/ Apache 2.0
African Accented French French 22 hours 232 speakers http://www.openslr.org/57/ Apache 2.0
THCHS-30 Mandarin Chinese 33.57 hours (13,389 utterances) 40 speakers (31 female; 9 male) http://www.openslr.org/18/ Apache 2.0
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
ALFFA Amharic; Hausa (paid); Swahili; Wolof http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC MIT
CORPUS LANGUAGES # HOURS # SPEAKERS DOWNLOAD LICENSE
CMU Wilderness 700 Langs Alignments distributed without audio or text total: ~14,000 hours; per lang: ~20 hours https://github.com/festvox/datasets-CMU_Wilderness Questionable Legality: https://live.bible.is/terms
CHiME-5 English 50 hours 48 speakers http://spandh.dcs.shef.ac.uk/chime_challenge/data.html CHiME-5 License
FalaBrasil-LAPS-Constituicao Brazilian-Portuguese 9 hours 1 speaker https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT "Transcribed audio datasets and normalized text datasets (no punctuation, numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available."
FalaBrasil-LaPSMail Brazilian-Portuguese 1 hour 25 speakers https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb "Transcribed audio datasets and normalized text datasets (no punctuation, numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available."
FalaBrasil-LaPS Benchmark Brazilian-Portuguese 1 hour 1 speaker https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo "Transcribed audio datasets and normalized text datasets (no punctuation, numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available."
Fearless Steps Corpus English 19,000 hours (20 hours transcribed) ~450 speakers http://fearlesssteps.exploreapollo.org/
Microsoft Speech Corpus (Indian languages) Telugu; Tamil; Gujarati https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e Non-Commercial Microsoft Speech Corpus (Indian Languages) License
Microsoft Speech Language Translation Corpus English; Chinese; Japanese https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187 Non-Commercial Microsoft Research Data License Agreement
Hey Snips Corpus English 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances 2215 speakers (positive & negative) and 4028 speakers (negative only) https://research.snips.ai/datasets/keyword-spotting Snips Data License
Snips SLU Corpus English; French 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances English: 69 speakers; French: 30 speakers https://research.snips.ai/datasets/spoken-language-understanding Snips Data License
M-AILABS German Corpus German 237 hours and 22 minutes http://www.caito.de/data/Training/stt_tts/de_DE.tgz M-AILABS LICENSE
M-AILABS Queen's English Corpus Queen's English 45 hours and 35 minutes http://www.caito.de/data/Training/stt_tts/en_UK.tgz M-AILABS LICENSE
M-AILABS US English Corpus American English 102 hours and 7 minutes http://www.caito.de/data/Training/stt_tts/en_US.tgz M-AILABS LICENSE
M-AILABS Spanish Corpus Spanish Spanish 108 hours and 34 minutes http://www.caito.de/data/Training/stt_tts/es_ES.tgz M-AILABS LICENSE
M-AILABS Italian Corpus Italian 127 hours and 40 minutes http://www.caito.de/data/Training/stt_tts/it_IT.tgz M-AILABS LICENSE
M-AILABS Ukrainian Corpus Ukrainian 87 hours and 8 minutes http://www.caito.de/data/Training/stt_tts/uk_UK.tgz M-AILABS LICENSE
M-AILABS Russian Corpus Russian 46 hours and 47 minutes http://www.caito.de/data/Training/stt_tts/ru_RU.tgz M-AILABS LICENSE
M-AILABS French-v0.9 Corpus French 190 hours and 30 minutes http://www.caito.de/data/Training/stt_tts/fr_FR.tgz M-AILABS LICENSE
M-AILABS Polish Corpus Polish 53 hours and 50 minutes http://www.caito.de/data/Training/stt_tts/pl_PL.tgz M-AILABS LICENSE
Fluent Speech Commands Corpus English 19 hours (30,043 utterances) 97 speakers http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz Fluent Speech Commands Public License