
A list of accessible speech corpora for ASR, TTS, and other Speech Technologies

MIT License

Open Speech Corpora

A list of open speech corpora for Speech Technology research and development.

This list favors corpora that are free (i.e. no $ cost) and truly open (i.e. released under some kind of Creative Commons license). Not every corpus listed meets both criteria, but all of them are accessible and usable for research and/or commercial use.

Feel free to propose additions to the list!

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Common Voice English | English | 1,118 hours (validated); 1,488 hours (total) | 51,072 speakers (reported: 13% female / 46% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice German | German | 483 hours (validated); 538 hours (total) | 8,460 speakers (reported: 9% female / 67% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice French | French | 350 hours (validated); 412 hours (total) | 8,164 speakers (reported: 12% female / 65% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Welsh | Welsh | 59 hours (validated); 77 hours (total) | 1,149 speakers (reported: 18% female / 29% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Breton | Breton | 5 hours (validated); 12 hours (total) | 133 speakers (reported: 2% female / 55% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Chuvash | Chuvash | <1 hour (validated); 2 hours (total) | 38 speakers (reported: 0% female / 47% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Turkish | Turkish | 13 hours (validated); 14 hours (total) | 461 speakers (reported: 8% female / 74% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Tatar | Tatar | 25 hours (validated); 27 hours (total) | 142 speakers (reported: 2% female / 81% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Kyrgyz | Kyrgyz | 11 hours (validated); 21 hours (total) | 119 speakers (reported: 44% female / 45% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Irish | Irish | 2 hours (validated); 4 hours (total) | 80 speakers (reported: 16% female / 59% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Kabyle | Kabyle | 262 hours (validated); 276 hours (total) | 693 speakers (reported: 22% female / 55% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Catalan | Catalan | 245 hours (validated); 295 hours (total) | 3,724 speakers (reported: 35% female / 43% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Taiwanese Mandarin | Taiwanese Mandarin | 42 hours (validated); 60 hours (total) | 1,108 speakers (reported: 26% female / 48% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Slovenian | Slovenian | 3 hours (validated); 6 hours (total) | 51 speakers (reported: 16% female / 80% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Italian | Italian | 85 hours (validated); 122 hours (total) | 4,292 speakers (reported: 18% female / 47% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Dutch | Dutch | 24 hours (validated); 33 hours (total) | 701 speakers (reported: 10% female / 66% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Hakha Chin | Hakha Chin | 2 hours (validated); 5 hours (total) | 290 speakers (reported: 20% female / 23% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Esperanto | Esperanto | 35 hours (validated); 41 hours (total) | 215 speakers (reported: 7% female / 70% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Estonian | Estonian | 10 hours (validated); 13 hours (total) | 230 speakers (reported: 38% female / 57% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Persian | Persian | 211 hours (validated); 255 hours (total) | 2,763 speakers (reported: 6% female / 78% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Basque | Basque | 65 hours (validated); 99 hours (total) | 638 speakers (reported: 23% female / 51% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Spanish | Spanish | 167 hours (validated); 221 hours (total) | 8,252 speakers (reported: 10% female / 55% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Mandarin | Mandarin (China) | 26 hours (validated); 31 hours (total) | 963 speakers (reported: 10% female / 64% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Mongolian | Mongolian | 9 hours (validated); 12 hours (total) | 296 speakers (reported: 25% female / 36% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Sakha | Sakha | 3 hours (validated); 6 hours (total) | 37 speakers (reported: 10% female / 54% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Dhivehi | Dhivehi | 6 hours (validated); 8 hours (total) | 101 speakers (reported: 64% female / 28% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Kinyarwanda | Kinyarwanda | <1 hour (validated); 17 hours (total) | 129 speakers (reported: 8% female / 41% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Swedish | Swedish | 5 hours (validated); 6 hours (total) | 99 speakers (reported: 8% female / 74% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Russian | Russian | 72 hours (validated); 76 hours (total) | 496 speakers (reported: 23% female / 71% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Indonesian | Indonesian | 3 hours (validated); 3 hours (total) | 56 speakers (reported: 4% female / 82% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Arabic | Arabic | 7 hours (validated); 12 hours (total) | 228 speakers (reported: 24% female / 48% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Tamil | Tamil | 3 hours (validated); 4 hours (total) | 91 speakers (reported: 10% female / 67% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Interlingua | Interlingua | 1 hour (validated); 3 hours (total) | 12 speakers (reported: 2% female / 94% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Portuguese | Portuguese | 27 hours (validated); 29 hours (total) | 354 speakers (reported: 2% female / 89% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Latvian | Latvian | 4 hours (validated); 6 hours (total) | 86 speakers (reported: 17% female / 64% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Japanese | Japanese | 3 hours (validated); 3 hours (total) | 52 speakers (reported: 0% female / 81% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Votic | Votic | <1 hour (validated); <1 hour (total) | 2 speakers (reported: 0% female / 0% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Abkhaz | Abkhaz | <1 hour (validated); <1 hour (total) | 3 speakers (reported: 2% female / 98% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Common Voice Chinese (Hong Kong) | Chinese (Hong Kong) | <1 hour (validated); <1 hour (total) | 15 speakers (reported: 24% female / 37% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
| LJ Speech Corpus | English | ~24 hours | one female | https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 | CC-0 |
| NST Danish ASR Database | Danish | 229,992 utterances | 616 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/ | CC-0 |
| NST Danish Dictation | Danish | 34,955 utterances | 151 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/ | CC-0 |
| NST Danish Speech Synthesis | Danish | 4,108 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/ | CC-0 |
| NST Swedish ASR Database | Swedish | 366,000 utterances | 1,000 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-16/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/ | CC-0 |
| NST Swedish Dictation | Swedish | 45,620 utterances | 195 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-17/ | CC-0 |
| NST Swedish Speech Synthesis | Swedish | 5,279 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/ | CC-0 |
| NST Norwegian ASR Database | Norwegian | 359,760 utterances | 980 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-13/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/ | CC-0 |
| NST Norwegian Dictation | Norwegian | 33,360 utterances | 144 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-14/ | CC-0 |
| NST Norwegian Speech Synthesis | Norwegian | 5,363 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-15/ | CC-0 |
| NB Tale – Speech Database for Norwegian | Norwegian | 7,600 utterances + ~12 hours | 380 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-31/ | CC-0 |
| Norwegian Parliamentary Speech Corpus (v0.1) | Norwegian | ~59 hours | 203 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/ | CC-0 |
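
The Common Voice rows above give gender as percentages of all speakers, so the two figures need not sum to 100. The remainder presumably covers speakers who did not report a gender (an assumption; the list does not say). A minimal Python sketch for computing that remainder, assuming the speaker cell keeps the `reported: X% female / Y% male` wording used above; the function name `unreported_share` is hypothetical:

```python
import re

# Matches the "reported: X% female / Y% male" part of a "# SPEAKERS" cell.
_REPORTED = re.compile(r"reported:\s*(\d+)%\s*female\s*/\s*(\d+)%\s*male")


def unreported_share(speakers_cell: str) -> int:
    """Return the percentage of speakers not covered by the reported split."""
    match = _REPORTED.search(speakers_cell)
    if match is None:
        raise ValueError(f"no reported gender split in: {speakers_cell!r}")
    female, male = (int(group) for group in match.groups())
    return 100 - female - male


if __name__ == "__main__":
    cell = "51,072 speakers (reported: 13% female / 46% male)"
    print(unreported_share(cell))  # 100 - 13 - 46 = 41
```
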

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ARU Speech Corpus | English (UK) | 720 utterances / speaker | 12 (6 female; 6 male) | http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip | CC-BY 3.0 |
| Althingi Parliamentary Speech Corpus | Icelandic | 542 hours and 25 minutes | 196 speakers | http://www.malfong.is/index.php?dlid=73&lang=en | CC-BY 4.0 |
| Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours | | http://www.malfong.is/index.php?dlid=8&lang=en | CC-BY 3.0 |
| Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers | http://www.malfong.is/index.php?dlid=5&lang=en | CC-BY 3.0 |
| The Malromur Corpus | Icelandic | 152 hours | 563 speakers | http://www.malfong.is/index.php?dlid=65&lang=en | CC-BY 4.0 |
| Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers | http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz | CC-BY 2.0 |
| African Speech Technology English-English Speech Corpus | English | ~21 hours | | https://repo.sadilar.org/handle/20.500.12185/283 | CC-BY 2.5 South Africa |
| African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | https://repo.sadilar.org/handle/20.500.12185/305 | CC-BY 2.5 South Africa |
| NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/280 | CC-BY 3.0 |
| NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/274 | CC-BY 3.0 |
| NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | https://repo.sadilar.org/handle/20.500.12185/272 | CC-BY 3.0 |
| NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/279 | CC-BY 3.0 |
| NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/275 | CC-BY 3.0 |
| NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/270 | CC-BY 3.0 |
| NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | https://repo.sadilar.org/handle/20.500.12185/278 | CC-BY 3.0 |
| NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/281 | CC-BY 3.0 |
| NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/271 | CC-BY 3.0 |
| NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | https://repo.sadilar.org/handle/20.500.12185/276 | CC-BY 3.0 |
| NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/277 | CC-BY 3.0 |
| Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | https://repo.sadilar.org/handle/20.500.12185/445 | CC-BY 3.0 |
| Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | https://repo.sadilar.org/handle/20.500.12185/448 | CC-BY 3.0 |
| Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | https://repo.sadilar.org/handle/20.500.12185/442 | CC-BY 3.0 |
| LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
| Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |
| Speech Commands | English | 17.8 hours | >1,000 speakers | https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html | CC-BY 4.0 |
| ParlamentParla | Catalan | 320 hours | | https://www.openslr.org/59/ | CC-BY 4.0 |
| SIWIS | French | ~10 hours | one female | http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip | CC-BY 4.0 |
| VCTK | English | 44 hours | 109 speakers | http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip | CC-BY 4.0 |
| LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male) | http://www.openslr.org/60/ | CC-BY 4.0 |
| Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours | | https://persyval-platform.univ-grenoble-alpes.fr/datasets/DS91 | CC-BY 4.0 |
| Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers | https://github.com/Helsinki-NLP/prosody | CC-BY 4.0 |
| Tuva Speech Database | Norwegian | 24 hours | 40 speakers | https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang= | CC-BY 4.0 |
| COERLL Kʼicheʼ corpus | Kʼicheʼ | 34 minutes | ? speakers | https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz | CC-BY 4.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Iban | Iban | 8 hours | | http://www.openslr.org/24/ https://github.com/sarahjuan/iban | CC-BY-SA 2.0 |
| Vystadial | English; Czech | 41 hours; 15 hours | | http://www.openslr.org/6/ | CC-BY-SA 3.0 US |
| Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | https://github.com/Jakobovski/free-spoken-digit-dataset | CC-BY-SA 4.0 |
| Google Javanese | Javanese | 296 hours | 1019 speakers | http://www.openslr.org/35/ | CC-BY-SA 4.0 |
| Google Nepali | Nepali | 165 hours | 527 speakers | http://www.openslr.org/54/ | CC-BY-SA 4.0 |
| Google Bengali | Bengali | 229 hours | 508 speakers | http://www.openslr.org/53/ | CC-BY-SA 4.0 |
| Google Sinhala | Sinhala | 224 hours | 478 speakers | http://www.openslr.org/52/ | CC-BY-SA 4.0 |
| Google Sundanese | Sundanese | 333 hours | 542 speakers | http://www.openslr.org/36/ | CC-BY-SA 4.0 |
| Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | https://nats.gitlab.io/swc/ | CC-BY-SA 4.0 |
| Chuvash TTS | Chuvash | 4 hours | 1 speaker | https://github.com/ftyers/Turkic_TTS | CC-BY-SA 4.0 |
| Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz | CC-BY-SA 4.0 |
| Malayalam Speech Corpus by SMC | Malayalam | 1:36 hours | 75 speakers (3 female, 12 male, 60 unidentified) | https://releases.smc.org.in/msc-reviewed-speech/ | CC-BY-SA 4.0 |
| Google Malayalam | Malayalam | 3.02 hours | 24 speakers | http://www.openslr.org/63/ | CC-BY-SA 4.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| IBM Recorded Debates v1 | English | 5 hours | 10 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
| IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| TV3Parla | Catalan | 240 hours | | http://laklak.eu/share/tv3_0.3.tar.gz | CC-BY-NC 4.0 |
| Russian Open STT Corpus | Russian | ~10,000 hours public, ~10,000 more upon request | | https://github.com/snakers4/open_stt/#links | CC-BY-NC 4.0 with some exceptions |
| Russian Open TTS Corpus | Russian | 145 hours | 3 males | https://github.com/snakers4/open_tts/#links | CC-BY-NC 4.0 with some exceptions |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CHiME-Home | English | 6.8 hours | | https://archive.org/details/chime-home | CC-BY-NC-SA 3.0 |
| Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours | | http://ota.ox.ac.uk/text/2563.zip | CC-BY-NC-SA 3.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | https://voice.mozilla.org/en/datasets | CC BY-NC 4.0 (some audio) / CC BY-NC-ND 3.0 (most audio) / CC BY 2.0 (all text) |
| TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | http://www.openslr.org/19/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | http://www.openslr.org/51/ | CC-BY-NC-ND 3.0 |
| Pansori TEDxKR | Korean | 3 hours | 41 speakers | http://www.openslr.org/58/ | CC-BY-NC-ND 4.0 |
| Primewords Mandarin | Mandarin | 100 hours | 296 speakers | http://www.openslr.org/47/ | CC-BY-NC-ND 4.0 |
| MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair | | https://ict.fbk.eu/must-c-release-v1-0/ | CC-BY-NC-ND 4.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| DiPCo | English | ~5 hours | 32 speakers (13 female; 19 male) | https://s3.amazonaws.com/dipco/DiPCo.tgz | CDLA-Permissive-1.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| VoxForge | English | ~120 hours | ~2966 speakers | http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets | GNU-GPL 3.0 |
| VoxForge | Russian | | | http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |
| VoxForge | German | | | http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
| Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | http://www.openslr.org/46/ | Apache 2.0 |
| African Accented French | French | 22 hours | 232 speakers | http://www.openslr.org/57/ | Apache 2.0 |
| THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) | http://www.openslr.org/18/ | Apache 2.0 |
| Living Audio Dataset - Dutch | Dutch | 57:49 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - English | English | 50:50 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - Irish | Irish | 61:56 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - Russian | Russian | 34:58 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC | MIT |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| M-AILABS German Corpus | German | 237 hours and 22 minutes | | http://www.caito.de/data/Training/stt_tts/de_DE.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Queen's English Corpus | Queen's English | 45 hours and 35 minutes | | http://www.caito.de/data/Training/stt_tts/en_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS US English Corpus | American English | 102 hours and 7 minutes | | http://www.caito.de/data/Training/stt_tts/en_US.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Spanish Corpus | Spanish Spanish | 108 hours and 34 minutes | | http://www.caito.de/data/Training/stt_tts/es_ES.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Italian Corpus | Italian | 127 hours and 40 minutes | | http://www.caito.de/data/Training/stt_tts/it_IT.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Ukrainian Corpus | Ukrainian | 87 hours and 8 minutes | | http://www.caito.de/data/Training/stt_tts/uk_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Russian Corpus | Russian | 46 hours and 47 minutes | | http://www.caito.de/data/Training/stt_tts/ru_RU.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS French-v0.9 Corpus | French | 190 hours and 30 minutes | | http://www.caito.de/data/Training/stt_tts/fr_FR.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Polish Corpus | Polish | 53 hours and 50 minutes | | http://www.caito.de/data/Training/stt_tts/pl_PL.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Fluent Speech Commands Corpus | English | 19 hours (30,043 utterances) | 97 speakers | http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz | Fluent Speech Commands Public License |
| CMU Wilderness | 700 Langs | Alignments distributed without audio or text; total: ~14,000 hours; per lang: ~20 hours | | https://github.com/festvox/datasets-CMU_Wilderness | Questionable Legality: https://live.bible.is/terms |
| CHiME-5 | English | 50 hours | 48 speakers | http://spandh.dcs.shef.ac.uk/chime_challenge/data.html | CHiME-5 License |
| FalaBrasil-LAPS-Constituicao | Brazilian-Portuguese | 9 hours | 1 speaker | https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." |
| FalaBrasil-LaPSMail | Brazilian-Portuguese | 1 hour | 25 speakers | https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." |
| FalaBrasil-LaPS Benchmark | Brazilian-Portuguese | 1 hour | 1 speaker | https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." |
| Fearless Steps Corpus | English | 19,000 hours (20 hours transcribed) | ~450 speakers | http://fearlesssteps.exploreapollo.org/ | |
| Microsoft Speech Corpus (Indian languages) | Telugu; Tamil; Gujarati | | | https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e | Non-Commercial Microsoft Speech Corpus (Indian Languages) License |
| Microsoft Speech Language Translation Corpus | English; Chinese; Japanese | | | https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187 | Non-Commercial Microsoft Research Data License Agreement |
| Hey Snips Corpus | English | 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances | 2215 speakers (positive & negative) and 4028 speakers (negative only) | https://research.snips.ai/datasets/keyword-spotting | Snips Data License |
| Snips SLU Corpus | English; French | 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances | English: 69 speakers; French: 30 speakers | https://research.snips.ai/datasets/spoken-language-understanding | Snips Data License |
| CMU Sphinx Group - AN4 | English | "an4_clstk" (~50 minutes); "an4test_clstk" (~6 minutes) | "an4_clstk": 21 female, 53 male; "an4test_clstk": 3 female, 7 male | http://www.speech.cs.cmu.edu/databases/an4/an4_raw.bigendian.tar.gz | AN4 |
| FT Speech | Danish | ~1,857 hours (1,017,244 utterances) | 434 speakers (176 female, 258 male) | https://ftspeech.dk | FT Speech License |
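
The LICENSE column is what separates research-only corpora from those that also allow commercial use. The Python sketch below is a rough first-pass filter over license strings as they are written in this list, not a legal reading. The function name `allows_commercial_use` and the prefix list are assumptions made for illustration. It treats any Creative Commons "NC" marker or the words "Non-Commercial" as ruling out commercial use, treats the well-known open licenses that appear in this list as permitting it, and returns `None` for custom licenses that have to be read by hand.

```python
import re
from typing import Optional

# Hypothetical prefix list: licenses appearing in this list that are commonly
# understood to permit commercial use. CC-BY-ND matches "CC-BY": it forbids
# derivative works, not commercial use.
_PERMISSIVE_PREFIXES = (
    "CC-0", "CC-BY", "CC BY", "CDLA-Permissive", "Apache", "MIT", "GNU-GPL",
)

# "NC" as a standalone token (e.g. CC-BY-NC-SA) or the words "non-commercial".
_NONCOMMERCIAL = re.compile(r"\bNC\b|non-?commercial", re.IGNORECASE)


def allows_commercial_use(license_cell: str) -> Optional[bool]:
    """Heuristic: True, False, or None when the license must be read by hand."""
    text = license_cell.strip()
    if _NONCOMMERCIAL.search(text):
        return False
    if text.startswith(_PERMISSIVE_PREFIXES):
        return True
    return None  # custom or missing license: no guess is made


if __name__ == "__main__":
    for cell in ("CC-0", "CC-BY-NC-ND 4.0", "Apache 2.0", "Snips Data License"):
        print(f"{cell!r}: {allows_commercial_use(cell)}")
```

The NC check runs before the prefix check so that a mixed entry such as the Tatoeba-Eng license is reported as non-commercial rather than permissive.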