A list of open speech corpora for Speech Technology research and development.
This list favors corpora that are free (i.e. no monetary cost) and truly open (i.e. released under some kind of Creative Commons license). Not every corpus listed here meets both criteria, but all of them are accessible and usable for research and/or commercial use.
Feel free to propose additions to the list!
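
Because the list mixes licenses that allow commercial use with ones that do not, it can help to read the Creative Commons suffixes mechanically. The following is a minimal sketch, not part of any official tooling: the function name `license_flags` and the flag names are made up for illustration, and it only interprets the standard CC element codes (BY, NC, ND, SA) as they appear in the LICENSE column. Labels that are not Creative Commons (Apache, MIT, GNU-GPL) are deliberately left uninterpreted.

```python
import re
from typing import Optional


def license_flags(license_name: str) -> Optional[dict]:
    """Derive coarse usage flags from a Creative Commons license label.

    Hypothetical helper: returns None for labels that are not
    Creative Commons, since their terms are not encoded in the name.
    """
    # Split "CC-BY-NC 4.0" or "CC BY-NC 4.0" into upper-case tokens.
    tokens = set(re.split(r"[\s\-]+", license_name.strip().upper()))
    if "CC" not in tokens:
        return None
    return {
        "attribution_required": "BY" in tokens,  # BY = attribution
        "commercial_use": "NC" not in tokens,    # NC = non-commercial only
        "derivatives": "ND" not in tokens,       # ND = no derivatives
        "share_alike": "SA" in tokens,           # SA = share-alike
    }


if __name__ == "__main__":
    for label in ["CC-0", "CC-BY 4.0", "CC-BY-NC-SA 3.0", "Apache 2.0"]:
        print(label, license_flags(label))
```

This is only a reading aid for the labels; always check the license text shipped with each corpus before relying on it.
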
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
CommonVoice English | English | 582 hours (validated); 803 hours (total) | 33,541 speakers (reported: 10% female / 41% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice German | German | 140 hours (validated); 146 hours (total) | 2,249 speakers (reported: 5% female / 76% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice French | French | 74 hours (validated); 79 hours (total) | 1,697 speakers (reported: 7% female / 72% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Welsh | Welsh | 21 hours (validated); 22 hours (total) | 365 speakers (reported: 26% female / 43% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Breton | Breton | 2 hours (validated); 7 hours (total) | 82 speakers (reported: 2% female / 43% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Chuvash | Chuvash | <1 hour (validated); 2 hours (total) | 33 speakers (reported: 0% female / 46% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Turkish | Turkish | 5 hours (validated); 6 hours (total) | 203 speakers (reported: 7% female / 75% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Tatar | Tatar | 20 hours (validated); 20 hours (total) | 117 speakers (reported: 2% female / 80% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Kyrgyz | Kyrgyz | 5 hours (validated); 6 hours (total) | 63 speakers (reported: 6% female / 80% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Irish | Irish | 1 hour (validated); 1 hour (total) | 30 speakers (reported: 22% female / 57% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Kabyle | Kabyle | 92 hours (validated); 98 hours (total) | 382 speakers (reported: 17% female / 53% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Catalan | Catalan | 92 hours (validated); 98 hours (total) | 1,639 speakers (reported: 44% female / 38% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Chinese (Taiwan) | Mandarin (Taiwan) | 19 hours (validated); 28 hours (total) | 695 speakers (reported: 35% female / 38% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Slovenian | Slovenian | 1 hour (validated); 3 hours (total) | 18 speakers (reported: 17% female / 82% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Italian | Italian | 15 hours (validated); 19 hours (total) | 313 speakers (reported: 7% female / 67% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Dutch | Dutch | 12 hours (validated); 13 hours (total) | 373 speakers (reported: 2% female / 74% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Hakha Chin | Hakha Chin | 2 hours (validated); 4 hours (total) | 253 speakers (reported: 22% female / 26% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Esperanto | Esperanto | 4 hours (validated); 6 hours (total) | 53 speakers (reported: 10% female / 21% male) | https://voice.mozilla.org/en/datasets | CC-0 |
Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
LJ Speech Corpus | English | ~24 hours | one female | https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 | CC-0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
Althingi Parliamentary Speech Corpus | Icelandic | 542 hours and 25 minutes | 196 speakers | http://www.malfong.is/index.php?dlid=73&lang=en | CC-BY 4.0 |
Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours | | http://www.malfong.is/index.php?dlid=8&lang=en | CC-BY 3.0 |
Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers | http://www.malfong.is/index.php?dlid=5&lang=en | CC-BY 3.0 |
The Malromur Corpus | Icelandic | 152 hours | 563 speakers | http://www.malfong.is/index.php?dlid=65&lang=en | CC-BY 4.0 |
Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers | http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz | CC-BY 2.0 |
African Speech Technology English-English Speech Corpus | English | ~21 hours | | https://repo.sadilar.org/handle/20.500.12185/283 | CC-BY 2.5 South Africa |
African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | https://repo.sadilar.org/handle/20.500.12185/305 | CC-BY 2.5 South Africa |
NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/280 | CC-BY 3.0 |
NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/274 | CC-BY 3.0 |
NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | https://repo.sadilar.org/handle/20.500.12185/272 | CC-BY 3.0 |
NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/279 | CC-BY 3.0 |
NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/275 | CC-BY 3.0 |
NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/270 | CC-BY 3.0 |
NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | https://repo.sadilar.org/handle/20.500.12185/278 | CC-BY 3.0 |
NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/281 | CC-BY 3.0 |
NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/271 | CC-BY 3.0 |
NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | https://repo.sadilar.org/handle/20.500.12185/276 | CC-BY 3.0 |
NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/277 | CC-BY 3.0 |
Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | https://repo.sadilar.org/handle/20.500.12185/445 | CC-BY 3.0 |
Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | https://repo.sadilar.org/handle/20.500.12185/448 | CC-BY 3.0 |
Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | https://repo.sadilar.org/handle/20.500.12185/442 | CC-BY 3.0 |
LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |
Speech Commands | English | 17.8 hours | >1,000 speakers | https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html | CC-BY 4.0 |
ParlamentParla | Catalan | 320 hours | | https://www.openslr.org/59/ | CC-BY 4.0 |
SIWIS | French | ~10 hours | one female | http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip | CC-BY 4.0 |
VCTK | English | 44 hours | 109 speakers | http://datashare.is.ed.ac.uk/download/DS_10283_2651.zip | CC-BY 4.0 |
LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male) | http://www.openslr.org/60/ | CC-BY 4.0 |
Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours | | https://persyval-platform.univ-grenoble-alpes.fr/DS91/detaildataset | CC-BY 4.0 |
Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers | https://github.com/Helsinki-NLP/prosody | CC-BY 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
Iban | Iban | 8 hours | | http://www.openslr.org/24/ https://github.com/sarahjuan/iban | CC-BY-SA 2.0 |
Vystadial | English; Czech | 41 hours; 15 hours | | http://www.openslr.org/6/ | CC-BY-SA 3.0 US |
Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | https://github.com/Jakobovski/free-spoken-digit-dataset | CC-BY-SA 4.0 |
Google Javanese | Javanese | 296 hours | 1019 speakers | http://www.openslr.org/35/ | CC-BY-SA 4.0 |
Google Nepali | Nepali | 165 hours | 527 speakers | http://www.openslr.org/54/ | CC-BY-SA 4.0 |
Google Bengali | Bengali | 229 hours | 508 speakers | http://www.openslr.org/53/ | CC-BY-SA 4.0 |
Google Sinhala | Sinhala | 224 hours | 478 speakers | http://www.openslr.org/52/ | CC-BY-SA 4.0 |
Google Sundanese | Sundanese | 333 hours | 542 speakers | http://www.openslr.org/36/ | CC-BY-SA 4.0 |
Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | https://nats.gitlab.io/swc/ | CC-BY-SA 4.0 |
Chuvash TTS | Chuvash | 4 hours | 1 speaker | https://github.com/ftyers/Turkic_TTS | CC-BY-SA 4.0 |
Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz | CC-BY-SA 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
IBM Recorded Debates v1 | English | 5 hours | 10 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
TV3Parla | Catalan | 240 hours | | http://laklak.eu/share/tv3_0.3.tar.gz | CC-BY-NC 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
CHiME-Home | English | 6.8 hours | | https://archive.org/details/chime-home | CC-BY-NC-SA 3.0 |
Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours | | http://ota.ox.ac.uk/text/2563.zip | CC-BY-NC-SA 3.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | https://voice.mozilla.org/en/datasets | CC BY-NC 4.0 (some audio) / CC BY-NC-ND 3.0 (most audio) / CC BY 2.0 (all text) |
TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | http://www.openslr.org/19/ | CC-BY-NC-ND 3.0 |
TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | http://www.openslr.org/51/ | CC-BY-NC-ND 3.0 |
Pansori TEDxKR | Korean | 3 hours | 41 speakers | http://www.openslr.org/58/ | CC-BY-NC-ND 4.0 |
Primewords Mandarin | Mandarin | 100 hours | 296 speakers | http://www.openslr.org/47/ | CC-BY-NC-ND 4.0 |
MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair | | https://ict.fbk.eu/must-c-release-v1-0/ | CC-BY-NC-ND 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
VoxForge | English | ~120 hours | ~2966 speakers | http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets | GNU-GPL 3.0 |
VoxForge | Russian | | | http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |
VoxForge | German | | | http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | http://www.openslr.org/46/ | Apache 2.0 |
African Accented French | French | 22 hours | 232 speakers | http://www.openslr.org/57/ | Apache 2.0 |
THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) | http://www.openslr.org/18/ | Apache 2.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC | MIT |
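
For readers who want to filter this list programmatically, the rows can be split on the pipe character into the six columns used by every table (CORPUS, LANGUAGES, # HOURS, # SPEAKERS, DOWNLOAD, LICENSE). The sketch below is only an illustration: `parse_row` and the lower-case field names are hypothetical, and it assumes that no cell contains a literal `|`. It rejects rows that do not have exactly six cells, which is a cheap way to spot a row whose columns have shifted.

```python
FIELDS = ["corpus", "languages", "hours", "speakers", "download", "license"]


def parse_row(line: str) -> dict:
    """Split one pipe-delimited table row into a dict keyed by FIELDS."""
    text = line.strip()
    # Drop the single trailing pipe that closes each row, if present.
    if text.endswith("|"):
        text = text[:-1]
    cells = [cell.strip() for cell in text.split("|")]
    if len(cells) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} cells, got {len(cells)}: {line!r}")
    return dict(zip(FIELDS, cells))


if __name__ == "__main__":
    rows = [
        "Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |",
        "Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |",
    ]
    records = [parse_row(row) for row in rows]
    # Keep only the public-domain (CC-0) corpora.
    print([r["corpus"] for r in records if r["license"] == "CC-0"])
```

Empty cells (for example an unknown speaker count) come back as empty strings, so downstream code can tell "not reported" apart from a real value.
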