Datasets and scripts for basic natural language and speech processing.
This is not an official Google product.
Directory | Language |
---|---|
af | Afrikaans |
bn | Bengali / Bangla |
hi_ur | Hindi & Urdu |
is | Icelandic |
jv | Javanese |
km | Khmer |
lo | Lao |
my | Burmese / Myanmar |
ne | Nepali |
si | Sinhala |
su | Sundanese |
xh | Xhosa |
zu | Zulu |
This repository includes a few tools for working with the natural language datasets. The tools are written in C++ and Python and are built with Bazel. To compile and use them, install a recent version of Bazel (release 0.4.5 or later is required).
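Before building, it can help to confirm that the installed Bazel meets the 0.4.5 minimum. The following is a minimal sketch, not part of this repository: the helper names (`parse_version`, `installed_bazel_version`, `meets_minimum`) are hypothetical, and it assumes that `bazel version` prints a line beginning with `Build label:` followed by the release number.

```python
import re
import subprocess

# Minimum Bazel release stated in this README.
MINIMUM_BAZEL = (0, 4, 5)


def parse_version(text):
    """Return the first X.Y.Z version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_bazel_version():
    """Query the local `bazel` binary; return None if it is missing or fails."""
    try:
        result = subprocess.run(
            ["bazel", "version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    # Assumption: the release appears on a line starting with "Build label:".
    for line in result.stdout.splitlines():
        if line.startswith("Build label:"):
            return parse_version(line)
    return None


def meets_minimum(version, minimum=MINIMUM_BAZEL):
    """Compare versions as integer tuples so that 0.10.0 ranks above 0.4.5."""
    return version is not None and version >= minimum
```

Calling `meets_minimum(installed_bazel_version())` returns `False` when Bazel is absent or older than the minimum. Versions are compared as integer tuples rather than strings because a string comparison would rank "0.10.0" below "0.4.5".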
Resource | Link |
---|---|
Sinhala TTS recordings (~3K) | https://www.openslr.org/30/ |
TTS recordings for four South African languages (af, st, tn, xh) | https://www.openslr.org/32/ |
Large Javanese ASR training data set (~185K) | https://www.openslr.org/35/ |
Large Sundanese ASR training data set (~220K) | https://www.openslr.org/36/ |
High-quality TTS data for Bengali languages | https://www.openslr.org/37/ |
High-quality TTS data for Javanese | https://www.openslr.org/41/ |
High-quality TTS data for Khmer | https://www.openslr.org/42/ |
High-quality TTS data for Nepali | https://www.openslr.org/43/ |
High-quality TTS data for Sundanese | https://www.openslr.org/44/ |
Large Sinhala ASR training data set | https://www.openslr.org/52/ |
Large Bengali ASR training data set | https://www.openslr.org/53/ |
Large Nepali ASR training data set | https://www.openslr.org/54/ |
Crowdsourced high-quality Argentinian Spanish speech data set | https://www.openslr.org/61/ |
Crowdsourced high-quality Malayalam multi-speaker speech data set | https://www.openslr.org/63/ |
Crowdsourced high-quality Marathi multi-speaker speech data set | https://www.openslr.org/64/ |
Crowdsourced high-quality Tamil multi-speaker speech data set | https://www.openslr.org/65/ |
Crowdsourced high-quality Telugu multi-speaker speech data set | https://www.openslr.org/66/ |
- SLTU 2016 Tutorial: https://sites.google.com/site/sltututorial/overview
- Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala, and Sundanese TTS Systems
- Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech
- Building Statistical Parametric Multi-speaker Synthesis for Bangladeshi Bangla
Unless otherwise noted, all original files are licensed under an Apache License, Version 2.0.
Where specifically noted, some datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
The directory third_party/ contains third-party works, which we are including under the respective licenses of the upstream projects. See third_party/README.md for further details.