collabora/WhisperSpeech

6. Gather more multi-lingual data

jpc opened this issue · 3 comments

jpc commented

Right now we are using a subset of Libri-Light, which is a very big (60k hours) dataset of audiobooks read by thousands of speakers. It is pretty good, but there is a lot of (probably more expressive and emotional) speech available in YouTube videos. For the final training run it would be great to have more varied data to improve the quality of the model.

Approximately 10,000 hours of Chinese audio recordings are available here: https://github.com/wenet-e2e/WenetSpeech

jpc commented

I think we need native speakers to ensure high-quality material and build the best global open-source TTS system.

I am thinking of setting up a common format and some docs to help people prepare, validate, and upload multilingual speech data to Hugging Face for inclusion in WhisperSpeech base model training.
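As a starting point for discussion, here is a minimal sketch of what a per-clip validation step for contributed data could look like. The JSONL schema (field names like `audio_path`, `duration_s`) is purely illustrative, not a finalized WhisperSpeech format:

```python
import json

# Hypothetical per-clip metadata schema for contributed speech data;
# the field names here are placeholders, not an agreed-upon format.
REQUIRED = {"audio_path", "text", "language", "speaker_id", "duration_s"}

def validate_record(line: str) -> list:
    """Return a list of problems found in one JSONL metadata line."""
    errors = []
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as e:
        return ["invalid JSON: %s" % e]
    missing = REQUIRED - rec.keys()
    if missing:
        errors.append("missing fields: %s" % sorted(missing))
    if "language" in rec and len(rec["language"]) != 2:
        errors.append("language should be a two-letter ISO 639-1 code")
    if "duration_s" in rec and not 0 < rec["duration_s"] <= 30:
        errors.append("duration should be 0-30 s (Whisper's context window)")
    return errors

good = ('{"audio_path": "clips/0001.wav", "text": "hello", '
        '"language": "ar", "speaker_id": "spk0001", "duration_s": 4.2}')
bad = '{"audio_path": "clips/0002.wav", "text": "hi"}'
print(validate_record(good))  # []
print(validate_record(bad))   # reports the missing fields
```

A script like this could run locally before upload, so contributors catch schema problems without waiting for a review round on Hugging Face.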

Hi

Native Arabic speaker here. Just ping me once you're ready.

Is this affiliated with Open Empathetic?