collabora/WhisperSpeech

6. Gather more multi-lingual data

jpc opened this issue · 3 comments

jpc commented

Right now we are using a subset of Libri-Light, which is a very big (60k hours) dataset of audiobooks read by thousands of speakers. It is pretty good, but there is a lot of (probably more expressive and emotional) speech available in YouTube videos. For the final training run it would be great to have more varied data to improve the quality of the model.

Approximately 10,000 hours of Chinese audio recordings are available here: https://github.com/wenet-e2e/WenetSpeech

jpc commented

I think we need native speakers to ensure high-quality material and build the best global open-source TTS system.

I am thinking of setting up a common format and some docs to help people prepare, validate, and upload multilingual speech data to Hugging Face for inclusion in WhisperSpeech base model training.
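As a starting point for discussion, here is a minimal sketch of what a per-clip validation step for contributed data could look like. The JSONL schema (field names like `audio_path`, `duration_s`) is purely illustrative, not a finalized WhisperSpeech format:

```python
import json

# Hypothetical per-clip metadata schema for contributed speech data;
# the field names here are placeholders, not an agreed-upon format.
REQUIRED = {"audio_path", "text", "language", "speaker_id", "duration_s"}

def validate_record(line: str) -> list:
    """Return a list of problems found in one JSONL metadata line."""
    errors = []
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as e:
        return ["invalid JSON: %s" % e]
    missing = REQUIRED - rec.keys()
    if missing:
        errors.append("missing fields: %s" % sorted(missing))
    if "language" in rec and len(rec["language"]) != 2:
        errors.append("language should be a two-letter ISO 639-1 code")
    if "duration_s" in rec and not 0 < rec["duration_s"] <= 30:
        errors.append("duration should be 0-30 s (Whisper's context window)")
    return errors

good = ('{"audio_path": "clips/0001.wav", "text": "hello", '
        '"language": "ar", "speaker_id": "spk0001", "duration_s": 4.2}')
bad = '{"audio_path": "clips/0002.wav", "text": "hi"}'
print(validate_record(good))  # []
print(validate_record(bad))   # reports the missing fields
```

A script like this could run locally before upload, so contributors catch schema problems without waiting for a review round on Hugging Face.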

Hi

Native Arabic speaker here. Just ping me once you're ready.

Is this affiliated with Open Empathetic?