Shruti - Nepali Speech Synthesis

Speech Synthesis Component of Shruti - A Nepali Audiobook Platform


The text-to-speech system has two components:

  1. Melspectrogram Generation
     • Finetuning Tacotron2 (Shen et al.) for melspectrogram generation
  2. Vocoder Output
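
The two components above chain together: the first maps text to a melspectrogram, and the second turns that melspectrogram into audio samples. The sketch below only illustrates the data flow and array shapes between the stages. It is not this repository's code: `text_to_mel`, `mel_to_waveform`, and `synthesize` are hypothetical placeholders that return silence, and the values `N_MELS = 80` and `HOP_LENGTH = 256` are assumed (they are common Tacotron2 settings, not values confirmed by this project).

```python
from typing import List

# Assumed values, common in Tacotron2 setups; not confirmed for this project.
N_MELS = 80       # mel channels per spectrogram frame
HOP_LENGTH = 256  # audio samples represented by one frame

Mel = List[List[float]]  # shape: frames x N_MELS


def text_to_mel(text: str) -> Mel:
    """Hypothetical stand-in for stage 1 (the finetuned Tacotron2 model).

    A real model predicts the number of frames; here we use an arbitrary
    5 frames per character just to produce a correctly shaped output.
    """
    n_frames = 5 * len(text)
    return [[0.0] * N_MELS for _ in range(n_frames)]


def mel_to_waveform(mel: Mel) -> List[float]:
    """Hypothetical stand-in for stage 2 (the vocoder).

    A vocoder emits roughly HOP_LENGTH audio samples per mel frame.
    """
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> List[float]:
    """Run both stages: text -> melspectrogram -> waveform."""
    return mel_to_waveform(text_to_mel(text))


if __name__ == "__main__":
    audio = synthesize("abc")
    print(len(audio))  # 3 chars * 5 frames * 256 samples = 3840
```

Keeping the two stages behind separate functions means either one can be replaced (for example, trying a different vocoder) without touching the other, as long as the melspectrogram shape stays the same.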

Training Data

  • Pretrained Tacotron2 model trained on the LJSpeech Dataset (Ito and Johnson)
  • Finetuning Phase 1 - High quality TTS data for Nepali (Sodimana et al.)
  • Finetuning Phase 2 - Our own dataset: Nepali Text-to-Speech Data (Male and Female) (Khadka et al.)
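
The list above reads as a staged schedule, which can be sketched as a small plan in which each phase starts from the checkpoint produced by the previous one. This is an illustration only: that the phases are run strictly in sequence is an assumption drawn from the phase numbering, and the `Stage` class, the `plan` helper, and the `.pt` checkpoint names are hypothetical and do not come from this repository.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Stage:
    name: str     # hypothetical stage label
    dataset: str  # dataset named in the list above


STAGES = [
    Stage("pretrained", "LJSpeech"),
    Stage("finetune_phase_1", "High quality TTS data for Nepali"),
    Stage("finetune_phase_2", "Nepali Text-to-Speech Data (Male and Female)"),
]


def plan(stages: List[Stage]) -> List[Tuple[str, str, Optional[str]]]:
    """Return (stage, dataset, checkpoint to initialise from) for each stage.

    Assumes each stage is initialised from the previous stage's output.
    Checkpoint filenames are made up for illustration.
    """
    steps = []
    previous: Optional[str] = None
    for stage in stages:
        steps.append((stage.name, stage.dataset, previous))
        previous = f"{stage.name}.pt"
    return steps


if __name__ == "__main__":
    for name, dataset, init_from in plan(STAGES):
        print(f"{name}: train on {dataset!r}, init from {init_from}")
```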

Find the output samples here and the paper here.


If you use the code or dataset, please cite our work and all the references that we have cited.

  title={Nepali Text-to-Speech Synthesis using Tacotron2 for Melspectrogram Generation},
  author={Khadka, Supriya and Ranju, GC and Paudel, Prabin and Shah, Rahul and Joshi, Basanta},
  booktitle={Proc. 2nd Annual Meeting of the ELRA/ISCA SIG on Under-resourced Languages (SIGUL 2023)},
  pages={73--77},
  year={2023}
}