
Master's Thesis: Towards Cross-lingual Voice Adaptation for Conversational Speech



Voice adaptation and conversion have gained impetus in the personification of speech-enabled systems, movie dubbing, lecture dubbing, singing voice transformation, and voice adaptation for patients with speech disorders. The reach of mobile phones into interior regions of India and the abundance of online educational content have created a demand for lectures in regional languages. While transcreating lecture videos into several languages is itself a tall order, an attendant problem is producing the transcreated video in the original speaker's voice: the speech synthesized in the target language must match the source speaker's voice. This is difficult even for read speech and becomes considerably harder for conversational speech. We examine classroom lectures with the objective of dubbing them from English into various Indian languages. The task is challenging for two main reasons. First, classroom lectures are essentially conversational, with fluctuations in speaking rate and disfluencies arising from typical speaker mannerisms. Second, we attempt cross-lingual voice transformation from English to Indian languages (e.g., Hindi, Kannada), which are phonotactically very different.

Most speech synthesis and voice conversion systems are trained on read speech, which is rehearsed, unlike conversational speech, which is spontaneous. We analyze why Text-to-Speech (TTS) systems that produce highly intelligible and robust audio for read speech fail to model conversational speech. We compare read speech and conversational speech with respect to pitch variation, syllable-rate variation, and signal-to-noise ratio (SNR) and identify the differences. Due to the lack of a multi-speaker conversational dataset, we create our own dataset for this analysis. Since the lecture transcriptions are generated by an Automatic Speech Recognition (ASR) model and manual curation is cumbersome, we devise data-pruning techniques to curate the data and use it to train a TTS model.
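To make the read-versus-conversational comparison concrete, the sketch below computes the kind of per-utterance statistics involved: F0 (pitch) variation, a crude energy-peak proxy for syllable rate, and a rough frame-energy SNR estimate. It assumes librosa, numpy, and scipy and is an illustration only; the analysis pipeline used in the thesis is not reproduced here.

```python
# Sketch: per-utterance statistics for comparing read vs conversational speech.
# Assumes librosa/numpy/scipy; the thesis analysis pipeline may differ.
import numpy as np
import librosa
from scipy.signal import find_peaks

def utterance_stats(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)

    # Pitch (F0) track via pYIN; keep only voiced, non-NaN frames.
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = f0[voiced & ~np.isnan(f0)]
    pitch_mean, pitch_std = float(np.mean(f0)), float(np.std(f0))

    # Crude syllable-rate proxy: peaks of a smoothed RMS energy envelope.
    rms = librosa.feature.rms(y=y, frame_length=512, hop_length=160)[0]
    window = np.hanning(9) / np.hanning(9).sum()
    envelope = np.convolve(rms, window, mode="same")
    peaks, _ = find_peaks(envelope, height=envelope.mean())
    syllable_rate = len(peaks) / (len(y) / sr)

    # Crude SNR: spread between the loudest and quietest frames, in dB.
    frame_db = librosa.amplitude_to_db(rms + 1e-10)
    snr_db = float(np.percentile(frame_db, 95) - np.percentile(frame_db, 5))

    return {"pitch_mean_hz": pitch_mean, "pitch_std_hz": pitch_std,
            "syllable_rate_hz": syllable_rate, "snr_db": snr_db}
```

Aggregating these statistics over read-speech and lecture utterances gives a quantitative picture of how much more variable conversational speech is in pitch and tempo, and how much noisier classroom recordings tend to be.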

Further, to achieve the objective of dubbing lectures from English into Indian languages, a bilingual (Indian language + English) text-to-speech model trained on read speech is adapted to the required speaker's voice using a minimally transcribed lecture recording. The novelty of this work lies in adapting read-speech models with conversational speech data to generate the target speaker's voice. The ASR-generated transcriptions of the adaptation recording are manually curated to maintain accurate text-audio correspondence. Two frameworks are used for adaptation: HTS (HMM-based speech synthesis system), a statistical parametric model, and an End-to-End (E2E) neural network-based model. In the E2E framework, x-vectors are used as speaker embeddings to strengthen speaker characteristics. The analysis and findings pave the way for further exploration of conversational TTS, cross-lingual voice adaptation, and voice conversion in low-resource scenarios.
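As an illustration of the speaker-embedding step in the E2E framework, the sketch below extracts an x-vector style embedding from a lecture clip using SpeechBrain's pretrained VoxCeleb x-vector model. The specific extractor (speechbrain/spkrec-xvect-voxceleb) is an assumption made for the example; the thesis setup may use a differently trained x-vector network.

```python
# Sketch: extract an x-vector style speaker embedding for the target lecturer,
# to condition a multi-speaker E2E TTS model on the speaker's identity.
# Uses SpeechBrain's pretrained VoxCeleb x-vector model as an illustration only.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_xvect",
)

def speaker_embedding(wav_path):
    signal, sr = torchaudio.load(wav_path)
    if sr != 16000:  # the pretrained model expects 16 kHz audio
        signal = torchaudio.functional.resample(signal, sr, 16000)
    with torch.no_grad():
        emb = encoder.encode_batch(signal)  # shape (1, 1, 512)
    return emb.squeeze()                    # 512-dim x-vector

# In practice, embeddings from several lecture clips can be averaged
# to obtain a more stable representation of the lecturer's voice.
```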