yl4579/StyleTTS

Problem about data processing

Zhongxu-Wang opened this issue · 1 comment

I found this line of code in meldataset.py and I am curious about what it does. Why does the waveform need to be padded with zeros on both sides?
wave = torch.cat([torch.zeros([5000]), wave, torch.zeros([5000])], axis=0)

yl4579 commented

This compensates for the silence tokens (token index 0) that we add at the start and end of the text for the text aligner. Those tokens let the aligner match leading and trailing silence in the audio, but some datasets, such as LibriTTS and LJSpeech, have no silence at the beginning or end of their recordings. We therefore pad the waveform with some silence at the beginning and the end of each sentence for robustness.
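
To make the pairing concrete, here is a minimal sketch in plain Python (lists instead of tensors, so it runs without torch). Only the 5000-sample pad length and token index 0 come from this thread. The helper names `pad_wave` and `pad_tokens`, the 24 kHz sample rate, and the hop size of 300 are assumptions for illustration and should be checked against your own config.

```python
# Illustrative sketch only. Apart from the 5000-sample pad and token index 0,
# every name and constant here is an assumption, not code from meldataset.py.
PAD_SAMPLES = 5000      # pad length from the quoted line
SILENCE_TOKEN = 0       # token index 0, as described in the reply
SAMPLE_RATE = 24000     # ASSUMED sample rate in Hz
HOP_LENGTH = 300        # ASSUMED mel-spectrogram hop size in samples


def pad_wave(wave, pad=PAD_SAMPLES):
    """Add `pad` zero samples (silence) on both sides of the waveform."""
    return [0.0] * pad + list(wave) + [0.0] * pad


def pad_tokens(tokens, silence=SILENCE_TOKEN):
    """Add one silence token at the start and end of the token sequence."""
    return [silence] + list(tokens) + [silence]


wave = [0.1] * 24000    # one second of dummy audio with no leading silence
tokens = [12, 7, 33]    # dummy token indices

padded_wave = pad_wave(wave)
padded_tokens = pad_tokens(tokens)

# How much silence each side gains under the assumed settings.
pad_ms = 1000 * PAD_SAMPLES / SAMPLE_RATE   # milliseconds per side
pad_frames = PAD_SAMPLES // HOP_LENGTH      # whole mel frames per side

print(len(wave), "->", len(padded_wave), "samples")
print(tokens, "->", padded_tokens)
print(f"{pad_ms:.1f} ms, {pad_frames} frames of silence per side")
```

Under those assumed settings, each silence token has roughly 208 ms (16 whole mel frames) of real silence to align to, even when the recording itself starts and ends abruptly.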