How to extract alignment from tacotron2?
CanKorkut opened this issue · 6 comments
Hi,
I want to try FastSpeech on a different dataset, so could you share how to extract the alignments from Tacotron 2?
I tried the code below, but I get bad synthesis results when running inference on long sentences.
import numpy as np

_, _, _, alignments = model.inference(sequence)
d = alignments.float().data.cpu().numpy()[0].T   # [encoder_steps, decoder_steps]
x = np.zeros(d.shape[0])
for i, y in enumerate(d):
    x[i] = y.sum()                               # soft duration: total attention mass per input symbol
np.save("path_to_save_folder" + name + ".npy", x.astype(np.dtype('i4')))
Thank you.
What are alignments used for, after all? The Tacotron 2 paper does not mention them.
I found this in the FastSpeech 2 paper:
The training of FastSpeech relies on an autoregressive teacher model to provide 1) the duration of each phoneme to train a duration predictor, and 2) the generated mel-spectrograms for knowledge distillation. While these designs in FastSpeech ease the learning of the one-to-many mapping problem in TTS, they also bring several disadvantages: 1) the two-stage teacher-student distillation pipeline is complicated; 2) the duration extracted from the attention map of the teacher model is not accurate enough, and the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of which limit the voice quality and prosody.
This makes it clear that you need another trained (teacher) model to use FastSpeech with a custom dataset, which is not ideal.
In other words, the alignments are such a big problem because the training depends on them: no alignments, no training. The FastSpeech paper is worth inspecting to understand how this is done in principle, but if you want out-of-the-box training on your own data, it is not the best choice.
You may find that an alignments.py file was present in this project before but was removed (commit id: e11b60d); no commit message was set to explain why.
Thank you, I found alignments.py in a previous commit and tried it. The resulting synthesis quality is not bad, but when I run inference on sentences longer than five or six words, there are stuttering and missing-letter problems in the synthesis. Now I am trying FastSpeech 2. Alignments really are a big problem.
Hi, I have the same question. I am also trying to train my language with FastSpeech 2, but the alignments are really difficult.
My Tacotron 2 model trains very well on my dataset, so its alignments should be good, but the synthesis is quite bad.
The outputs are somewhat intelligible but sound mixed up. So my question is whether the durations generated by Tacotron match the mels, energies, and pitches generated by librosa or by the TacotronSTFT module. This makes it hard for me to understand how FastSpeech 2 produces such good-quality audio. Thanks.
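Concretely, I would expect a sanity check like the following to hold (a sketch with hypothetical file paths; the mel is assumed to be stored as [T, n_mels]):

import numpy as np

durations = np.load("alignments/LJ001-0001.npy")   # hypothetical duration/alignment file
mel = np.load("mels/LJ001-0001.npy")               # hypothetical mel file, assumed shape [T, n_mels]
print(int(durations.sum()), mel.shape[0])          # these two frame counts should match

If the two counts differ, the librosa STFT settings (hop length, window length, sampling rate) probably do not match the TacotronSTFT settings used when training the teacher.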
I researched this problem and saw something about the reduction factor. I do not fully understand the architecture, but Tacotron can learn the alignment more easily with a large reduction factor; however, the NVIDIA Tacotron 2 implementation has no reduction factor (it is effectively 1). Maybe the NVIDIA Tacotron 2 is good for synthesis but bad at extracting alignments. I'm not sure; I will keep researching and edit this.
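If a teacher with a reduction factor r > 1 were used, each decoder step would emit r mel frames, so the per-step counts taken from the attention map would have to be scaled back to frame resolution, roughly like this (a sketch building on the argmax counting sketch above):

r = 2                                                        # hypothetical reduction factor of the teacher
best = attn.argmax(axis=1)                                   # attn: [decoder_steps, encoder_steps]
durations = np.bincount(best, minlength=attn.shape[1]) * r   # each decoder step now covers r mel frames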
@CanKorkut Hi, I'm using that alignments.py (commit id: e11b60d) to extract alignment files, but the result has a different dimension than the LJSpeech alignment files this FastSpeech repo already ships with. Can you show me the exact code you used to extract the alignment files for training another language? Thank you.
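One check I would expect to hold (a sketch; the sample text, cleaner name, and file path are assumptions) is one duration entry per input symbol:

import numpy as np
from text import text_to_sequence                 # the text module used when running the teacher

seq = text_to_sequence("hello world.", ["english_cleaners"])   # cleaner name is an assumption
dur = np.load("alignments/0.npy")                 # hypothetical extracted alignment file
print(len(seq), dur.shape[0])                     # should be equal: one duration per input symbol

A mismatch here can simply mean the teacher was run on a different symbol set (for example characters instead of phonemes) than the one the provided LJSpeech files were built from.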