heatz123/naturalspeech

Why is the shape of the duration file 2n+1 instead of n?

Opened this issue · 2 comments

I am using MFA to extract the duration file. In the other repositories I looked at, the duration file holds one value per phoneme: the difference between that phoneme's start and end times.
So the shape of the duration file is 'N', the number of phonemes in the TextGrid, but this model is said to need a shape of '2N+1'. To get there, this repo seems to increase the number of tokens with the 'add_blank' option.

I'm curious how the duration file is organized in this repo.
Also, do my MFA durations have to be converted to match that format?

fkwlqm commented

@ozingmw where is it said that we need 2N+1 shapes?

I have the same confusion.
Yes, the add_blank parameter makes the total length 2N+1; see this line of code:

text_norm = commons.intersperse(text_norm, 0)
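
For illustration, here is a minimal sketch of what an intersperse helper like this does. It assumes the function behaves like the VITS-style implementation (a blank before, between, and after every token); I have not checked it against this repo's commons module, and the phoneme IDs below are made up.

```python
def intersperse(lst, item):
    """Place `item` before, between, and after the elements of `lst`."""
    # N tokens plus N + 1 blanks gives 2N + 1 slots, all starting as blanks.
    result = [item] * (len(lst) * 2 + 1)
    # Odd positions (1, 3, 5, ...) receive the original tokens in order.
    result[1::2] = lst
    return result


phoneme_ids = [12, 7, 31]  # hypothetical IDs, N = 3
with_blanks = intersperse(phoneme_ids, 0)
print(with_blanks)       # [0, 12, 0, 7, 0, 31, 0]
print(len(with_blanks))  # 7, i.e. 2 * 3 + 1
```

So with add_blank enabled, the model sees 2N+1 tokens, and a duration sequence needs one entry per token, blanks included.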

In this case, how can I use the results of MFA?
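
One possible way to bring N MFA durations to length 2N+1, sketched under assumptions: the durations are already converted to frame counts, every blank token gets a fixed duration (0 here), and the model tolerates zero-length entries. None of this is confirmed by the repo; `expand_durations` and `blank_duration` are hypothetical names.

```python
def expand_durations(durations, blank_duration=0):
    """Interleave per-phoneme durations with a fixed duration for blank tokens."""
    # Same layout as interspersed text: blanks at even positions, phonemes at odd ones.
    expanded = [blank_duration] * (len(durations) * 2 + 1)
    expanded[1::2] = durations
    return expanded


mfa_frames = [5, 9, 4]  # hypothetical frame counts for N = 3 phonemes
expanded = expand_durations(mfa_frames)
print(expanded)  # [0, 5, 0, 9, 0, 4, 0]

# With blank_duration=0, the total number of frames is unchanged.
print(sum(expanded) == sum(mfa_frames))  # True
```

Giving blanks zero frames keeps the total equal to the spectrogram length. If MFA reports silence intervals, those could instead be assigned to the neighbouring blank positions, but that mapping depends on how the TextGrid is parsed.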