ming024/FastSpeech2

MFA version

shreeshailgan opened this issue · 4 comments

Hey @ming024,
Could you specify the MFA version you used to generate the textgrids you have provided in your repo? Also, did you generate those textgrids by just aligning using a pre-existing acoustic model or by using the train-and-align step on the dataset itself?

Asking because I've been using the latest MFA version (3.0.0), and the TextGrid outputs I'm getting have alignment errors compared to the TextGrids you have provided. This also causes issues in training: the model I trained using your provided TextGrids works fine, but the model I trained using my own generated TextGrids does not. The synthesized audio is fine for the first 2-3 seconds, then degrades very quickly.

Thanks.

Hi!

I'm also interested in this question. I've been training FS2 on a custom dataset. There's a pretrained MFA acoustic model for the language I've been training on (Kazakh), but that model was trained on a very small corpus, whereas mine is quite big (30 hours).

So I trained MFA from scratch with mfa train, and the results were not consistent. There were some alignment errors, which led to problems when extracting phone durations.

I also assume that phonemizing would help improve the process. So far I've been training on graphemes.

Hi @asarsembayev,
For me, the degraded model outputs were not caused by errors in MFA's alignment. The preprocessing script was probably written to work with older versions of MFA. Newer versions write the empty string "" in place of the sp token, and the preprocessing script was ignoring those intervals, leading to wrong alignments.

I had to make a couple of changes to resolve this:

1] I converted empty tokens to sp around here

s, e, p = t.start_time, t.end_time, t.text

2] I added the argument include_empty_intervals=True when reading the TextGrids.

textgrid = tgt.io.read_textgrid(tg_path)
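
For anyone who wants to see the first change spelled out, here is a minimal sketch of the relabeling step. It is not the repository's actual code: it works on plain (start, end, label) tuples instead of tgt interval objects, and the names normalize_empty_labels and SILENCE_LABEL are made up for illustration. It only helps if empty intervals are kept when the TextGrid is read (the second change), otherwise there is nothing to relabel.

```python
from typing import List, Tuple

# (start_time, end_time, label), mirroring the s, e, p triple in the loop above.
Interval = Tuple[float, float, str]

# Hypothetical constant: the silence token the preprocessing script expects.
SILENCE_LABEL = "sp"


def normalize_empty_labels(
    intervals: List[Interval], silence_label: str = SILENCE_LABEL
) -> List[Interval]:
    """Replace empty or whitespace-only labels with the silence token."""
    return [
        (start, end, label if label.strip() else silence_label)
        for start, end, label in intervals
    ]


# Example phone tier as newer MFA versions might emit it:
# silences appear as "" rather than "sp".
phones = [(0.00, 0.12, ""), (0.12, 0.20, "HH"), (0.20, 0.31, "AH0"), (0.31, 0.45, "")]
normalized = normalize_empty_labels(phones)
```

In the actual loop the same idea would be a one-line check on p right after it is unpacked from t.text.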

Have you determined which older version of MFA was used?

I think it was 1.0.1, since that was the latest version available when this repository first released its TextGrids for LJSpeech.