heatz123/naturalspeech

Any plans to reproduce Mixed-Phoneme BERT?

TinaChen95 opened this issue · 3 comments

Hi there,

First of all, I want to thank you for your great work. It has been incredibly helpful for us.

However, I have noticed that you are not using the same phoneme encoder structure as in the original paper. The paper uses mixed-phoneme input, but your code uses only plain phoneme input.

I was wondering if you have any plans to reproduce the original Mixed-Phoneme BERT model structure, or if there are any other updates or changes to the model that you plan to make.

Thank you for your time and consideration. I look forward to hearing from you soon.

Hi @TinaChen95!

As you might have noticed, this implementation doesn't include mixed-phoneme pretraining, which would require implementing the paper Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech.

I chose plain phoneme input because I think the performance benefit of mixed-phoneme input would come mainly from pretraining on a large corpus, and implementing that pretraining would take nontrivial effort. That said, I haven't actually tested mixed-phoneme input itself, so it would be nice to see whether this change helps.
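For reference, here is a rough sketch (in PyTorch, with illustrative names that are not from this repo) of what the mixed-phoneme input layer described in the paper could look like: each phoneme embedding is summed with the embedding of the sup-phoneme unit it belongs to, plus a positional embedding.

```python
# Rough sketch of a mixed-phoneme input embedding (illustrative only,
# not part of this repo). Per the MP-BERT paper, each phoneme embedding
# is summed with the embedding of the sup-phoneme it belongs to,
# plus a positional embedding.
import torch
import torch.nn as nn

class MixedPhonemeEmbedding(nn.Module):
    def __init__(self, n_phonemes, n_sup_phonemes, d_model, max_len=1000):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.sup_phoneme_emb = nn.Embedding(n_sup_phonemes, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, phoneme_ids, sup_phoneme_ids):
        # phoneme_ids: (batch, seq_len) phoneme token ids
        # sup_phoneme_ids: (batch, seq_len) id of the sup-phoneme each
        # phoneme belongs to, already expanded to phoneme-level length
        positions = torch.arange(phoneme_ids.size(1), device=phoneme_ids.device)
        return (self.phoneme_emb(phoneme_ids)
                + self.sup_phoneme_emb(sup_phoneme_ids)
                + self.pos_emb(positions))
```

The sup-phoneme ids here would come from something like a BPE tokenizer trained over phoneme sequences, expanded so each phoneme carries the id of the unit containing it.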

Currently, I don't have any plans to add mixed-phoneme pretraining or other updates to the implementation. However, if I decide to include more features, I'll post an update here.

Thank you for your interest. If you have any further questions or concerns, please let me know.

@TinaChen95 if you are interested in MP-BERT, follow this repo: https://github.com/zjwang21/mix-phoneme-bert

@rishikksh20 Thanks! That's very helpful.