Hi, is there any Chinese support?
Yes, the model supports three accents/dialects of Chinese, namely:
- Mandarin
- Hakka
- Cantonese
To use one of these languages, you need to:
- phonemize your text with an open-source phonemizer.
- update text/cleaners.py:
You can find the list of all supported languages and associated identifiers here.
from typing import List

from phonemizer import phonemize


def phonemize_text(text: List[str] | str):
    return phonemize(text, language="your identifier", backend="espeak", strip=True, preserve_punctuation=True, with_stress=True, tie=True, njobs=8)
- update the list of symbols if required. See text/symbols.py
(Make sure that _symbols contains all unique symbols from your filelists.)
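For that last check, here is a minimal sketch; the module layout, function name, and pipe-separated "path|text" filelist format are assumptions rather than something from the repo:
from text.symbols import symbols  # assumed module layout

def missing_symbols(filelist_path: str) -> set:
    """Return characters found in the filelist text column but absent from `symbols`.
    Sketch only: assumes "path|text" lines; adjust the split for your filelist format."""
    missing = set()
    with open(filelist_path, encoding="utf-8") as f:
        for line in f:
            text = line.rstrip("\n").split("|")[-1]
            missing.update(ch for ch in text if ch not in symbols)
    return missing

print(missing_symbols("filelists/train.txt"))  # hypothetical path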
@daniilrobnikov It looks like the code provides potential support for Chinese, but are there any trained models we can use to test how it performs on Chinese?
Good question! I haven't tested the model on a Chinese dataset yet.
At this point, I have trained the model on the LJSpeech dataset for 18k steps out of 800k, and here are the results:
download.mp4
If you are interested, we can collaborate to train the model for Chinese.
@daniilrobnikov for sure.
I would like to help. VITS is one of the most elegant TTS models I have ever seen; a multilingual version of it, especially one supporting Chinese, would be very useful.
For now, I can contribute a Chinese dataset to get started. For Chinese, the Biaobei (BZNSYP) dataset is used most often.
The dataset can be downloaded from: https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar
Here is the Chinese lexicon, which is specific to Chinese since pinyin syllables carry tones (1, 2, 3, 4, plus 5 for the neutral tone):
_pause = ["sil", "eos", "sp", "#0", "#1", "#2", "#3"]
_initials = [
    "^",  # placeholder for syllables without an initial
    "b", "c", "ch", "d", "f", "g", "h", "j", "k", "l", "m", "n",
    "p", "q", "r", "s", "sh", "t", "x", "z", "zh",
]
_tones = ["1", "2", "3", "4", "5"]
_finals = [
    "a", "ai", "an", "ang", "ao", "e", "ei", "en", "eng", "er",
    "i", "ia", "ian", "iang", "iao", "ie", "ii", "iii", "in", "ing",
    "iong", "iou", "o", "ong", "ou", "u", "ua", "uai", "uan", "uang",
    "uei", "uen", "ueng", "uo", "v", "van", "ve", "vn",
]
# Each final is combined with each tone, e.g. "ong" -> "ong1" ... "ong5".
symbols = _pause + _initials + [i + j for i in _finals for j in _tones]
# print(len(symbols))
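Not from the repo, just a rough sketch of how these symbols compose: a toned pinyin syllable such as "zhong1" splits into an initial plus a final+tone token (a real front-end also handles the special finals like "ii"/"iii" and "v", which this sketch skips):
import re

def split_pinyin(syllable: str) -> list:
    # Sketch only: split a toned pinyin syllable (e.g. "zhong1") into the
    # [initial, final+tone] tokens used in the symbol list above.
    m = re.match(r"([a-z]+?)([1-5])$", syllable)
    if m is None:
        raise ValueError(f"not a toned pinyin syllable: {syllable}")
    body, tone = m.groups()
    # Longest-match against the _initials list defined above; "^" marks a null initial.
    initial = next(
        (i for i in sorted(_initials, key=len, reverse=True) if i != "^" and body.startswith(i)),
        "^",
    )
    final = body if initial == "^" else body[len(initial):]
    return [initial, final + tone]

print(split_pinyin("zhong1"))  # ['zh', 'ong1']
print(split_pinyin("an4"))     # ['^', 'an4']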
Would you like me to provide more assistance for Chinese?
(I think with some preprocessing of the data, you can start training on it quickly.)
Thanks for sharing the code, I would greatly appreciate any support.
Also, the Chinese lexicon requires a multi-character tokenizer, since VITS initially supports only character-by-character conversion.
I will research into possible alternatives
Great, looking forward to your updates. Yes, Chinese characters need to be converted to pinyin first; this should be labeled in the ground-truth data.
Heads-up on the Chinese support:
I have updated configs to include language param.
All you need to do is assign the language identifier (found here) to the language field:
data:
  ...
  language: "id"
  ...
Also, I am currently working on a sub-word tokenizer for phonemes. Most likely it will use torchtext, but I am open to your suggestions!
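Not a final design, just a rough sketch of what building a phoneme vocabulary with torchtext's build_vocab_from_iterator could look like (whitespace-split phoneme groups here; an actual sub-word tokenizer would split further):
from torchtext.vocab import build_vocab_from_iterator

# Hypothetical phonemized utterances, e.g. output of phonemize_text above.
phonemized = ["jˈy2 jˈi5n χˈo-ɜ ts.hˈəɜŋ", "tɕˈi5 s.ˈu5 s.ˈi.5"]

def yield_tokens(lines):
    for line in lines:
        yield line.split()  # naive whitespace split, one token per phoneme group

vocab = build_vocab_from_iterator(yield_tokens(phonemized), specials=["_", "<unk>"])
vocab.set_default_index(vocab["<unk>"])
print(vocab(["jˈy2", "tɕˈi5", "unseen"]))  # unknown tokens map to <unk>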
@daniilrobnikov thanks for the updates, this looks very promising. So for Chinese it should use cmn?
Does it automatically apply the symbols for Mandarin?
Shall we use the symbols for Chinese that I pasted above? They account for the tones that are special to Chinese (tones 1, 2, 3, 4 in pinyin).
Using the symbols I provided above, the model should work out of the box well enough (although we can still tune the performance slightly further).
As far as I understand, for Mandarin Chinese you would use the cmn identifier:
from phonemizer import phonemize
text = "语音合成技术是实现人机语音通信关键技术之一"
phonemes = phonemize(text, language="cmn", backend="espeak",
strip=True, preserve_punctuation=True, with_stress=True)
print("text: \t\t", text)
print("phonemes: \t", phonemes)
And the output from the phonemizer would look something like this:
text: 语音合成技术是实现人机语音通信关键技术之一
phonemes: jˈy2 jˈi5n χˈo-ɜ ts.hˈəɜŋ tɕˈi5 s.ˈu5 s.ˈi.5 s.ˈi.ɜ ɕˈiɛ5n ʐˈəɜn tɕˈi5 jˈy2 jˈi5n thˈonɡ5 ɕˈi5n kwˈa5n tɕˈiɛ5n tɕˈi5 s.ˈu5 ts.ˈi.5 jˈi5
NOTE: Here you can see that it not only phonemizes the tones, but also adds stress marks and preserves punctuation, which may lead to better speech at inference time, but makes the sub-word tokenizer harder to set up.
For now, you can add the symbols that were not included initially, such as the tones, to text/cleaners.py like this:
_pad = "_"
tones = "12345"
_symbols = " !\"',-.:;?abcdefhijklmnopqrstuvwxyz¡«»¿æçðøħŋœǀǁǂǃɐɑɒɓɔɕɖɗɘəɚɛɜɝɞɟɠɡɢɣɤɥɦɧɨɪɫɬɭɮɯɰɱɲɳɴɵɶɸɹɺɻɽɾʀʁʂʃʄʈʉʊʋʌʍʎʏʐʑʒʔʕʘʙʛʜʝʟʡʢʰʲʷʼˈˌːˑ˔˞ˠˡˤ˥˦˧˨˩̴̘̙̜̝̞̟̠̤̥̩̪̬̮̯̰̹̺̻̼͈͉̃̆̈̊̽͆̚͡βθχᵝᶣ—‖“”…‿ⁿ↑↓↗↘◌ⱱꜛꜜ︎ᵻ"
symbols = list(_pad) + list(_symbols) + list(tones)
I tried to tokenize this text after including the tones, and it works correctly:
import torch
from utils.hparams import HParams
from utils.model import intersperse
from text import text_to_sequence, sequence_to_text, PAD_ID
def get_text(text: str, hps) -> torch.LongTensor:
text_norm = text_to_sequence(text, hps.data.text_cleaners, language=hps.data.language)
if hps.data.add_blank:
text_norm = intersperse(text_norm, PAD_ID)
text_norm = torch.LongTensor(text_norm)
return text_norm
hps = {
"data": {
"text_cleaners": ["phonemize_text"],
"language": "cmn",
"add_blank": False,
}
}
hps = HParams(**hps)
text = "语音合成技术是实现人机语音通信关键技术之一"
text = get_text(text, hps)
print(sequence_to_text(text.numpy()))
For now, adding the tones to text/cleaners.py should be enough to start training on Mandarin.
I will also test the Mandarin symbols myself ASAP.
Nice. For Chinese, things are actually a little more complicated. There is a special situation where the same character is pronounced differently in different sentences; such characters are called polyphones.
For instance:
因为天要下雨了,所以我准备收衣服。
means "since it's going to rain, I am going to take my clothes inside."
The word 所以 is normally suo3 yi3 in pinyin. Both syllables are third tone, but pronouncing them literally that way sounds weird in real life (in natural speech the first third tone shifts to a rising tone, a tone-sandhi rule).
This should not affect training, since the pinyin already exists in the labeled data, but at inference time the phonemizer might not handle this correctly.
This is actually specific to Chinese; people may have to use another model such as BERT to correctly predict the pinyin of polyphones and then send the result to the tokenizer, as sketched below.
I am not sure whether phonemizer can handle this. I have tried some multilingual TTS systems like PIPE, and they cannot handle it, so to a native speaker their results sound very weird.
This detail is very hard to notice if one is not a native speaker of Chinese.
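As a reference point (not part of this repo), dictionary-based character-to-pinyin conversion is what packages such as pypinyin provide; a BERT-style model would only be needed to post-process the hard polyphone cases before tokenization:
from pypinyin import lazy_pinyin, Style

text = "因为天要下雨了,所以我准备收衣服。"
# Style.TONE3 appends the tone number to each syllable, e.g. "suo3".
pinyin = lazy_pinyin(text, style=Style.TONE3)
print(pinyin)
# Hard polyphones (and tone-sandhi cases) could then be corrected by a
# context model such as BERT before the result is sent to the tokenizer.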
I see, so for Chinese it is way more complicated.
As I understand it, in the context of speech synthesis, using phonemizer or any other g2p conversion is just a bridge to make the audio more correct. Meaning that during training, the model should learn to account for such mistakes if it sees them in the training dataset.
I tested the model on English and Bengali, and the results are almost indistinguishable from the source audio.
For the LJSpeech dataset, 81k steps out of 800k sounds like this:
TTS.mp4
Compared to the ground truth:
GT.mp4
This is an audio file from the test set, so the model didn't see it during training.
Considering the results so far, I think it should work fine for Chinese as well.
I will test the Chinese dataset and provide the results in the nearest release.
Also, I updated the tokenizer and the vocabulary symbols, so the model should work for Chinese right away.
But if you want to improve the results of g2p conversion, there is a paper, Mixed-Phoneme BERT, which covers enhancing the TTS phoneme encoder based on phonemes and their corresponding speech utterances:
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech