daniilrobnikov/vits2

Hi, is there any Chinese support?


Yes, the model supports three accents/dialects of Chinese, namely:

  • Mandarin
  • Hakka
  • Cantonese

To use one of these languages, you need to:

  1. Phonemize your text with the open-source phonemizer.
  2. Update text/cleaners.py with your language identifier:
    You can find the list of all supported languages and associated identifiers here.
from typing import List
from phonemizer import phonemize
def phonemize_text(text: List[str] | str):
    return phonemize(text, language="your identifier", backend="espeak", strip=True, preserve_punctuation=True, with_stress=True, tie=True, njobs=8)
  3. Update the list of symbols if required. See text/symbols.py
    (Make sure that _symbols contains all unique symbols from your filelists)
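
The check in the last step can be sketched roughly as follows. This is only an illustration, not code from the repository: the helper name find_missing_symbols is hypothetical, and it assumes each filelist line is pipe-separated with the already phonemized text in the last field.

```python
def find_missing_symbols(filelist_lines, symbols):
    """Return the characters used in the filelist text that are not in `symbols`.

    Assumption: each line looks like "path|text" (or "path|speaker|text"),
    with the phonemized text in the last pipe-separated field.
    """
    known = set(symbols)
    missing = set()
    for line in filelist_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        text = line.split("|")[-1]
        missing.update(ch for ch in text if ch not in known)
    return sorted(missing)


# Toy example with a deliberately tiny symbol list
lines = ["wavs/a.wav|ni3 hao3", "wavs/b.wav|ni3 hao3!"]
print(find_missing_symbols(lines, list(" nihao3")))
```

Any character this reports would need to be added to _symbols before training.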

@daniilrobnikov it looks like the code can potentially support Chinese, but are there any trained models we could use to test how well it works for Chinese?

Good question! I haven't tested the model on a Chinese dataset yet.

At this point, I have trained the model on the LJSpeech dataset for 18k steps out of 800k, and here are the results:

download.mp4

If you are interested, we can collaborate to train the model for Chinese

@daniilrobnikov for sure.

I would like to help. VITS is one of the most elegant TTS models I have ever seen, and a multilingual version of it, especially one covering Chinese, would be very useful.

For now, I can contribute a Chinese dataset to get started. For Chinese, most people use the Biaobei dataset.

The dataset can be downloaded from: https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar

Here is the Chinese lexicon, which is specific to Chinese because it includes the tones (1, 2, 3, 4) of Chinese pinyin.

_pause = ["sil", "eos", "sp", "#0", "#1", "#2", "#3"]

_initials = [
    "^",
    "b",
    "c",
    "ch",
    "d",
    "f",
    "g",
    "h",
    "j",
    "k",
    "l",
    "m",
    "n",
    "p",
    "q",
    "r",
    "s",
    "sh",
    "t",
    "x",
    "z",
    "zh",
]

_tones = ["1", "2", "3", "4", "5"]

_finals = [
    "a",
    "ai",
    "an",
    "ang",
    "ao",
    "e",
    "ei",
    "en",
    "eng",
    "er",
    "i",
    "ia",
    "ian",
    "iang",
    "iao",
    "ie",
    "ii",
    "iii",
    "in",
    "ing",
    "iong",
    "iou",
    "o",
    "ong",
    "ou",
    "u",
    "ua",
    "uai",
    "uan",
    "uang",
    "uei",
    "uen",
    "ueng",
    "uo",
    "v",
    "van",
    "ve",
    "vn",
]

symbols = _pause + _initials + [i + j for i in _finals for j in _tones]

# print(len(symbols))

Would you like me to provide more help with Chinese?
(I think that with some preprocessing of the data, you can start training on it quickly.)

Thanks for sharing the code, I would greatly appreciate any support.
Also, the Chinese lexicon requires a multi-character tokenizer, since VITS initially supports only character-by-character conversion.
I will look into possible alternatives.
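
One possible alternative is a greedy longest-match tokenizer over the symbol list. The sketch below is an assumption for illustration, not the repository's tokenizer: the tokenize function is hypothetical, and the symbol lists are the ones posted above, written compactly.

```python
# Symbol lists from the comment above, written compactly.
_pause = ["sil", "eos", "sp", "#0", "#1", "#2", "#3"]
_initials = "^ b c ch d f g h j k l m n p q r s sh t x z zh".split()
_tones = "12345"
_finals = (
    "a ai an ang ao e ei en eng er i ia ian iang iao ie ii iii in ing "
    "iong iou o ong ou u ua uai uan uang uei uen ueng uo v van ve vn"
).split()
symbols = _pause + _initials + [f + t for f in _finals for t in _tones]


def tokenize(text, vocab):
    """Split `text` into vocabulary symbols, always taking the longest match."""
    vocab_set = set(vocab)
    max_len = max(len(s) for s in vocab_set)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab_set:
                tokens.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no symbol matches at position {i}: {text[i:]!r}")
    return tokens


symbol_to_id = {s: idx for idx, s in enumerate(symbols)}
tokens = tokenize("zhong1guo2", symbols)
print(tokens, [symbol_to_id[t] for t in tokens])
```

Because every final carries a tone digit, the longest match keeps "zh" from being read as "z" followed by "h", which a character-by-character conversion cannot do.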

Great, looking forward to your updates. Yes, Chinese characters need to be converted to pinyin first, and this should be labeled in the ground-truth data.

Heads-up on the Chinese support:
I have updated the configs to include a language param.

All you need to do is assign the language identifier (found here) to the language param:

data:
    ...
    language: "id"
    ...

Also, I am currently working on the sub-word tokenizer for phonemes.
Most likely I will use torchtext, but I am open to your suggestions!

@daniilrobnikov thanks for the updates, this looks very promising. So for Chinese it should use cmn. Does it automatically apply the symbols for Mandarin?
Shall we use the symbols for Chinese that I pasted above? They account for the tones that are specific to Chinese (like tones 1, 2, 3, 4 in pinyin).

With the symbols I provided above, the model works out of the box reasonably well (although the performance could be tuned a bit further).

As far as I understand, for Mandarin Chinese you would use the cmn identifier:

from phonemizer import phonemize


text = "语音合成技术是实现人机语音通信关键技术之一"
phonemes = phonemize(text, language="cmn", backend="espeak",
                     strip=True, preserve_punctuation=True, with_stress=True)
print("text: \t\t", text)
print("phonemes: \t", phonemes)

And the output from the phonemizer would look something like this:

text: 		 语音合成技术是实现人机语音通信关键技术之一
phonemes: 	 jˈy2 jˈi5n χˈo-ɜ ts.hˈəɜŋ tɕˈi5 s.ˈu5 s.ˈi.5 s.ˈi.ɜ ɕˈiɛ5n ʐˈəɜn tɕˈi5 jˈy2 jˈi5n thˈonɡ5 ɕˈi5n kwˈa5n tɕˈiɛ5n tɕˈi5 s.ˈu5 ts.ˈi.5 jˈi5

NOTE: Here you can see that it not only phonemizes tones, but also marks stress and preserves punctuation, which may lead to better speech inference but makes it harder to set up a sub-word tokenizer.

For now, you can add symbols that were not included initially, such as tones, to text/cleaners.py like this:

_pad = "_"
tones = "12345"
_symbols = " !\"',-.:;?abcdefhijklmnopqrstuvwxyz¡«»¿æçðøħŋœǀǁǂǃɐɑɒɓɔɕɖɗɘəɚɛɜɝɞɟɠɡɢɣɤɥɦɧɨɪɫɬɭɮɯɰɱɲɳɴɵɶɸɹɺɻɽɾʀʁʂʃʄʈʉʊʋʌʍʎʏʐʑʒʔʕʘʙʛʜʝʟʡʢʰʲʷʼˈˌːˑ˔˞ˠˡˤ˥˦˧˨˩̴̘̙̜̝̞̟̠̤̥̩̪̬̮̯̰̹̺̻̼͈͉̃̆̈̊̽͆̚͡βθχᵝᶣ—‖“”…‿ⁿ↑↓↗↘◌ⱱꜛꜜ︎ᵻ"
symbols = list(_pad) + list(_symbols) + list(tones)

I tried to tokenize this text after including tones and it works correctly:

import torch

from utils.hparams import HParams
from utils.model import intersperse
from text import text_to_sequence, sequence_to_text, PAD_ID


def get_text(text: str, hps) -> torch.LongTensor:
    text_norm = text_to_sequence(text, hps.data.text_cleaners, language=hps.data.language)
    if hps.data.add_blank:
        text_norm = intersperse(text_norm, PAD_ID)
    text_norm = torch.LongTensor(text_norm)
    return text_norm


hps = {
    "data": {
        "text_cleaners": ["phonemize_text"],
        "language": "cmn",
        "add_blank": False,
    }
}
hps = HParams(**hps)
text = "语音合成技术是实现人机语音通信关键技术之一"
text = get_text(text, hps)
print(sequence_to_text(text.numpy()))

NOTE: For now, adding the tones to text/cleaners.py should be enough to start training on Mandarin.
I will also test the Mandarin symbols myself ASAP.
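
For reference, when add_blank is enabled, get_text passes the sequence through intersperse. The sketch below assumes the helper in utils.model behaves like the function of the same name in the original VITS code, which places the blank ID between every pair of tokens and at both ends.

```python
def intersperse(lst, item):
    """Insert `item` before, between, and after the elements of `lst`."""
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst  # original tokens go to the odd positions
    return result


# With a blank/pad ID of 0, a 3-token sequence becomes 7 tokens long.
print(intersperse([5, 7, 9], 0))  # [0, 5, 0, 7, 0, 9, 0]
```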

Nice. For Chinese, things are actually a bit more complicated: there is a special case where the same character is pronounced differently in different sentences. Such a character is called a polyphone.

For instance:

因为天要下雨了,所以我准备收衣服。

This means: "Since it's going to rain, I am going to bring my clothes inside."

The pinyin for the word 所以 is normally suo3 yi3, with both syllables in the third tone, but pronouncing it that way sounds weird in real life.

This should not affect training, since the pinyin should already exist in the labeled data, but at inference time the phonemizer might not handle this correctly.

This part is specific to Chinese. People may have to use another model, such as BERT, to predict the correct pinyin for polyphones and then send the result to the tokenizer.

I am not sure whether phonemizer can handle this. I have tried some multilingual TTS systems like PIPE, and they cannot handle it, so to a native speaker their results sound very weird.
This detail is very hard to notice for someone who is not a native speaker of Chinese.
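
As a much simpler stand-in for the BERT approach mentioned above, a word-level lexicon can override the default per-character reading. The sketch below is only an illustration under stated assumptions: the two toy lexicons and the to_pinyin helper are hypothetical, and a real system would need a full pronunciation dictionary or a context model.

```python
# Hypothetical toy lexicons for illustration only.
WORD_LEXICON = {"银行": ["yin2", "hang2"]}  # 行 is read "hang2" in this word
CHAR_LEXICON = {"银": "yin2", "行": "xing2", "人": "ren2"}  # default readings


def to_pinyin(text, word_lexicon=WORD_LEXICON, char_lexicon=CHAR_LEXICON):
    """Convert text to pinyin, preferring the longest word-level match."""
    max_len = max(len(w) for w in word_lexicon)
    out, i = [], 0
    while i < len(text):
        # Look for multi-character words first (longest match wins).
        for n in range(min(max_len, len(text) - i), 1, -1):
            word = text[i:i + n]
            if word in word_lexicon:
                out.extend(word_lexicon[word])
                i += n
                break
        else:
            # Fall back to the per-character reading; unknown
            # characters (e.g. punctuation) pass through unchanged.
            out.append(char_lexicon.get(text[i], text[i]))
            i += 1
    return out


print(to_pinyin("银行"))  # ['yin2', 'hang2']
print(to_pinyin("行人"))  # ['xing2', 'ren2']
```

The same character 行 gets a different reading depending on the word it appears in, which is the behavior a plain per-character conversion misses.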

I see, so for Chinese it is way more complicated.

As I understand it, in the context of speech synthesis, the phonemizer (or any other g2p conversion) is just a bridge that helps make the audio more accurate. This means that during training, the model should learn to account for such mistakes if it sees them in the training dataset.

I tested the model on English and Bengali, and the results are almost indistinguishable from the source.
On the LJSpeech dataset, the model at 81k steps out of 800k sounds like this:

TTS.mp4

Compared to the ground truth:

GT.mp4

This is an audio file from test set, so the model didn't see it during training.

Considering the results so far, I think it should work fine for Chinese as well.
I will test the Chinese dataset and provide the results in the next release.

Also, I updated the tokenizer and vocabulary symbols, so the model should work for Chinese right away.

But if you want to improve the results of g2p conversion, there is a paper, Mixed-Phoneme BERT, which covers enhancing the TTS phoneme encoder based on phonemes and the corresponding speech utterances:
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech