vits_chinese

Best-practice TTS based on BERT and VITS, with some Natural Speech features from Microsoft; ONNX streaming output is supported!


This is a project for learning TTS algorithms. If you are looking for a TTS system to use directly in production, this project may not be suitable for you!

vits_bert.mp4 (demo video)

天空呈现的透心的蓝,像极了当年。总在这样的时候,透过窗棂,心,在天空里无尽的游弋!柔柔的,浓浓的,痴痴的风,牵引起心底灵动的思潮;情愫悠悠,思情绵绵,风里默坐,红尘中的浅醉,诗词中的优柔,任那自在飞花轻似梦的情怀,裁一束霓衣,织就清浅淡薄的安寂。

风的影子翻阅过淡蓝色的信笺,柔和的文字浅浅地漫过我安静的眸,一如几朵悠闲的云儿,忽而氤氲成汽,忽而修饰成花,铅华洗尽后的透彻和靓丽,爽爽朗朗,轻轻盈盈

时光仿佛有穿越到了从前,在你诗情画意的眼波中,在你舒适浪漫的暇思里,我如风中的思绪徜徉广阔天际,仿佛一片沾染了快乐的羽毛,在云环影绕颤动里浸润着风的呼吸,风的诗韵,那清新的耳语,那婉约的甜蜜,那恬淡的温馨,将一腔情澜染得愈发的缠绵。

Features

1. Hidden prosody embedding from BERT, for natural pauses at grammatical boundaries

2. Inference loss from NaturalSpeech, for fewer pronunciation errors

3. VITS framework, for high audio quality

💗 Tip: it is recommended to fine-tune with the inference loss after the base model has been trained, and to freeze the PosteriorEncoder during fine-tuning.

💗 In other words: do not use loss_kl_r during initial training; once the base model is trained, add loss_kl_r and continue training for just a little while. If the audio quality gets worse, multiply loss_kl_r by a coefficient smaller than 1 to reduce its contribution to the model. When continuing training, you can also try freezing the audio encoder (Posterior Encoder). In short, there are many ways to experiment with this, so try things out! A minimal sketch of this recipe is given below.
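A minimal sketch of the fine-tuning recipe described above, assuming the training script exposes the individual loss terms; loss_kl_r, the other loss names, and the VITS-style net_g.enc_q posterior encoder are assumptions here and may not match the repo's exact variables:

# Hypothetical fine-tuning snippet: down-weight loss_kl_r and freeze the
# posterior encoder before continuing training from the base checkpoint.
kl_r_weight = 0.5  # coefficient < 1 to limit loss_kl_r's contribution

for p in net_g.enc_q.parameters():  # enc_q: the posterior encoder (VITS naming)
    p.requires_grad = False

loss_total = loss_mel + loss_dur + loss_kl + kl_r_weight * loss_kl_r
loss_total.backward()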

Online demo,在线体验

https://huggingface.co/spaces/maxmax20160403/vits_chinese

Install dependencies and build MAS alignment

pip install -r requirements.txt

cd monotonic_align

python setup.py build_ext --inplace
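Optionally, sanity-check the build from the repo root (assuming the package layout matches upstream VITS, where the extension is importable as monotonic_align):

python -c "import monotonic_align"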

Infer with pretrained model

Get the models from the release page: vits_chinese/releases/

Put prosody_model.pt to ./bert/prosody_model.pt

Put vits_bert_model.pth to ./vits_bert_model.pth

python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth

./vits_infer_out contains the inferred waves; listen to them!

Infer with chunked wave streaming output

The key parameter is hop_frame = ∑ decoder.ups.padding 💗 (see the one-line sketch below)
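A one-line sketch of that sum, assuming the generator exposes its upsampling ConvTranspose1d layers as net_g.dec.ups as in upstream VITS (the attribute path is an assumption here):

# hop_frame: total padding of the decoder's upsampling layers, used as the
# overlap context between streamed chunks.
hop_frame = sum(up.padding[0] for up in net_g.dec.ups)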

python vits_infer_stream.py --config ./configs/bert_vits.json --model vits_bert_model.pth

Train

Download the Baker dataset from https://aistudio.baidu.com/datasetdetail/36741; more info: https://www.data-baker.com/data/index/TNtts/

Change the sample rate of the waves to 16kHz and put the waves into ./data/waves

python vits_resample.py -w [input path, e.g. ./data/Wave/] -o ./data/waves -s 16000

Put 000001-010000.txt to ./data/000001-010000.txt

python vits_prepare.py -c ./configs/bert_vits.json
python train.py -c configs/bert_vits.json -m bert_vits

bert_loss (training loss curve)

Additional notes

The original annotations are:

000001	卡尔普#2陪外孙#1玩滑梯#4ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
000002	假语村言#2别再#1拥抱我#4jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3

After normalizing the annotations:

  • BERT needs the Chinese characters 卡尔普陪外孙玩滑梯。 (including punctuation)
  • TTS needs the initials and finals (声韵母): sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
000001	卡尔普陪外孙玩滑梯ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
  sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
000002	假语村言别再拥抱我jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3
  sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil

The training annotations are:

./data/wavs/000001.wav|./data/temps/000001.spec.pt|./data/berts/000001.npy|sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
./data/wavs/000002.wav|./data/temps/000002.spec.pt|./data/berts/000002.npy|sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil
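Each filelist line is pipe-separated; a minimal parsing sketch (illustrative only, the repo's data loader may differ):

# Split one training filelist line into its four fields.
line = "./data/wavs/000001.wav|./data/temps/000001.spec.pt|./data/berts/000001.npy|sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil"
wav_path, spec_path, bert_path, phones = line.strip().split("|")
phone_list = phones.split()  # ['sil', 'k', 'a2', '^', 'er2', ...]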

This sentence will cause an error:

002365	这图#2难不成#2是#1P过的#4?
  zhe4 tu2 nan2 bu4 cheng2 shi4 P IY1 guo4 de5

Fixing pinyin errors

Write the correct words and their pinyin into the file ./text/pinyin-local.txt:

渐渐 jian4 jian4
浅浅 qian3 qian3
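A minimal sketch of how such a local lexicon can be read into an override table (hypothetical; the repo's actual loading code may differ):

# Each line of ./text/pinyin-local.txt is "word pinyin1 pinyin2 ...".
pinyin_local = {}
with open("./text/pinyin-local.txt", encoding="utf-8") as f:
    for raw in f:
        parts = raw.strip().split()
        if len(parts) >= 2:
            pinyin_local[parts[0]] = parts[1:]

# pinyin_local -> {'渐渐': ['jian4', 'jian4'], '浅浅': ['qian3', 'qian3']}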

Number reading support

Supported, based on WeTextProcessing from the WeNet open-source community; of course, it cannot be perfect.
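A small usage sketch of WeTextProcessing's text normalization (assuming the WeTextProcessing package is installed; the exact output can vary between versions):

from tn.chinese.normalizer import Normalizer

normalizer = Normalizer()
# Numbers are expanded into Chinese words before the text reaches the TTS frontend.
print(normalizer.normalize("手机余额还剩100元"))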

Inference without BERT

python vits_infer_no_bert.py --config ./configs/bert_vits.json --model vits_bert_model.pth

Although BERT is used during training, inference can run entirely without BERT, sacrificing the natural pauses in order to fit devices with low compute resources, such as mobile phones.

Low-resource devices usually synthesize sentence by sentence, so the sacrificed natural pauses are not that noticeable.

ONNX, non-streaming

Export (there will be many warnings; just ignore them):

python model_onnx.py --config configs/bert_vits.json --model vits_bert_model.pth

Inference:

python vits_infer_onnx.py --model vits-chinese.onnx

ONNX, streaming

Concretely, VITS is split into two models, named Encoder and Decoder:

  • The Encoder includes the TextEncoder, the DurationPredictor, etc.;

  • The Decoder includes the ResidualCouplingBlock, the Generator, etc.

The inference logic is split accordingly; in particular, sampling from the prior distribution is placed inside the Encoder:

# sample the prior latent: mean m_p plus Gaussian noise scaled by exp(logs_p) and noise_scale
z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale

ONNX streaming model export

python model_onnx_stream.py --config configs/bert_vits.json --model vits_bert_model.pth

ONNX streaming model inference

python vits_infer_onnx_stream.py --encoder vits-chinese-encoder.onnx --decoder vits-chinese-decoder.onnx

In streaming inference, hop_frame is an important parameter; you need to experiment to find a suitable value.
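For intuition, a minimal sketch of how a chunked decoder pass could use hop_frame as overlap context (the chunk size, the samples-per-frame factor, and the decode_fn callable are illustrative assumptions; see vits_infer_onnx_stream.py for the actual implementation):

import numpy as np

def stream_decode(z_p, decode_fn, chunk_frame=48, hop_frame=8, samples_per_frame=256):
    """Decode z_p (shape [C, T]) chunk by chunk with hop_frame frames of context."""
    total = z_p.shape[-1]
    pieces = []
    for start in range(0, total, chunk_frame):
        end = min(start + chunk_frame, total)
        left = max(0, start - hop_frame)           # pad with context on the left
        right = min(total, end + hop_frame)        # pad with context on the right
        audio = decode_fn(z_p[:, left:right])      # 1-D waveform for the padded chunk
        head = (start - left) * samples_per_frame  # samples produced by the left context
        tail = (right - end) * samples_per_frame   # samples produced by the right context
        pieces.append(audio[head:audio.shape[0] - tail])
    return np.concatenate(pieces)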

Model compression based on knowledge distillation (should this be called transfer learning or knowledge distillation?)

The student model is 53M in size and runs 3× faster than the teacher model.

To train:

python train.py -c configs/bert_vits_student.json -m bert_vits_student

To infer, get the student model from the release page:

python vits_infer.py --config ./configs/bert_vits_student.json --model vits_bert_student.pth

Multi-speaker and voice cloning, pretrained model based on AISHELL3

Download the model from https://huggingface.co/jackyqs/vits-aishell3-175-chinese/tree/main

See https://github.com/csukuangfj/vits_chinese/tree/master/aishell3 for details

Try it at https://huggingface.co/spaces/k2-fsa/text-to-speech

Code sources

Microsoft's NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

https://github.com/Executedone/Chinese-FastSpeech2 (BERT prosody)

https://github.com/wenet-e2e/WeTextProcessing

https://github.com/jaywalnut310/vits

https://github.com/wenet-e2e/wetts

https://github.com/csukuangfj (ONNX and Android)