vits_bert.mp4
天空呈现的透心的蓝,像极了当年。总在这样的时候,透过窗棂,心,在天空里无尽的游弋!柔柔的,浓浓的,痴痴的风,牵引起心底灵动的思潮;情愫悠悠,思情绵绵,风里默坐,红尘中的浅醉,诗词中的优柔,任那自在飞花轻似梦的情怀,裁一束霓衣,织就清浅淡薄的安寂。
风的影子翻阅过淡蓝色的信笺,柔和的文字浅浅地漫过我安静的眸,一如几朵悠闲的云儿,忽而氤氲成汽,忽而修饰成花,铅华洗尽后的透彻和靓丽,爽爽朗朗,轻轻盈盈
时光仿佛有穿越到了从前,在你诗情画意的眼波中,在你舒适浪漫的暇思里,我如风中的思绪徜徉广阔天际,仿佛一片沾染了快乐的羽毛,在云环影绕颤动里浸润着风的呼吸,风的诗韵,那清新的耳语,那婉约的甜蜜,那恬淡的温馨,将一腔情澜染得愈发的缠绵。
1. Hidden prosody embedding from BERT, for natural pauses that follow the grammar
2. Infer loss from NaturalSpeech, for fewer sound errors
3. The VITS framework, for high audio quality
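💗Tip: It is recommended to fine-tune with the Infer Loss after the base model has been trained, and to freeze the PosteriorEncoder during fine-tuning.
💗In other words: do not use loss_kl_r in the initial training. Once the model is trained, add loss_kl_r and continue training; a short run is enough. If the audio quality is poor, multiply loss_kl_r by a coefficient smaller than 1 to reduce its contribution to the model. When continuing training, you can also try freezing the audio encoder (Posterior Encoder). In short, there are many options, and it takes some experimentation!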
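The two-stage recipe above can be sketched as a weighted sum of loss terms. This is only an illustration: apart from loss_kl_r, the term names are assumed to follow the upstream VITS training script, and the function name, the `use_kl_r` switch, and the numbers are hypothetical.

```python
def total_generator_loss(losses, use_kl_r=False, kl_r_weight=1.0):
    """Sum the generator loss terms; loss_kl_r is added only when fine-tuning."""
    # Base terms, named after the upstream VITS training script (assumption).
    base_terms = ("loss_gen", "loss_fm", "loss_mel", "loss_dur", "loss_kl")
    total = sum(losses[name] for name in base_terms)
    if use_kl_r:
        # A weight below 1 reduces the contribution of loss_kl_r.
        total += kl_r_weight * losses["loss_kl_r"]
    return total


# Made-up loss values, for illustration only.
losses = {
    "loss_gen": 2.0, "loss_fm": 1.0, "loss_mel": 20.0,
    "loss_dur": 1.5, "loss_kl": 1.0, "loss_kl_r": 4.0,
}
stage1 = total_generator_loss(losses)  # base training, no loss_kl_r
stage2 = total_generator_loss(losses, use_kl_r=True, kl_r_weight=0.5)  # fine-tuning
print(stage1, stage2)
```

In PyTorch, freezing a module usually means setting `requires_grad = False` on its parameters before fine-tuning; the attribute that holds the Posterior Encoder in this repo is not shown here.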
https://huggingface.co/spaces/maxmax20160403/vits_chinese
pip install -r requirements.txt
cd monotonic_align
python setup.py build_ext --inplace
Get the models from the release page vits_chinese/releases/
Put prosody_model.pt at ./bert/prosody_model.pt
Put vits_bert_model.pth at ./vits_bert_model.pth
python vits_infer.py --config ./configs/bert_vits.json --model vits_bert_model.pth
The synthesized waves are written to ./vits_infer_out, have a listen!
A key parameter: hop_frame = ∑decoder.ups.padding 💗
python vits_infer_stream.py --config ./configs/bert_vits.json --model vits_bert_model.pth
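As a rough illustration of hop_frame = ∑decoder.ups.padding, the sketch below sums the paddings of the decoder's upsampling layers. It assumes the HiFi-GAN-style convention used by the upstream VITS Generator, where each transposed convolution has padding (kernel_size - rate) // 2. The rates and kernel sizes are the upstream VITS defaults, not values read from ./configs/bert_vits.json, so check the config before relying on the result.

```python
def upsample_paddings(kernel_sizes, rates):
    """Padding of each upsampling layer, assuming padding = (kernel - rate) // 2."""
    return [(k - r) // 2 for k, r in zip(kernel_sizes, rates)]


# Hypothetical values (upstream VITS defaults), not taken from this repo's config.
upsample_rates = [8, 8, 2, 2]
upsample_kernel_sizes = [16, 16, 4, 4]

paddings = upsample_paddings(upsample_kernel_sizes, upsample_rates)
hop_frame = sum(paddings)
print(paddings, hop_frame)
```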
Download the Baker data from https://aistudio.baidu.com/datasetdetail/36741; more info: https://www.data-baker.com/data/index/TNtts/
Change the sample rate of the waves to 16 kHz, and put them in ./data/waves
python vits_resample.py -w [input path]:[./data/Wave/] -o ./data/waves -s 16000
Put 000001-010000.txt at ./data/000001-010000.txt
python vits_prepare.py -c ./configs/bert_vits.json
python train.py -c configs/bert_vits.json -m bert_vits
Original annotation:
000001 卡尔普#2陪外孙#1玩滑梯#4。
ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
000002 假语村言#2别再#1拥抱我#4。
jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3
After normalization:
- BERT needs the Chinese characters (including punctuation)
卡尔普陪外孙玩滑梯。
- TTS needs the initials and finals
sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
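Getting the BERT text from the original annotation amounts to dropping the #1 to #4 prosody marks. A minimal sketch, assuming the utterance ID and the text are separated by whitespace; the function names are hypothetical and this is not necessarily how vits_prepare.py does it:

```python
import re


def strip_prosody_marks(text):
    """Remove the prosody marks #1..#4, keeping characters and punctuation."""
    return re.sub(r"#[1-4]", "", text)


def parse_label_line(line):
    """Split '000001 text...' into (utterance_id, text without prosody marks)."""
    utt_id, text = line.strip().split(maxsplit=1)
    return utt_id, strip_prosody_marks(text)


print(parse_label_line("000001 卡尔普#2陪外孙#1玩滑梯#4。"))
```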
000001 卡尔普陪外孙玩滑梯。
ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
000002 假语村言别再拥抱我。
jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3
sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil
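The pinyin-to-phoneme step in these examples can be approximated with a few rewrite rules. The sketch below is inferred only from the two samples above (zero initial written as ^, y/w syllables rewritten to i/u/v finals, un expanded to uen, sil at the start and sp sil at the end). It is not the repo's actual mapping table; the ui and iu expansions are extrapolated, and syllables such as zhi or ju may need extra rules.

```python
# Two-letter initials must be tested before the one-letter ones.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]
# Abbreviated finals written out in full (only "un" is confirmed by the samples).
EXPANDED_FINALS = {"un": "uen", "ui": "uei", "iu": "iou"}


def split_syllable(syllable):
    """Split one toned pinyin syllable such as 'wai4' into (initial, final)."""
    body, tone = syllable[:-1], syllable[-1]
    if body.startswith("yu"):
        return "^", "v" + body[2:] + tone
    if body.startswith("yi"):
        return "^", body[1:] + tone
    if body.startswith("y"):
        return "^", "i" + body[1:] + tone
    if body.startswith("wu"):
        return "^", body[1:] + tone
    if body.startswith("w"):
        return "^", "u" + body[1:] + tone
    for initial in INITIALS:
        if body.startswith(initial):
            final = body[len(initial):]
            return initial, EXPANDED_FINALS.get(final, final) + tone
    return "^", body + tone  # no initial, e.g. 'er2'


def to_phonemes(pinyin_line):
    """Convert a whole pinyin line into the 'sil ... sp sil' phoneme string."""
    phones = ["sil"]
    for syllable in pinyin_line.split():
        phones.extend(split_syllable(syllable))
    phones.extend(["sp", "sil"])
    return " ".join(phones)


print(to_phonemes("ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1"))
```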
Training annotation:
./data/wavs/000001.wav|./data/temps/000001.spec.pt|./data/berts/000001.npy|sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil
./data/wavs/000002.wav|./data/temps/000002.spec.pt|./data/berts/000002.npy|sil j ia2 ^ v3 c uen1 ^ ian2 b ie2 z ai4 ^ iong1 b ao4 ^ uo3 sp sil
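Each training line holds four fields separated by |: the wave path, the spectrogram path, the BERT embedding path, and the phoneme sequence. A small parsing sketch; the `TrainItem` type and its field names are hypothetical:

```python
from typing import List, NamedTuple


class TrainItem(NamedTuple):
    wav_path: str
    spec_path: str
    bert_path: str
    phonemes: List[str]


def parse_train_line(line):
    """Parse one 'wav|spec|bert|phonemes' line of the training annotation."""
    wav_path, spec_path, bert_path, phones = line.strip().split("|")
    return TrainItem(wav_path, spec_path, bert_path, phones.split())


item = parse_train_line(
    "./data/wavs/000001.wav|./data/temps/000001.spec.pt|./data/berts/000001.npy|"
    "sil k a2 ^ er2 p u3 p ei2 ^ uai4 s uen1 ^ uan2 h ua2 t i1 sp sil"
)
print(item.wav_path, len(item.phonemes))
```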
This sentence causes an error:
002365 这图#2难不成#2是#1P过的#4?
zhe4 tu2 nan2 bu4 cheng2 shi4 P IY1 guo4 de5
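The likely culprits are the tokens P and IY1, which are not toned pinyin syllables (this is an assumption; the source does not state the cause). A hedged sketch for finding such lines before preparing the data, where a valid token is taken to be lowercase letters followed by a tone digit 1 to 5:

```python
import re

# Assumed shape of a valid token: lowercase pinyin plus a tone digit, e.g. 'zhe4'.
PINYIN_TOKEN = re.compile(r"^[a-z]+[1-5]$")


def find_invalid_tokens(pinyin_line):
    """Return the tokens of a pinyin line that do not look like toned pinyin."""
    return [tok for tok in pinyin_line.split() if not PINYIN_TOKEN.match(tok)]


print(find_invalid_tokens("zhe4 tu2 nan2 bu4 cheng2 shi4 P IY1 guo4 de5"))
```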
Write the correct words and pinyin to the file ./text/pinyin-local.txt:
渐渐 jian4 jian4
浅浅 qian3 qian3
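Each line of the file is a word followed by its pinyin, separated by spaces. The sketch below shows how such lines could be read into a lookup table; how the repo itself loads ./text/pinyin-local.txt is not shown here, and the function name is hypothetical.

```python
def parse_pinyin_overrides(lines):
    """Map each word to its list of pinyin syllables; skip blank or partial lines."""
    overrides = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        overrides[parts[0]] = parts[1:]
    return overrides


overrides = parse_pinyin_overrides(["渐渐 jian4 jian4", "浅浅 qian3 qian3", ""])
print(overrides)
```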
Supported, based on WeTextProcessing from the WeNet open-source community; of course, it cannot be perfect.
python vits_infer_no_bert.py --config ./configs/bert_vits.json --model vits_bert_model.pth
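Although training uses BERT, inference can run entirely without BERT, trading natural pauses for compatibility with low-compute devices such as mobile phones.
Low-resource devices usually synthesize sentence by sentence, so the loss of natural pauses is less noticeable.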
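Sentence-by-sentence synthesis needs a splitter in front of the model. A minimal sketch that cuts the text after sentence-ending punctuation; the punctuation set and the function name are assumptions, not taken from the repo:

```python
import re

# Sentence-ending punctuation, full-width and ASCII variants (assumed set).
SENTENCE = re.compile(r"[^。!?;!?;]+[。!?;!?;]*")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation with each one."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


for sentence in split_sentences("卡尔普陪外孙玩滑梯。假语村言别再拥抱我。"):
    print(sentence)  # each sentence would be synthesized on its own
```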
Export: there will be many warnings, just ignore them
python model_onnx.py --config configs/bert_vits.json --model vits_bert_model.pth
Inference
python vits_infer_onnx.py --model vits-chinese.onnx
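In the implementation, VITS is split into two models, named Encoder and Decoder.
- Encoder includes TextEncoder, DurationPredictor, and others;
- Decoder includes ResidualCouplingBlock, Generator, and others;
The inference logic is split accordingly; in particular, sampling from the prior distribution is placed in the Encoder: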
z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale
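To make the sampling line concrete without depending on PyTorch, here is the same formula on plain Python lists: each element is the mean plus standard Gaussian noise scaled by exp(logs_p) and noise_scale. The function name and the sample values are hypothetical.

```python
import math
import random


def sample_prior(m_p, logs_p, noise_scale, rng=random):
    """Element-wise z_p = m_p + eps * exp(logs_p) * noise_scale, with eps ~ N(0, 1)."""
    return [
        m + rng.gauss(0.0, 1.0) * math.exp(s) * noise_scale
        for m, s in zip(m_p, logs_p)
    ]


m_p = [0.5, -1.0, 2.0]     # made-up prior means
logs_p = [0.0, 0.3, -0.2]  # made-up log standard deviations

deterministic = sample_prior(m_p, logs_p, noise_scale=0.0)  # equals the means
sampled = sample_prior(m_p, logs_p, noise_scale=0.5, rng=random.Random(0))
print(deterministic, sampled)
```

With noise_scale set to 0 the result is just the prior mean; larger values add more variation to the output.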
ONNX streaming model export
python model_onnx_stream.py --config configs/bert_vits.json --model vits_bert_model.pth
ONNX streaming model inference
python vits_infer_onnx_stream.py --encoder vits-chinese-encoder.onnx --decoder vits-chinese-decoder.onnx
In streaming inference, hop_frame is an important parameter; you need to experiment to find a suitable value.
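One way to picture the role of hop_frame is as the number of context frames added on each side of a chunk before decoding, then trimmed from the decoded audio so that chunk boundaries do not click. The sketch below only plans the frame ranges. It is an assumed chunking scheme, not a copy of vits_infer_onnx_stream.py, and `chunk_frames` is a hypothetical parameter.

```python
def plan_chunks(total_frames, chunk_frames, hop_frame):
    """Return (decode_start, decode_end, trim_left, trim_right) for each chunk."""
    plans = []
    start = 0
    while start < total_frames:
        end = min(start + chunk_frames, total_frames)
        left = min(hop_frame, start)                # context before the chunk
        right = min(hop_frame, total_frames - end)  # context after the chunk
        plans.append((start - left, end + right, left, right))
        start = end
    return plans


plans = plan_chunks(total_frames=250, chunk_frames=100, hop_frame=9)
for decode_start, decode_end, trim_left, trim_right in plans:
    print(decode_start, decode_end, trim_left, trim_right)
```

After decoding a chunk, the trimmed frame counts would be multiplied by the vocoder's samples per frame to get the number of audio samples to drop on each side.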
The student model is 53M in size and runs at 3× the speed of the teacher model.
To train:
python train.py -c configs/bert_vits_student.json -m bert_vits_student
To infer, get the student model from the release page
python vits_infer.py --config ./configs/bert_vits_student.json --model vits_bert_student.pth
Download the model from https://huggingface.co/jackyqs/vits-aishell3-175-chinese/tree/main
For details, see https://github.com/csukuangfj/vits_chinese/tree/master/aishell3
Try it at https://huggingface.co/spaces/k2-fsa/text-to-speech
Microsoft's NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
https://github.com/Executedone/Chinese-FastSpeech2 bert prosody
https://github.com/wenet-e2e/WeTextProcessing
https://github.com/jaywalnut310/vits
https://github.com/wenet-e2e/wetts
https://github.com/csukuangfj onnx and android