Mandarin/Chinese Text to Speech based on statistical parametric speech synthesis using the Merlin toolkit
This is only a demo of a Mandarin frontend. It lacks some components, such as text normalization and prosody prediction, and the phone set and question set used by this project have not been fully tested yet.
Draft documentation (written in Mandarin)
No open-source Mandarin speech synthesis dataset is available on the internet, so this project uses the thchs30 dataset to demonstrate speech synthesis.
Listen to samples at https://jackiexiao.github.io/MTTS/
- First, prepare data containing wav files and a txt transcript (prosody marks are optional).
- Second, generate HTS labels using this project.
- Third, use merlin/egs/mandarin_voice to train and generate a Mandarin voice.
Python: python3.6
System: Linux (tested on Ubuntu 16.04)
pip install jieba pypinyin
sudo apt-get install libatlas3-base
Run bash tools/install_mtts.sh
Or download the files yourself:
- Download montreal-forced-aligner and unzip it into the tools/ directory
- Download the acoustic model thchs30.zip and copy it to the misc/ directory
Run Demo
bash run_demo.sh
- Usage: Run
python src/mtts.py txtfile wav_directory_path output_directory_path
(Paths may be absolute or relative.) This produces the HTS labels. If you have your own acoustic model trained with montreal-forced-aligner, add -a your_acoustic_model.zip; otherwise, this project uses the thchs30.zip acoustic model by default.
- Attention: Currently only Chinese characters are supported. The txt file must not contain any Arabic numerals or English letters.
txtfile example
A_01 这是一段文本
A_02 这是第二段文本
wav_directory example (sampling rate should be higher than 16 kHz)
A_01.wav
A_02.wav
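Before running the alignment, it can help to check the inputs against the constraints described above. The following is a minimal sketch and not part of this project: the function names parse_txt_lines and check_inputs are hypothetical. It assumes each txt line is an utterance ID followed by whitespace and the text, and that each recording is a PCM file named ID.wav. It also treats a rate of exactly 16 kHz as acceptable, which is an assumption; adjust min_rate if needed.

```python
import re
import wave
from pathlib import Path

# Arabic digits and ASCII letters are not supported in the text.
_UNSUPPORTED = re.compile(r"[0-9A-Za-z]")


def parse_txt_lines(lines):
    """Split 'ID text' lines into (utt_id, text) pairs, skipping blank lines."""
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        text = parts[1].strip() if len(parts) > 1 else ""
        pairs.append((parts[0], text))
    return pairs


def check_inputs(txt_path, wav_dir, min_rate=16000):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    with open(txt_path, encoding="utf-8") as f:
        pairs = parse_txt_lines(f)
    for utt_id, text in pairs:
        if not text:
            problems.append(f"{utt_id}: empty text")
        elif _UNSUPPORTED.search(text):
            problems.append(f"{utt_id}: text contains Arabic digits or ASCII letters")
        wav_path = Path(wav_dir) / f"{utt_id}.wav"
        if not wav_path.is_file():
            problems.append(f"{utt_id}: missing {wav_path.name}")
            continue
        # The stdlib wave module reads uncompressed PCM wav files only.
        with wave.open(str(wav_path), "rb") as w:
            rate = w.getframerate()
        # Assumption: a rate equal to min_rate is acceptable.
        if rate < min_rate:
            problems.append(f"{utt_id}: sampling rate {rate} Hz is below {min_rate} Hz")
    return problems
```

For example, check_inputs('txtfile', 'wav_directory_path') could be called before python src/mtts.py, printing each returned problem.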
- Usage: Run
python src/mandarin_frontend.py txtfile output_directory_path
- or import mandarin_frontend
from mandarin_frontend import txt2label
result = txt2label('向香港特别行政区同胞澳门和**同胞海外侨胞')
for line in result:
    print(line)
# With prosody marks and an alignment file (sfs file):
# result = txt2label('向#1香港#2特别#1行政区#1同胞#4澳门#2和#1**#1同胞#4海外#1侨胞',
#                    sfsfile='example_file/example.sfs')
See the source code for more information, but pay attention to the alignment file (sfs file): the format of each line is end_time phone_type, not start_time phone_type (which is different from Speech Ocean's data).
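Because each sfs line stores only the end time of a phone, the start time has to be taken from the previous line. The sketch below illustrates that conversion. It is not this project's parser: parse_sfs is a hypothetical name, the phone labels in the usage note are made up, and it assumes whitespace-separated fields, numeric times in a single consistent unit, and a first segment that starts at time zero.

```python
def parse_sfs(lines):
    """Turn 'end_time phone_type' lines into (start, end, phone) tuples."""
    segments = []
    start = 0.0  # assumption: the first segment starts at time zero
    for raw in lines:
        fields = raw.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        end = float(fields[0])
        phone = fields[1]
        segments.append((start, end, phone))
        start = end  # the next phone begins where this one ends
    return segments
```

With the hypothetical input lines "0.5 sil" and "0.75 x", this yields (0.0, 0.5, 'sil') followed by (0.5, 0.75, 'x').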
This project uses Montreal-Forced-Aligner to do forced alignment.
- We trained the acoustic model on the thchs30 dataset (see misc/thchs30.zip); the dictionary we use is mandarin_mtts.lexicon. If you use a larger dataset than thchs30, you may get better alignment.
- If you want to use MFA's (montreal-forced-aligner) pre-trained Mandarin model, the dictionary you need is mandarin-for-montreal-forced-aligner-pre-trained-model.lexicon
You can generate HTS labels without prosody marks. We assume that a word segment is smaller than a prosodic word (this is adjusted in the code).
"#0","#1", "#2","#3" and "#4" are the prosody labeling symbols.
- #0 stands for word segment
- #1 stands for prosodic word
- #2 stands for stressed word (in this project we actually regard it as #1)
- #3 stands for prosodic phrase
- #4 stands for intonational phrase
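To make the marking scheme concrete, the sketch below splits a marked sentence into segments paired with the level of the mark that follows each one. It is only an illustration and not the code used in mandarin_frontend: split_prosody and strip_marks are hypothetical names. Following the note above, #2 is folded into #1. A final segment with no trailing mark gets level None, which is a choice made for this sketch.

```python
import re

# Matches the prosody marks #0 to #4 and captures the digit.
_MARK = re.compile(r"#([0-4])")


def split_prosody(text):
    """Return (segment, level) pairs for a sentence with #0..#4 marks."""
    # With one capturing group, re.split alternates segment, level, segment, ...
    pieces = _MARK.split(text)
    result = []
    for i in range(0, len(pieces) - 1, 2):
        segment = pieces[i]
        level = int(pieces[i + 1])
        if level == 2:
            level = 1  # this project treats #2 as #1
        if segment:
            result.append((segment, level))
    if pieces[-1]:
        # Trailing text has no mark after it.
        result.append((pieces[-1], None))
    return result


def strip_marks(text):
    """Remove all prosody marks, leaving the plain sentence."""
    return _MARK.sub("", text)
```

For instance, split_prosody('向#1香港#2特别') gives [('向', 1), ('香港', 1), ('特别', None)].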
- Text Normalization
- Better Chinese word segmentation
- G2P: Polyphone Problem
- Better Label format and Question Set
- Improved prosody analysis
- Better alignment
- Jackiexiao
- willian56