text-to-speech: a repo for producing natural, human-like speech
NaturalTTS is revised from NaturalSpeech2, replacing its diffusion model with a flow.
It uses WaveNet and attention, as described in Figure 4 of the NaturalSpeech2 paper, with two small changes: 1) keep the flow architecture from VITS; 2) wrap the WaveNet, adding attention and FiLM to it, as described in Figure 4 (see the sketch below).
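A minimal sketch of that wrapper, assuming frame-level latents, a prompt sequence for cross-attention, and a single global conditioning vector; all module names and shapes here are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift features from a conditioning vector."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):
        # x: (B, C, T), cond: (B, cond_dim)
        scale, shift = self.proj(cond).chunk(2, dim=-1)
        return x * scale.unsqueeze(-1) + shift.unsqueeze(-1)

class AttnFiLMWaveNet(nn.Module):
    """Wrap a WaveNet stack: cross-attend to the prompt, then modulate with FiLM."""
    def __init__(self, wavenet, channels, cond_dim, num_heads=4):
        super().__init__()
        self.wavenet = wavenet  # any (B, C, T) -> (B, C, T) module
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.film = FiLM(channels, cond_dim)

    def forward(self, x, prompt, cond):
        # x: (B, C, T) latents; prompt: (B, T_p, C) reference frames; cond: (B, cond_dim)
        h = self.wavenet(x)
        q = h.transpose(1, 2)                     # (B, T, C) for attention
        attn_out, _ = self.attn(q, prompt, prompt)
        h = h + attn_out.transpose(1, 2)          # residual over the WaveNet output
        return self.film(h, cond)
```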
- re-sample audio files: dataset/resample.py (a minimal sketch follows after this list)
- tokenize text and convert it to phonemes: preprocess.py
- generate pitch from audio using preprocess.gen_pitch.py
all run scripts are in the recipes folder, one folder per model
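A minimal sketch of the resampling step (the 22050 Hz target and paths are assumptions; dataset/resample.py is the real script):

```python
import librosa
import soundfile as sf

def resample_file(src_path, dst_path, target_sr=22050):
    # librosa resamples on load when sr is given (mono by default)
    wav, _ = librosa.load(src_path, sr=target_sr)
    sf.write(dst_path, wav, target_sr)

resample_file("raw/utt_0001.wav", "data/utt_0001.wav")
```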
#- pitch (f0) can be generated using any of the following methods; we use the 4th (FCNF0++):
- using pysptk.sptk.rapt from the pysptk project
- using librosa.pyin from the librosa project; util.audio_process.py provides compute_f0() for it, but it is very slow
- using pyworld, as in NaturalSpeech2: https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder (see the sketch after this list)
- using FCNF0++, the latest pitch-detection model; code in https://github.com/interactiveaudiolab/penn
- model checkpoint: https://huggingface.co/maxrmorrison/fcnf0-plus-plus/tree/main
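A minimal f0 sketch with pyworld (method 3); the 256-sample hop is an assumption, shown here to line frame_period up with an acoustic-feature hop size:

```python
import numpy as np
import pyworld as pw
import soundfile as sf

wav, sr = sf.read("data/utt_0001.wav")
wav = wav.astype(np.float64)            # pyworld expects float64

# DIO gives a fast raw f0 track; StoneMask refines it
f0, t = pw.dio(wav, sr, frame_period=256 / sr * 1000)  # frame_period in ms
f0 = pw.stonemask(wav, f0, t, sr)
```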
#- duration:
- durations were first generated with the VITS model, by running VITS inference to extract them; see preprocess.gen_audio_stat.py
- that failed, so we switched to the Montreal-Forced-Aligner (MFA) 2.1 tool; see https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner and this useful introduction: https://techfirst.medium.com/forced-alignment-how-to-match-audio-with-a-transcript-via-machine-learning-dd19da8c0f04 (a duration-extraction sketch follows after this list)
- there are many tools for aligning audio with text; start from https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner
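Once MFA has written TextGrid alignments, per-phone durations in frames can be derived as below; a sketch assuming the `textgrid` package and a 256-sample hop at 22.05 kHz (both assumptions):

```python
import textgrid  # pip install textgrid

def phone_durations(tg_path, sr=22050, hop=256):
    tg = textgrid.TextGrid.fromFile(tg_path)
    phones = next(tier for tier in tg.tiers if tier.name == "phones")
    # convert each phone interval's span in seconds to a frame count
    return [(iv.mark, int(round((iv.maxTime - iv.minTime) * sr / hop))) for iv in phones]

print(phone_durations("aligned/utt_0001.TextGrid"))
```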
#- speaker: speaker embeddings are generated with the H/ASP model (paper: "Clova baseline system for the VoxCeleb speaker recognition challenge"); a sketch follows below
- speaker embedding encoder checkpoint: https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar
- config: https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/config_se.json
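A sketch of computing an embedding with that checkpoint through Coqui TTS's SpeakerManager; the class and method names vary across TTS versions, so treat this interface as an assumption:

```python
# assumed Coqui TTS interface; check against your installed version
from TTS.tts.utils.speakers import SpeakerManager

manager = SpeakerManager(
    encoder_model_path="model_se.pth.tar",  # H/ASP checkpoint above
    encoder_config_path="config_se.json",
)
embedding = manager.compute_embedding_from_clip("data/utt_0001.wav")
```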
- about symbols for Chinese and Japanese, see details in yl4579/StyleTTS#10
- extract emotion from a wav file using a wav2vec2 model: https://github.com/audeering/w2v2-how-to/blob/main/notebook.ipynb (a sketch follows below)
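A simplified sketch: the audeering notebook adds a regression head for arousal/dominance/valence, while here we just mean-pool the wav2vec2 hidden states as an utterance-level emotion feature (the pooling shortcut is our assumption; the model name is from their notebook):

```python
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name)   # backbone only; the regression head is dropped

wav, sr = librosa.load("data/utt_0001.wav", sr=16000)  # wav2vec2 expects 16 kHz
inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, T, 1024)
emotion_emb = hidden.mean(dim=1)                # (1, 1024) pooled feature
```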
#- projects
- https://github.com/yangdongchao/AcademiCodec
- https://github.com/CODEJIN/NaturalSpeech2
- https://github.com/heatz123/naturalspeech
- https://github.com/coqui-ai/TTS
- a useful course series on audio AI: https://www.youtube.com/watch?v=iCwMQJnKk2c&list=PL-wATfeyAMNqIee7cH3q1bh4QJFAaeNv0
- how to augment audio for AI research: https://www.youtube.com/watch?v=bm1cQfb_pLA&list=PL-wATfeyAMNoR4aqS-Fv0GRmS6bx5RtTW&index=2
- audio augmentation tools (a sketch using audiomentations follows after this list):
- librosa
- audiomentations
- torch-audiomentations
- torchaudio.transforms
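A small audiomentations pipeline as a starting point (the transform choices and parameters are assumptions):

```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

wav = np.random.randn(22050).astype(np.float32)  # stand-in for a loaded waveform
augmented = augment(samples=wav, sample_rate=22050)
```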
- useful checkpoints
- H/ASP speaker embedding model checkpoint: https://github.com/coqui-ai/TTS/releases/download/speaker_encoder_model/model_se.pth.tar
- RVQ quantizer checkpoint: https://huggingface.co/Dongchao/AcademiCodec/tree/main
- many released models are listed in the TTS wiki: https://github.com/mozilla/TTS/wiki/Released-Models
- the learning rate matters a lot: training VITS with a large learning rate failed, but 2e-4 succeeded; fastai's find_lr method can be used to find the best learning rate (a plain-PyTorch sketch follows below)
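A plain-PyTorch LR range test in the spirit of fastai's find_lr (the sweep bounds and step count are assumptions; pick the LR just before the loss starts to blow up):

```python
import math
from itertools import cycle

import torch

def lr_range_test(model, loss_fn, loader, lr_min=1e-7, lr_max=1.0, steps=100):
    """Sweep the LR geometrically, one batch per step, recording (lr, loss)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / steps)
    history, batches = [], cycle(loader)
    for step in range(steps):
        lr = lr_min * gamma ** step
        for group in opt.param_groups:
            group["lr"] = lr
        x, y = next(batches)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        history.append((lr, loss.item()))
        if not math.isfinite(loss.item()):
            break  # diverged; stop the sweep
    return history
```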
- early in training, loss=NaN happens easily; even a simple weight_norm(Conv1d(in_channels, upsample_initial_channel, 7, 1, padding=3)) can turn the result into NaN (see the debugging sketch below)
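Two standard PyTorch guards that help track such NaNs down (generic debugging, not repo code; the channel sizes are assumptions):

```python
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)  # report the op that first yields NaN/Inf in backward

# the layer from the note above, with assumed channel sizes
conv = nn.utils.weight_norm(nn.Conv1d(80, 512, 7, 1, padding=3))
opt = torch.optim.AdamW(conv.parameters(), lr=2e-4)

x = torch.randn(4, 80, 100)
loss = conv(x).pow(2).mean()
loss.backward()
# clip gradients so one bad batch early in training cannot blow up the weights
torch.nn.utils.clip_grad_norm_(conv.parameters(), max_norm=1.0)
opt.step()
```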
- tried training with both forward KL and backward KL several times, and all attempts failed; the corresponding part of the NaturalSpeech paper seems problematic, and in principle two KL terms should not be needed either
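For reference, the two directions differ only in which distribution the expectation is taken under (standard definitions; mapping them onto the paper's forward/backward naming is our reading):

$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{z \sim q}\left[\log \frac{q(z)}{p(z)}\right], \qquad \mathrm{KL}(p \,\|\, q) = \mathbb{E}_{z \sim p}\left[\log \frac{p(z)}{q(z)}\right]$$

Minimizing either direction alone already pulls q and p together, which is why a single KL term should suffice in principle.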