Pretrain the decoder so that less parallel (text-audio) corpus is needed.
Paper: Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis
Reference: Rayhane-mamah/Tacotron-2
- pretrain decoder in Tacotron-2
- put wavs into one dir, the hierarchy should be:
project
└───decoder_pretrain_wavs
│   │
│   └───haitian
│   │   000001.wav
│   │   000002.wav
│   │   ...
│   │
│   └───biaobei
│   │   10001.wav
│   │   00003.wav
│   │   ...
then run
python /utils/prepare_pretrain_decoder_corpus.py
- update hparams.py
decoder_pretrain=True,
decoder_init_checkpoint='',
restore_pretrain_decoder=False,
- preprocess corpus
run
python preprocess.py
- train model
run
python train.py
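Before running the preparation script, it can help to confirm that the wavs really follow the decoder_pretrain_wavs layout described above (one sub-directory per speaker, wav files directly inside). The following is only a minimal sketch of such a check; the function name `collect_pretrain_wavs` is hypothetical and is not part of this repository.

```python
from pathlib import Path
from typing import Dict, List


def collect_pretrain_wavs(root: str) -> Dict[str, List[str]]:
    """Map each speaker sub-directory of `root` to its sorted .wav file names.

    Hypothetical helper: it only inspects the directory layout and does not
    read or validate the audio itself.
    """
    speakers: Dict[str, List[str]] = {}
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # ignore stray files next to the speaker folders
        wavs = sorted(p.name for p in speaker_dir.glob("*.wav"))
        if wavs:
            speakers[speaker_dir.name] = wavs
    return speakers
```

Calling `collect_pretrain_wavs("decoder_pretrain_wavs")` from the project root should list speakers such as haitian and biaobei; a speaker missing from the result points to a misplaced folder.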
- train whole Tacotron-2 with pretrained decoder
- put wavs into one dir, the hierarchy should be:
project
└───biaobei
│   │
│   └───wavs
│   │   000001.wav
│   │   000002.wav
│   │   ...
│   │
│   └───biaobei.corpus
The minimum amount of audio data required is still being tested.
- update hparams.py
decoder_pretrain=False,
decoder_init_checkpoint='', # Specify the location of the pre-trained model, e.g. tacotron_model.ckpt-160000
restore_pretrain_decoder=True,
- preprocess corpus
run
python preprocess.py
- train model
run
python train.py
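The two phases above differ only in three hparams.py values. As a hedged summary, the sketch below returns those values per phase; the phase names `pretrain_decoder` and `train_full` and the function itself are hypothetical and exist only for illustration, while the three keys and their values are the ones listed in this README.

```python
def decoder_phase_overrides(phase: str, pretrained_checkpoint: str = "") -> dict:
    """Return the hparams.py values this README lists for each phase.

    `phase` names are illustrative assumptions, not identifiers from the code base.
    """
    if phase == "pretrain_decoder":
        # Phase 1: train the decoder alone on audio-only data.
        return {
            "decoder_pretrain": True,
            "decoder_init_checkpoint": "",
            "restore_pretrain_decoder": False,
        }
    if phase == "train_full":
        # Phase 2: train the whole model, starting from the pre-trained decoder.
        if not pretrained_checkpoint:
            raise ValueError("train_full needs the path of the pre-trained decoder checkpoint")
        return {
            "decoder_pretrain": False,
            "decoder_init_checkpoint": pretrained_checkpoint,
            "restore_pretrain_decoder": True,
        }
    raise ValueError(f"unknown phase: {phase!r}")
```

Raising an error when the checkpoint path is empty in the second phase is a design choice of this sketch: it catches the easy mistake of flipping the flags without pointing at a checkpoint.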
- On our server, the default configuration leads to an OOM (out of memory) error. To decrease memory usage, follow these instructions:
- In hparams.py: increase outputs_per_step. Max: 3, default: 1.
- In hparams.py: set clip_mels_length to True. Default: False. If you still get an OOM error, decrease max_mel_frames. Default: 1000.
- In tacotron/feeder.py: decrease _batches_per_group. Default: 64; the previous version used 32.
- In hparams.py, trim_top_db is related to generated wavs stopping suddenly and to a reduction in training set duration. Recommended value: 63, default: 23. With the default setting, if you invert a preprocessed wav file, the sound stops suddenly. This also leads to a shorter training set duration.
- In hparams.py, set cleaners to basic_cleaners if you train the model on Mandarin.
- The multi-GPU version does not seem to train any faster.
- min training set duration: 100%: 10:/tacotron2; 75%: 13:/tacotron2_share75; 50%: 10:/tacotron2_share50
- long sentences (clip_mels_length=False): short: 10:/tacotron2; long: 13:/tacotron_long_sentences
- min training steps: 10:/tacotron2 and 10:/tacotron2_share50 save every checkpoint file.
- check if not trim_max: long sentences
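To make the memory-related settings above more concrete, here is a small arithmetic sketch. It rests on two assumptions that this README does not state explicitly: that the decoder emits outputs_per_step mel frames per iteration (so the number of decoder iterations shrinks accordingly), and that clip_mels_length=True drops utterances longer than max_mel_frames during preprocessing. Both function names are hypothetical.

```python
import math
from typing import List


def decoder_steps(mel_frames: int, outputs_per_step: int) -> int:
    """Decoder iterations needed when each iteration emits `outputs_per_step` frames.

    Assumption: fewer iterations means less unrolled state to keep in memory.
    """
    return math.ceil(mel_frames / outputs_per_step)


def clip_by_mel_length(frame_counts: List[int], max_mel_frames: int) -> List[int]:
    """Keep only utterances whose mel length does not exceed `max_mel_frames`.

    Assumption: this mirrors what clip_mels_length=True does to the training set.
    """
    return [n for n in frame_counts if n <= max_mel_frames]
```

Under these assumptions, an utterance of 1000 mel frames needs 1000 decoder iterations with outputs_per_step=1 but only 334 with outputs_per_step=3, and lowering max_mel_frames removes the longest utterances first, which is also why it shortens the training set.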
TensorFlow implementation of DeepMind's Tacotron-2, a deep neural network architecture described in this paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Tacotron-2
├── datasets
├── en_UK (0)
│ └── by_book
│ └── female
├── en_US (0)
│ └── by_book
│ ├── female
│ └── male
├── LJSpeech-1.1 (0)
│ └── wavs
├── logs-Tacotron (2)
│ ├── eval_-dir
│ │ ├── plots
│ │ └── wavs
│ ├── mel-spectrograms
│ ├── plots
│ ├── pretrained
│ └── wavs
├── logs-Wavenet (4)
│ ├── eval-dir
│ │ ├── plots
│ │ └── wavs
│ ├── plots
│ ├── pretrained
│ └── wavs
├── papers
├── tacotron
│ ├── models
│ └── utils
├── tacotron_output (3)
│ ├── eval
│ ├── gta
│ ├── logs-eval
│ │ ├── plots
│ │ └── wavs
│ └── natural
├── wavenet_output (5)
│ ├── plots
│ └── wavs
├── training_data (1)
│ ├── audio
│ ├── linear
│ └── mels
└── wavenet_vocoder
    └── models
The previous tree shows the current state of the repository (separate training, one step at a time).
- Step (0): Get your dataset, here I have set the examples of Ljspeech, en_US and en_UK (from M-AILABS).
- Step (1): Preprocess your data. This will give you the training_data folder.
- Step (2): Train your Tacotron model. Yields the logs-Tacotron folder.
- Step (3): Synthesize/Evaluate the Tacotron model. Gives the tacotron_output folder.
- Step (4): Train your Wavenet model. Yields the logs-Wavenet folder.
- Step (5): Synthesize audio using the Wavenet model. Gives the wavenet_output folder.
Note:
- Our preprocessing only supports Ljspeech and Ljspeech-like datasets (M-AILABS speech data)! If running on datasets stored differently, you will probably need to make your own preprocessing script.
- In the previous tree, files were not represented and max depth was set to 3 for simplicity.
- If you run training of both models at the same time, repository structure will be different.
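Since steps (1) to (5) each produce a named folder, one rough way to see how far a run has progressed is to check which folders exist. The sketch below does that with the folder names from the tree above; the helper `next_pending_step` is hypothetical, and an existing folder does not prove that the step finished successfully.

```python
from pathlib import Path
from typing import List, Tuple

# (step number, folder produced by that step), as listed in the tree above
PIPELINE_OUTPUTS: List[Tuple[int, str]] = [
    (1, "training_data"),
    (2, "logs-Tacotron"),
    (3, "tacotron_output"),
    (4, "logs-Wavenet"),
    (5, "wavenet_output"),
]


def next_pending_step(repo_root: str) -> int:
    """Return the first step whose output folder is missing, or 0 if all exist."""
    root = Path(repo_root)
    for step, folder in PIPELINE_OUTPUTS:
        if not (root / folder).is_dir():
            return step
    return 0
```

This only applies to the separate, one-step-at-a-time training layout; as noted above, the structure differs when both models are trained together.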
Pre-trained models and audio samples will be added at a later date. You can, however, check some preliminary insights into the model's performance (at early stages of training) here. THIS IS VERY OUTDATED, I WILL UPDATE THIS SOON
The model described by the authors can be divided in two parts:
- Spectrogram prediction network
- Wavenet vocoder
For an in-depth exploration of the model architecture, training procedure and preprocessing logic, refer to our wiki.
For an overview of our progress on this project, please refer to this discussion.
Since the two parts of the global model are trained separately, we can start by training the feature prediction model and use its predictions later during the WaveNet training.
First, you need to have Python 3 installed along with TensorFlow.
Next, install the requirements. If you are an Anaconda user, run the command below as is (otherwise replace pip with pip3 and python with python3):
pip install -r requirements.txt
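As a quick sanity check of the two prerequisites named above (Python 3 and TensorFlow), something like the following can be run before training. It is a minimal sketch using only the standard library; `environment_ready` is a hypothetical name, and the check only confirms that a `tensorflow` package can be located, not which version is installed.

```python
import importlib.util
import sys


def environment_ready() -> bool:
    """True when running on Python 3 and a `tensorflow` package can be located."""
    is_python3 = sys.version_info.major >= 3
    # find_spec returns None when the top-level package is not installed.
    has_tensorflow = importlib.util.find_spec("tensorflow") is not None
    return is_python3 and has_tensorflow
```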
We tested the code above on the LJSpeech dataset, which has almost 24 hours of labeled recordings from a single female speaker. (Further information on the dataset is available in the README file included in the download.)
We are also currently running tests on the new M-AILABS speech dataset, which contains more than 700 hours of speech (more than 80 GB of data) in more than 10 languages.
After downloading the dataset, extract the compressed file, and place the folder inside the cloned repository.
Before proceeding, you must pick the hyperparameters that best suit your needs. While it is possible to change the hyperparameters from the command line during preprocessing/training, I still recommend making the changes once and for all directly in the hparams.py file.
To pick optimal FFT parameters, I have made a griffin_lim_synthesis_tool notebook that you can use to invert real extracted mel/linear spectrograms and judge how good your preprocessing is. All other options are explained in hparams.py and have meaningful names, so you can experiment with them.
Before running the following steps, please make sure you are inside the Tacotron-2 folder:
cd Tacotron-2
Preprocessing can then be started using:
python preprocess.py
The dataset can be chosen using the --dataset argument. If using the M-AILABS dataset, you need to provide the language, voice, reader, merge_books and book arguments to match your needs. The default is Ljspeech.
Example M-AILABS:
python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'
or if you want to use all books for a single speaker:
python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True
This should take no longer than a few minutes.
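The two M-AILABS examples above differ only in whether a single book is selected. The sketch below builds the corresponding argument list from Python, for instance to pass to `subprocess.run`; the helper name `mailabs_preprocess_args` is hypothetical, and only the flags shown in the examples above are used.

```python
from typing import List, Optional


def mailabs_preprocess_args(language: str, voice: str, reader: str,
                            book: Optional[str] = None) -> List[str]:
    """Build the argument list for preprocess.py on M-AILABS data.

    With `book` set, only that book is used (--merge_books=False);
    without it, all books of the reader are merged (--merge_books=True).
    """
    args = [
        "python", "preprocess.py",
        "--dataset=M-AILABS",
        f"--language={language}",
        f"--voice={voice}",
        f"--reader={reader}",
        f"--merge_books={book is None}",
    ]
    if book is not None:
        args.append(f"--book={book}")
    return args
```

Because the result is a list rather than a shell string, the quotes around values that appear in the shell examples are not needed here.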
To train both models sequentially (one after the other):
python train.py --model='Tacotron-2'
Feature prediction model can separately be trained using:
python train.py --model='Tacotron'
Checkpoints are saved every 5000 steps and stored under the logs-Tacotron folder.
Naturally, training the wavenet separately is done by:
python train.py --model='WaveNet'
Logs will be stored inside logs-Wavenet.
Note:
- If model argument is not provided, training will default to Tacotron-2 model training. (both models)
- Please refer to train arguments under train.py for a set of options you can use.
- It is now possible to run the WaveNet preprocessing on its own using wavenet_preprocess.py.
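With a checkpoint written every 5000 steps, the checkpoint folder fills up quickly, and a value such as decoder_init_checkpoint needs one specific checkpoint prefix. The sketch below picks the highest-step prefix from a list of file names. It assumes names of the form `tacotron_model.ckpt-160000` (the example used earlier in this README), optionally followed by the suffixes TensorFlow adds such as `.index` or `.meta`; the function name is hypothetical.

```python
import re
from typing import Iterable, Optional

# Matches "<anything>.ckpt-<step>" with an optional trailing ".suffix".
_CKPT_PATTERN = re.compile(r"^(?P<prefix>.+\.ckpt-(?P<step>\d+))(\..*)?$")


def latest_checkpoint_prefix(file_names: Iterable[str]) -> Optional[str]:
    """Return the checkpoint prefix with the highest step, or None if none match."""
    best_step, best_prefix = -1, None
    for name in file_names:
        match = _CKPT_PATTERN.match(name)
        if match and int(match.group("step")) > best_step:
            best_step = int(match.group("step"))
            best_prefix = match.group("prefix")
    return best_prefix
```

The function works on plain file names, so it can be fed the output of `os.listdir` on whichever directory the checkpoints were written to.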
To synthesize audio in an End-to-End (text to audio) manner (both models at work):
python synthesize.py --model='Tacotron-2'
For the spectrogram prediction network (separately), there are three types of mel spectrograms synthesis:
- Evaluation (synthesis on custom sentences). This is what we'll usually use after having a full end to end model.
python synthesize.py --model='Tacotron' --mode='eval'
- Natural synthesis (let the model make predictions alone by feeding last decoder output to the next time step).
python synthesize.py --model='Tacotron' --GTA=False
- Ground Truth Aligned synthesis (DEFAULT: the model is assisted by true labels in a teacher forcing manner). This synthesis method is used when predicting mel spectrograms used to train the wavenet vocoder. (yields better results as stated in the paper)
python synthesize.py --model='Tacotron' --GTA=True
Synthesizing the waveforms conditioned on previously synthesized mel spectrograms (separately) can be done with:
python synthesize.py --model='WaveNet'
Note:
- If model argument is not provided, synthesis will default to Tacotron-2 model synthesis. (End-to-End TTS)
- Please refer to synthesis arguments under synthesize.py for a set of options you can use.
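The synthesis variants described above can be summarized as a lookup from variant to command-line arguments. This is only an illustrative sketch: the dictionary keys (`end_to_end`, `eval`, `natural`, `gta`, `wavenet`) are hypothetical labels, while the arguments themselves are the ones shown in the commands above.

```python
from typing import Dict, List

# Hypothetical labels mapped to the synthesize.py arguments shown in this README.
SYNTHESIS_ARGS: Dict[str, List[str]] = {
    "end_to_end": ["--model=Tacotron-2"],           # text to audio, both models
    "eval": ["--model=Tacotron", "--mode=eval"],    # custom sentences
    "natural": ["--model=Tacotron", "--GTA=False"],  # decoder fed its own outputs
    "gta": ["--model=Tacotron", "--GTA=True"],      # teacher forcing with true mels
    "wavenet": ["--model=WaveNet"],                 # waveforms from synthesized mels
}


def synthesize_command(kind: str) -> List[str]:
    """Return the full argument list for one synthesis variant."""
    if kind not in SYNTHESIS_ARGS:
        raise ValueError(f"unknown synthesis kind: {kind!r}")
    return ["python", "synthesize.py", *SYNTHESIS_ARGS[kind]]
```

For example, `synthesize_command("gta")` yields the Ground Truth Aligned command used to produce the mel spectrograms for WaveNet training.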