AISHELL-3 is a multi-speaker Mandarin Chinese audio corpus. This repository contains the acoustic model for the multi-speaker TTS baseline system described in AISHELL-3: A Multispeaker Mandarin Chinese TTS corpus (arXiv:2010.11567 [cs.SD]).
Audio samples can be found here. Dataset link on OpenSLR: openslr/93
synthesizer, feedback_synthesizer and dca_synthesizer define the model architectures used in this project. All of them are extended Tacotron-2 models and share the same file structure.
- synthesizer is a plain multi-speaker Tacotron-2 model, which uses 256-dimensional speaker embeddings as its speaker representation.
- dca_synthesizer implements Dynamic Convolution Attention as a replacement for Tacotron-2's hybrid attention.
- feedback_synthesizer implements the speaker embedding feedback constraint on the acoustic model. The speaker encoder network used in feedback_synthesizer is located under deep_speaker.
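As an illustration of what "speaker embeddings as speaker representation" can mean in a Tacotron-2 style model, the sketch below tiles one 256-dimensional embedding over time and appends it to every encoder frame. This is a common conditioning scheme, not a confirmed description of this repository's code; the function name and the encoder width of 512 are assumptions made for the example.

```python
import numpy as np


def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Append the same speaker embedding to every encoder time step.

    encoder_outputs:   array of shape (time_steps, encoder_dim)
    speaker_embedding: array of shape (embed_dim,)
    Hypothetical helper for illustration; not taken from the repository.
    """
    time_steps = encoder_outputs.shape[0]
    # Repeat the embedding once per time step: (time_steps, embed_dim).
    tiled = np.tile(speaker_embedding[np.newaxis, :], (time_steps, 1))
    # Concatenate along the feature axis.
    return np.concatenate([encoder_outputs, tiled], axis=-1)


encoder_outputs = np.zeros((50, 512), dtype=np.float32)  # assumed encoder width
speaker_embedding = np.ones(256, dtype=np.float32)       # 256-dim, as stated above
conditioned = condition_on_speaker(encoder_outputs, speaker_embedding)
print(conditioned.shape)  # (50, 768)
```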
- process_audio.ipynb is an offline audio feature extraction script, which is used to build the datasets sub-directories.
- synthesizer_train.py & fc_synthesizer_train.py. We employ a two-step strategy in training the baseline acoustic model: first we train a constraint-free model using synthesizer_train.py, then we fine-tune the pre-trained model under the feedback constraint, using the same hyper-parameters, with fc_synthesizer_train.py.
- gvector_extraction.py is used to batch-infer speaker embeddings from mel-spectrograms.
- debug_syn.ipynb shows the acoustic feature synthesis procedure using trained models.
- vad.ipynb & longer_sentences.ipynb are used to produce augmented training samples. vad.ipynb trims initial silence segments from the mel-spectrograms using a naive energy-based VAD approach. longer_sentences.ipynb produces longer training sentences by concatenating existing samples.
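The two augmentation steps can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the function names, the use of the per-frame mean as the energy measure, the threshold value, and the space used to join texts are all assumptions.

```python
import numpy as np


def trim_leading_silence(mel, threshold, margin_frames=2):
    """Drop initial frames whose mean energy stays at or below `threshold`.

    mel: array of shape (n_frames, n_mels). The threshold is expressed in the
    same (assumed) units as the mel values.
    """
    frame_energy = mel.mean(axis=1)
    voiced = np.flatnonzero(frame_energy > threshold)
    if voiced.size == 0:
        return mel  # no frame exceeds the threshold; leave the sample untouched
    # Keep a small margin before the first voiced frame.
    start = max(int(voiced[0]) - margin_frames, 0)
    return mel[start:]


def concatenate_samples(mels, texts, separator=" "):
    """Build one longer sample by joining mel-spectrograms and their texts."""
    return np.concatenate(mels, axis=0), separator.join(texts)
```

For example, a sample with 10 silent frames followed by 20 voiced frames keeps 22 frames when `margin_frames=2`.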
The datasets directory is intended to host the training data, with one sub-directory for each dataset used in the experiment. This layout is not hard-coded into the scripts, so you may organize the data however you like, as long as the dataset directory provided to the training scripts fulfills the requirements listed in the usage notes below.
A skeleton (incomplete) dataset directory is provided in the project (datasets/aishell3). In this directory we provide the preprocessed train-set texts (with phoneme and prosodic labels) and the averaged speaker embeddings, as metadata.csv and mean_embeddings respectively.
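Averaged speaker embeddings of this kind can be computed along the following lines. The sketch assumes each utterance embedding comes paired with a speaker id and that a plain arithmetic mean is used; the function name and input format are hypothetical, and the layout of mean_embeddings itself is not described here.

```python
from collections import defaultdict

import numpy as np


def average_speaker_embeddings(labelled_embeddings):
    """Return {speaker_id: mean embedding} from (speaker_id, embedding) pairs."""
    grouped = defaultdict(list)
    for speaker_id, embedding in labelled_embeddings:
        grouped[speaker_id].append(np.asarray(embedding, dtype=np.float32))
    # Average all utterance-level vectors that belong to the same speaker.
    return {spk: np.mean(vectors, axis=0) for spk, vectors in grouped.items()}
```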
Replace <name> in the following code blocks with appropriate values. Detailed usage of the Jupyter notebooks is described in the notebooks' markdown blocks and comment sections.
We use Anaconda to manage our virtual environment. An exported conda environment description file is provided as environment.yaml. Use conda to create a new virtual environment in order to run the following scripts and notebooks:
$ conda env create -f environment.yaml
This will create a new conda env named aishell3.
- Download the pre-trained checkpoints from this repository's release page. (Checkpoints for a pretrained acoustic model and a speaker encoder are provided there. For the pretrained WaveRNN model used in the synthesis demo (debug_syn.ipynb), please see this repo for information.)
- Use debug_syn.ipynb to load the model and run inference.
To train the speaker encoder (deep_speaker):
$ cd deep_speaker
$ CUDA_VISIBLE_DEVICES=<gpus> python train.py
- Extract audio features with process_audio.ipynb. An output directory named <dataset_name> should be specified within the notebook (see the notebook's content for more information).
- (Optional) Use vad.ipynb to trim initial silence segments in the extracted mel-spectrograms. We found that this preprocessing step helps speed up model convergence.
- Extract speaker embeddings using gvector_extraction.py:
$ CUDA_VISIBLE_DEVICES=<gpu> python gvector_extraction.py <path-to-dataset-dir> --gvec_ckpt=<path-to-speaker-encoder-checkpoint>
- Train the base synthesizer. First set the proper batch size and number of GPUs in synthesizer/hparams.py:
# file: synthesizer/hparams.py
tacotron_num_gpus = <n_gpus>,
tacotron_batch_size = <bcsz>,
The training code supports data parallelism (samples within one logical batch are evenly spread among the designated GPUs). We found that one 11 GB GTX 1080 Ti GPU can hold about 16 to 24 samples per batch.
$ CUDA_VISIBLE_DEVICES=<gpus> python synthesizer_train.py <run-name> <path-to-dataset>
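The even split of a logical batch across GPUs can be pictured with the short sketch below. It only illustrates the idea and is not the repository's training code; the function name and the choice to raise an error on an uneven split are assumptions.

```python
def split_batch(samples, num_gpus):
    """Split one logical batch into `num_gpus` equally sized shards."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if len(samples) % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    per_gpu = len(samples) // num_gpus
    # Shard i receives the i-th contiguous slice of the batch.
    return [samples[i * per_gpu:(i + 1) * per_gpu] for i in range(num_gpus)]


# A logical batch of 32 samples on 2 GPUs gives 16 samples per GPU.
shards = split_batch(list(range(32)), num_gpus=2)
print([len(shard) for shard in shards])  # [16, 16]
```

Under this reading, the batch size set in hparams.py is the logical total, so it should scale with the number of GPUs to keep each device within the 16 to 24 samples reported above.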
Note: the directory <path-to-dataset> should contain the following entries for the training script to run correctly:
<dataset>
|- mels/ # generated by process_audio.ipynb or vad.ipynb
|- embeds/ # generated by gvector_extraction.py
|- train.txt # generated by process_audio.ipynb
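Before launching a long training run, it can help to verify this layout. The helper below is a small sketch written for this README, not part of the repository; its name is hypothetical, and it only checks that the three entries listed above exist.

```python
from pathlib import Path

# Entries required by the training script, as listed above.
REQUIRED_ENTRIES = {"mels": "dir", "embeds": "dir", "train.txt": "file"}


def missing_entries(dataset_dir):
    """Return the names of required entries that are absent from dataset_dir."""
    root = Path(dataset_dir)
    missing = []
    for name, kind in REQUIRED_ENTRIES.items():
        path = root / name
        present = path.is_dir() if kind == "dir" else path.is_file()
        if not present:
            missing.append(name)
    return missing


# An empty list means the directory has everything the training script expects.
print(missing_entries("datasets/aishell3"))
```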
Note: modifications to hparams.py can also be passed to the training script using the --hparams argument.
Note: the optimization process can be monitored with TensorBoard. The TensorBoard events are written to synthesizer/saved_models/logs-<run_name>/tacotron_events during training.
- Train the feedback synthesizer using the pre-trained base synthesizer parameters. First make sure synthesizer/hparams.py and feedback_synthesizer/hparams.py use consistent model hyper-parameters (e.g. the number of Pre-net layers). Then set the pre-trained checkpoint paths in hparams.py:
# file: feedback_synthesizer/hparams.py
restore_tacotron_path = <path-to-pretrained-tacotron-checkpoint>
restore_spv_path = <path-to-pretrained-speaker-encoder-checkpoint>
$ CUDA_VISIBLE_DEVICES=<gpus> python fc_synthesizer_train.py <run-name> <path-to-dataset>
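One way to check that the two hparams.py files agree before fine-tuning is to compare their values as plain dictionaries. The sketch below assumes the hyper-parameters have already been exported to dicts; how to do that depends on the hparams object used in the repository and is not shown. The function name and the keys in the example are hypothetical placeholders.

```python
def inconsistent_hparams(base, feedback, keys=None):
    """Return {name: (base_value, feedback_value)} for values that differ.

    If `keys` is None, every name present in both dicts is compared.
    """
    if keys is None:
        keys = sorted(set(base) & set(feedback))
    return {
        name: (base.get(name), feedback.get(name))
        for name in keys
        if base.get(name) != feedback.get(name)
    }


# Hypothetical values for illustration only.
base_hparams = {"example_prenet_layers": 2, "example_embed_dim": 256}
feedback_hparams = {"example_prenet_layers": 3, "example_embed_dim": 256}
print(inconsistent_hparams(base_hparams, feedback_hparams))
```

An empty result means the compared hyper-parameters match, so the pre-trained checkpoint should be loadable into the feedback model as far as those settings are concerned.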