├── custom_architectures/ : custom architectures
├── custom_layers/ : custom layers
├── custom_train_objects/ : custom objects for training
│ ├── callbacks/ : custom callbacks
│ ├── generators/ : custom data generators
│ ├── losses/ : custom losses
│ ├── optimizers/ : custom optimizers / lr schedulers
├── datasets/ : utilities for dataset loading / processing
│ ├── custom_datasets/ : custom dataset processing functions
├── hparams/ : utility class to define modular hyper-parameters
├── models/ : main `BaseModel` subclasses directory
│ ├── siamese/ : directory for `AudioSiamese` class* used in `SV2TTS`
│ ├── tts/ : directory for Text-To-Speech models
├── pretrained_models/ : saving directory for pretrained models
└── utils/
See my data_processing repo for more information on the `utils` module and the data processing features.

See my base project for more information on the `BaseModel` class, supported datasets, project extension, ...
* Check my Siamese Networks project for more information
- Text-To-Speech (module `models.tts`) :

Feature | Function / class | Description
--- | --- | ---
Text-To-Speech | `tts` | perform TTS on the text you want with the model you want
stream | `tts_stream` | perform TTS on the text you enter

You can check the `text_to_speech` notebook for a concrete demonstration.
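As a quick illustration, here is a hypothetical usage sketch : the exact signatures of `tts` / `tts_stream` and the model name used below are assumptions, so check the notebook for the real calls.

```python
# Hypothetical usage sketch : argument names and the model name are assumptions,
# the real signatures are shown in the `text_to_speech` notebook.
from models.tts import tts, tts_stream

# Synthesize a given text with a pretrained model
tts('Hello world, this is a test !', model = 'pretrained_tacotron2')

# Interactive mode : type sentences and listen to the generated audio
tts_stream(model = 'pretrained_tacotron2')
```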
Available architectures :

Language | Dataset | Synthesizer | Vocoder | Speaker Encoder | Trainer | Weights
--- | --- | --- | --- | --- | --- | ---
en | LJSpeech | Tacotron2 | WaveGlow | / | NVIDIA | Google Drive
fr | SIWIS | Tacotron2 | WaveGlow | / | me | Google Drive
fr | SIWIS, VoxForge, CommonVoice | SV2TTSTacotron2 | WaveGlow | Google Drive | me | Google Drive
You can download the tensorflow version of WaveGlow at this link. Models must be unzipped in the `pretrained_models/` directory !
Important Note : the NVIDIA model available on torch hub requires a compatible GPU with the correct pytorch configuration. That is why I released pre-converted models (both Tacotron2 and WaveGlow) in tensorflow, in case you do not want to configure pytorch ! :)

You can find a demonstration at this link, running on Google Colab.

You can also find some generated audio in `example_outputs/` or directly in the notebooks.
- Clone this repository : `git clone https://github.com/yui-mhcp/text_to_speech.git`
- Go to the root of this repository : `cd text_to_speech`
- Install requirements : `pip install -r requirements.txt`
- Open the `text_to_speech` notebook and follow the instructions !
- Make the TO-DO list
- Comment the code
- Add pretrained weights for French
- Make a Google Colab demonstration
- Implement WaveGlow in tensorflow 2.x
- Add `batch_size` support for vocoder inference
- Add pretrained SV2TTS weights
- Add document parsing to perform TTS on documents
- Add support for new languages
- Add new TTS architectures / models
- Add a similarity loss to test a new training procedure for single-speaker fine-tuning
There exist 2 main ways to enable multi-speaker in the Tacotron2 architecture :
- Use a speaker id, embed it with an `Embedding` layer and concat / add it to the `Encoder` output (see the sketch below)
- Use a Speaker Encoder (SE) to embed audio from speakers and concat / add this embedding to the encoder output
I did not test the 1st idea but it is available in my implementation.
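For illustration, here is a minimal sketch of this 1st approach (not the project's actual code) : the speaker id is embedded with a keras `Embedding` layer and added to each frame of the encoder output.

```python
import tensorflow as tf

num_speakers, embedding_dim = 10, 512

# Embed an integer speaker id into a vector of size `embedding_dim`
speaker_embedding_layer = tf.keras.layers.Embedding(num_speakers, embedding_dim)

encoder_output = tf.random.normal((2, 5, embedding_dim))   # (batch, text_length, dim)
speaker_id     = tf.constant([0, 3])                       # (batch, )

speaker_emb = speaker_embedding_layer(speaker_id)          # (batch, dim)
# Add the embedding to each frame (broadcast over the text_length axis)
output = encoder_output + tf.expand_dims(speaker_emb, axis = 1)
print(output.shape)                                        # (2, 5, 512)
```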
Note : in the next paragraphs, encoder refers to the Tacotron Encoder part while SE refers to a speaker encoder model (detailed below).
The Speaker Encoder Text-To-Speech approach comes from the From Speaker Verification To Text-To-Speech (SV2TTS) paper, which shows how to use a Speaker Verification model to embed audio and use these embeddings as input for a Tacotron2 model.
The idea is the following :
- Train a model to identify speakers based on their audio : the speaker verification model. This model basically takes as input an audio sample (5-10 sec) from a speaker, embeds it and compares it to baseline embeddings to decide whether the speakers are the same or not
- Use this speaker encoder model to produce embeddings of the speaker to clone
- Make a classical text encoding with the Tacotron Encoder part
- Concatenate the speaker embedding (1D vector) to each frame of the encoder output \*
- Make a classical forward pass with the Tacotron Decoder part
The idea is that the Decoder will learn to use the speaker embedding to copy the speaker's prosody / intonation / ... and read the text with this speaker's voice : it works quite well !
\* The embedding is a 1D vector while the encoder output is a matrix with shape (text_length, encoder_embedding_dim). The idea is to concatenate the embedding to each frame by repeating it text_length times.
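Here is a minimal sketch of this repeat-and-concat trick in tensorflow (not the actual `SV2TTSTacotron2` code, shapes and names are illustrative) :

```python
import tensorflow as tf

def concat_speaker_embedding(encoder_output, speaker_embedding):
    """Repeat a 1D speaker embedding along the time axis and concatenate it
    to each frame of the encoder output.

    encoder_output    : (batch, text_length, encoder_embedding_dim)
    speaker_embedding : (batch, speaker_embedding_dim)
    returns : (batch, text_length, encoder_embedding_dim + speaker_embedding_dim)
    """
    text_length = tf.shape(encoder_output)[1]
    # (batch, 1, dim) -> (batch, text_length, dim)
    repeated = tf.repeat(tf.expand_dims(speaker_embedding, axis = 1), text_length, axis = 1)
    return tf.concat([encoder_output, repeated], axis = -1)

# Toy example : batch of 2, text of length 5, encoder dim 512, speaker dim 256
enc = tf.random.normal((2, 5, 512))
spk = tf.random.normal((2, 256))
print(concat_speaker_embedding(enc, spk).shape)    # (2, 5, 768)
```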
However there are some problems with this approach :
- A perfect generalization to new speakers is really hard because it requires datasets with many speakers (more than 1k), which is really rare in Text-To-Speech datasets
- The audio should be good quality to avoid creating noise in the output voices
- The Speaker Encoder must be good enough to properly separate speakers
- The Speaker Encoder must be able to embed speakers in a relevant way so that the Tacotron model can extract useful information on the speaker's prosody

For the 1st problem, there is no real solution except combining different datasets, as I did with the CommonVoice, VoxForge and SIWIS datasets.
Another solution is to train a good quality model and fine-tune it with a small amount of data from a particular speaker. The big advantage of this approach is that you can train a new model really fast, with less than 20 min of audio from the speaker (which is impossible with a classical single-speaker training).

For the second point, make sure you have good quality audio : my experiments have shown that with the original datasets (which are quite poor quality), the model never learned anything.
However there exists a solution : preprocessing ! My `utils/audio` folder has many powerful preprocessing functions for noise reduction (using the noisereduce library) and audio silence trimming (which is really important for the model).
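The project's own functions live in `utils/audio` ; the sketch below only illustrates the kind of preprocessing involved (library calls and parameters are illustrative, not the project's actual API) :

```python
import librosa
import noisereduce as nr
import soundfile as sf

def preprocess_audio(filename, rate = 22050, top_db = 30):
    """Sketch of the preprocessing idea : load, reduce noise, trim silence."""
    audio, _ = librosa.load(filename, sr = rate)
    # Noise reduction with the noisereduce library
    audio = nr.reduce_noise(y = audio, sr = rate)
    # Trim leading / trailing silence (everything below `top_db` dB of the peak)
    audio, _ = librosa.effects.trim(audio, top_db = top_db)
    return audio

cleaned = preprocess_audio('speaker_sample.wav')
sf.write('speaker_sample_clean.wav', cleaned, 22050)
```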
For the last 2 points, read the next section on the speaker encoder.
The SE part must be able to differentiate speakers and embed them in a meaningful way.
The model used in the paper is a 3-layer LSTM model with a normalization layer, trained with the GE2E loss. The problem is that training this model is really slow : it took 2 weeks on 4 GPUs in CorentinJ's master thesis (cf. his github).
This was not possible for me (because I do not have 4 GPUs 😄), so I tried something else : use my AudioSiamese model ! Indeed, the objective of this model is to create speaker embeddings and minimize the distance between embeddings from the same speaker, so why not !
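The verification idea itself is simple : compare two embeddings with a similarity score and a threshold. The sketch below only illustrates this principle (the threshold and the embedding size are arbitrary), it is not the `AudioSiamese` implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1D embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(embedding_1, embedding_2, threshold = 0.75):
    """Decide whether 2 embeddings come from the same speaker.
    The threshold is arbitrary here ; in practice it is tuned on a validation set.
    """
    return cosine_similarity(embedding_1, embedding_2) >= threshold

# Toy example with random "embeddings"
emb_a, emb_b = np.random.normal(size = 256), np.random.normal(size = 256)
print(same_speaker(emb_a, emb_b))
```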
My experiments have shown 2 interesting results :
- An AudioSiamese trained on raw audio is quite good for speaker verification but embeds speakers in a way that is not meaningful for Tacotron, so the results were quite poor
- An AudioSiamese trained on mel-spectrograms (same parameters as the Tacotron mel function) is as good for speaker verification but seems to extract more meaningful information ! (see the sketch after this list)
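As an illustration of the 2nd point, here is how a mel-spectrogram can be computed with librosa using typical Tacotron2-style parameters ; the exact values of the project's mel function may differ.

```python
import librosa

def tacotron_style_mel(audio, rate = 22050):
    """Mel-spectrogram with typical Tacotron2-style parameters
    (the project's actual mel function may use different values)."""
    mel = librosa.feature.melspectrogram(
        y = audio, sr = rate, n_fft = 1024, hop_length = 256,
        win_length = 1024, n_mels = 80, fmin = 0.0, fmax = 8000.0
    )
    return librosa.power_to_db(mel)   # log-scale, as usually fed to the models

audio, _ = librosa.load('speaker_sample.wav', sr = 22050)
print(tacotron_style_mel(audio).shape)   # (80, n_frames)
```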
The big advantage is that in less than 1 training night you can have your Speaker Encoder, which is crazy : 1 night on a single GPU instead of 2 weeks on 4 GPUs !
Furthermore, a visual comparison of the embeddings produced by the 3-layer LSTM encoder and by my Siamese Network encoder shows that they are really similar.
In order to avoid training a SV2TTS model from scratch, which would be completely impossible on a single GPU, I created a partial transfer learning code.
The idea is quite simple : make transfer learning between models that have the same number of layers but different shapes \*. This allowed me to use my single-speaker pretrained model as a base for the SV2TTS model ! Experiments showed that it works pretty well : the model has to learn new neurons specific to voice cloning but can reuse its pretrained neurons for speaking, quite funny !
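The actual implementation is in the project's code (and also handles models with a different number of layers, cf. the note below) ; this is only a simplified sketch of the shape-overlap idea :

```python
def partial_transfer(pretrained_model, new_model):
    """Simplified sketch of partial transfer learning : for each pair of layers,
    copy the overlapping region of the pretrained weights and keep the extra
    (new) neurons with their fresh initialization.
    Both models are assumed to be keras models with the same number of layers.
    """
    for old_layer, new_layer in zip(pretrained_model.layers, new_model.layers):
        new_weights = []
        for old_w, new_w in zip(old_layer.get_weights(), new_layer.get_weights()):
            transferred = new_w.copy()
            # Region where both shapes overlap
            overlap = tuple(slice(0, min(o, n)) for o, n in zip(old_w.shape, new_w.shape))
            transferred[overlap] = old_w[overlap]
            new_weights.append(transferred)
        new_layer.set_weights(new_weights)
```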
Some ideas that showed some benefits (especially for single-speaker fine-tuning) :
- After some epochs (2-5), you can set the Postnet part as non-trainable : this part basically improves mel quality but is not speaker-specific, so there is no need to train it too much
- After some epochs (5-10), you can set the Tacotron Encoder part as non-trainable (only if your pretrained model was for the same language) : text encoding is not speaker-specific, so there is no need to train it too much

The idea behind these tricks is that the only speaker-specific part is the DecoderCell, so we can make the other parts non-trainable to force the model to learn this specific part (see the sketch below).
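A sketch of these tricks with hypothetical attribute names (the real layer names depend on the Tacotron2 implementation) :

```python
import tensorflow as tf

def freeze_non_speaker_parts(tacotron, freeze_encoder = False):
    """Freeze the parts that are not speaker-specific so that fine-tuning
    focuses on the DecoderCell. `postnet` / `encoder` are hypothetical
    attribute names, adapt them to your Tacotron2 implementation.
    """
    tacotron.postnet.trainable = False
    if freeze_encoder:   # only if the pretrained model was for the same language
        tacotron.encoder.trainable = False
    # With keras, the model must be re-compiled for the new flags to take effect
    # (the optimizer / loss here are placeholders)
    tacotron.compile(optimizer = tf.keras.optimizers.Adam(1e-4), loss = 'mse')
```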
* Note that I also implemented it when models do not have the same number of layers
You can contact me at yui-mhcp@tutanota.com or on discord at yui#0732
The objective of these projects is to facilitate the development and deployment of useful applications using Deep Learning for solving real-world problems and helping people. For this purpose, all the code is under the GNU GPL v3 licence.
Furthermore, you cannot use any of these projects for commercial purposes without my permission. You can use, modify, distribute and use any of my projects for production as long as you respect the terms of the licence and use them for non-commercial purposes (i.e. free applications / research).
If you use this project in your work, please cite this project to give it more visibility ! 😄
@misc{yui-mhcp,
    author       = {yui},
    title        = {A Deep Learning projects centralization},
    year         = {2021},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/yui-mhcp}}
}
The code for this project is a mixture of multiple GitHub projects, in order to have a fully modular Tacotron-2 implementation :
- [1] NVIDIA's repository (tacotron2 / waveglow) : this was my first implementation, where I copied their architecture in order to reuse their pretrained models in a tensorflow 2.x implementation.
- [2] The TFTTS project : my 1st model was quite slow and had many Out Of Memory (OOM) errors, so I improved the implementation by using the TacotronDecoder from this github, which allows the swap_memory argument by using dynamic_decode
- [3] Tensorflow Addons : as I had some trouble using the library due to version issues, I copied just the dynamic_decode() and the BaseDecoder class to use them in the TacotronDecoder implementation
- [4] CorentinJ's Real-Time Voice Cloning project : this repository is an implementation of the SV2TTS architecture. I did not copy any of its code as I already had my own implementation (which is slightly different for this repo), but it inspired me to add the SV2TTS feature to my class.
Papers :
- [5] Tacotron 2 : the original Tacotron2 paper
- [6] Waveglow : the WaveGlow model
- [7] Transfer learning from Speaker Verification to Text-To-Speech : original paper for the SV2TTS idea
- [8] Generalized End-to-End loss for Speaker Verification : the GE2E Loss paper