CorentinJ/Real-Time-Voice-Cloning

Update on maintaining this project

CorentinJ opened this issue · 47 comments

We're one year after the initial publication of this project. I've been busy with both exams and work since, and it's only last week that I passed my last exam. During that year, I have received SO many messages from people asking for help in setting up the repo and I just had no time to allocate for any of that.
I kinda wished that the popularity of this repo would have died down, but new people keep coming in at a fairly constant rate.
I have no intentions to start developing on this repo again, but I hope I can answer some questions and possibly review some PRs. Use this issue to ask me questions and to bring light upon things that you believe need to be improved, and we'll see what can be done.

First things first, the biggest issue for me with this project is the hecking tensorflow code. Tensorflow sucks, and it sucks just as much to install it let alone install an older version.

I believe it would lower the entry barrier for new users if the version of that package were upgraded. I've seen a PR for that, but it seems to cover only the Colab version. A PR for the entire repo would be appreciated.

Ideally, we'd replace all of the synthesizer code with pytorch code (there are several open source pytorch synthesizers out there), but that's a lot of work.

If anybody is willing to pick up on either of these things, let me know.

Second thing: webrtcvad. That package is hell to install on Windows. There are alternatives for noise removal out there. There's also the possibility of not using it at all, but for both LibriSpeech and LibriTTS I would recommend it.
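For anyone replacing webrtcvad outright, silence trimming can be approximated with a plain energy gate. A minimal sketch in NumPy (the function name, frame size and threshold below are my own illustrative choices, not the repo's settings):

```python
import numpy as np

def trim_silence(wav: np.ndarray, sr: int, top_db: float = 30.0,
                 frame_ms: int = 30) -> np.ndarray:
    """Crude VAD substitute: drop frames whose RMS energy is more than
    `top_db` below the loudest frame. All parameters are illustrative."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(wav) // frame_len
    frames = wav[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    db = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    keep = db > -top_db
    return frames[keep].reshape(-1)
```

It is cruder than a real VAD (it keeps loud noise and works at frame granularity), but it installs everywhere, which is the point here.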

I'd like #331 merged to enable CPU support by default. It also simplifies the install process for those with a goal of running demo_cli.py for evaluation purposes.
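For reference, the CPU-by-default behavior boils down to a device fallback along these lines (a sketch of the general idea, not #331's actual code; the function name is mine):

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when PyTorch sees a usable GPU, else fall back to "cpu".

    Guarding the import lets a demo_cli.py style evaluation run even on
    machines where torch is installed CPU-only or CUDA is misconfigured.
    """
    if prefer_gpu:
        try:
            import torch
            if torch.cuda.is_available():
                return "cuda"
        except ImportError:
            pass
    return "cpu"
```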

Some kind of API or improved CLI would be a worthwhile and easy enhancement for the community to pursue. Good usability will help keep this repo as the focal point for development of open-source SV2TTS. This is really neat stuff, many thanks for sharing your code and pre-trained models under a permissive license.

I'll give a review to #331 tomorrow and will probably make some changes as well.

Thank you for reviewing #331. In response I have submitted #366 which addresses your comments and carefully removes all unnecessary changes from the PR. When you have time please review and merge that one instead.

Very nice. Do you have permission to squash & merge now that I've approved it?

It does not give me that option. Would you like me to squash and merge on my own fork and submit a new PR?

Opened #375 to propose a workaround for webrtcvad.

> I kinda wished that the popularity of this repo would have died down, but new people keep coming in at a fairly constant rate.
>
> Use this issue to ask me questions and to bring light upon things that you believe need to be improved, and we'll see what can be done.

there are many other better open-source implementations of neural TTS out there, and new ones keep coming every day.

It would be awesome if you could point out some alternatives; maybe people would start using them instead. I'm not knowledgeable at all in this field, so I don't know how to find anything on my own, or how to compare which repos are good and which ones work best for what.

I think having a load and click free GUI app is the appeal of your software.


Yeah, I can't say I expected it to have that big of an impact on the popularity of this repo when I wrote it. Too bad that it only looks easy; it's still out of reach for most people with little experience in programming.

Prior to becoming my colleague, fatchord wrote not only WaveRNN but also a Tacotron 1 implementation (which, by the way, has not been proven inferior to Tacotron 2): https://github.com/fatchord/WaveRNN

NVIDIA has a Tacotron 2 implementation: https://github.com/NVIDIA/tacotron2

Mozilla as well, with more frequent updates & features: https://github.com/mozilla/TTS

I would also check paperswithcode.com and ignore my repo and the ones above if you're looking for something else; perhaps something more recent, as neural TTS is still very much growing. https://paperswithcode.com/task/text-to-speech-synthesis

Hi, @CorentinJ. This is a fantastic project which I've had a lot of fun playing around with.

The biggest challenge with using other projects seems to be data sets. All the other projects I've found are most easily trained on the LJSpeech data set whereas this one can generate unique results with a small sample of audio. Are you aware of any other projects that can be used to clone speech with small audio samples? Thanks!

@cantrell You've got to understand the way voice cloning works in this repo. The Tacotron 2 architecture in my repo barely differs from the usual Tacotron 2. The only thing that's added is a way to condition it on a speaker's voice, which is a very minor addition. That's why it should be simple to transfer it over to an existing Tacotron 2 implementation. The Mozilla repo has ongoing (or maybe finished?) work on that, so that's one alternative.

Do understand that it's not a matter of training the model on only 5 seconds of audio, it's an entirely different procedure which does not involve any training.
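To illustrate the point: the conditioning amounts to broadcasting a fixed speaker embedding across the encoder's timesteps, e.g. by concatenation. A toy NumPy sketch (shapes and names are illustrative, not the repo's actual code):

```python
import numpy as np

def condition_on_speaker(encoder_out: np.ndarray,
                         speaker_embed: np.ndarray) -> np.ndarray:
    """Tile a (embed_dim,) speaker embedding across the time axis and
    concatenate it onto (T, enc_dim) encoder outputs, giving
    (T, enc_dim + embed_dim) inputs for the attention/decoder stack.
    No training is involved at cloning time: the embedding comes from a
    frozen speaker encoder run on a few seconds of reference audio."""
    T = encoder_out.shape[0]
    tiled = np.tile(speaker_embed[None, :], (T, 1))
    return np.concatenate([encoder_out, tiled], axis=1)
```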

Got it. Thanks, @CorentinJ. I'll take a closer look (and/or check out the Mozilla implementation).

@CorentinJ @cantrell Can you guys take a look at our recent TTS framework here (https://github.com/TensorSpeech/TensorflowTTS)? We support Tacotron 2, FastSpeech, FastSpeech 2 and Multi-band MelGAN in a native TensorFlow implementation. We also plan to support other languages, TFLite for mobile, and TensorRT for server deployment. Almost all supported models run in real time now.

audio samples: https://tensorspeech.github.io/TensorflowTTS/
colab demo: https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing

I can make pull request if you want :D.

@dathudeptrai Cool, do go ahead, but remember that you'll have to ensure that data compatibility between WaveRNN and the synthesizer is maintained, and that you will have to provide new pretrained weights for both of these models.

@CorentinJ I think it's not hard to convert the pretrained Tacotron 2 here to my TensorFlow 2 implementation, since my implementation is based on the Tacotron 2 code used here.

@CorentinJ can you please take a quick look at #227 (synthesizer produces large gaps when processing very short texts) and give us a clue where that issue might be coming from, or where to start if we want to fix it?

Edit: @macriluke says it results from the training dataset. Is it really because the models are trained on medium to long utterances? #291 (comment)

I was going off of this bit of the thesis:

> The prosody is however sometimes unnatural, with pauses at unexpected locations in the sentence, or the lack of pauses where they are expected. This is particularly noticeable with the embedding of some speakers who talk slowly, showing that the speaker encoder does capture some form of prosody. The lack of punctuation in LibriSpeech is partially responsible for this, forcing the model to infer punctuation from the text alone. This issue was highlighted by the authors as well, and can be heard on some of their samples of LibriSpeech speakers. The limits we imposed on the duration of utterances in the dataset (1.6s - 11.25s) are likely also problematic. Sentences that are too short will be stretched out with long pauses, and for those that are too long the voice will be rushed.

It looks like maybe I made the wrong assumption about the meaning of the word "pauses" here, as I see in #53 it's mentioned that this is an issue introduced through the code.

EDIT: I will say that while the whooshing and long pauses aren't as common with other pretrained Tacotrons, I have heard them in mid-training evaluations of different synthesis models, so the real cause could potentially be both the training and the code here.

This issue of large gaps is something that also occurred at Resemble.AI, and that I have worked on and fixed. It's a serious amount of work; I'll give you the broad strokes:

  • Use LibriTTS instead of LibriSpeech in order to have punctuation.
  • LibriTTS needs to be curated to remove speakers with bad prosody.
  • You can lower the upper bound I put on utterance duration, which I suspect has the effect of removing long utterances that are more likely to contain more pauses (I formally evaluated models trained this way and found they generate long pauses less frequently). It also trains faster and has no drawbacks (with a good attention paradigm, the model can generate sentences longer than those seen in training).
  • The attention paradigm needs to be replaced; forward attention is poor.
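The duration-bound item above is just a filter over utterance metadata before training. A hedged sketch (the metadata layout and the lowered 7.5 s bound are illustrative; the thesis cites 1.6 s - 11.25 s as the original limits):

```python
def filter_by_duration(utterances, sr, min_s=1.6, max_s=7.5):
    """Keep only utterances whose sample count maps to a duration inside
    [min_s, max_s] seconds. `utterances` is assumed to be a list of
    (utterance_id, num_samples) pairs; the 7.5 s upper bound is an
    illustrative lowered value, not the repo's setting."""
    return [(uid, n) for uid, n in utterances
            if min_s <= n / sr <= max_s]
```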

@CorentinJ
It's a pity that you decided not to update this project any more.
I have followed your work since the latter half of 2019.
For the encoder part, I removed the ReLU activation of the last linear layer and trained with 18k speakers (Chinese + English) for about 2-3 months. I used the Resemblyzer tool, as well as my own tooling, to analyze the embeddings generated by the model. I guess the encoder is ready.
For the synthesizer, can you help me and give me some advice?

  1. My target language is Chinese. I do not have enough TTS corpus to train the synthesizer; only ASR corpora can be found, for example AISHELL, but the quality is not so good. Do you have any suggestions on how to preprocess the wavs?
  2. When given a target wav, the end-to-end synthesized wavs have some characteristics of the target timbre, but they are only similar at a low level. Do you have any suggestions to improve the similarity? How are your results? Could you share some of your best results?

@Liujingxiu23 I don't have any suggestions regarding your data. As for the audio quality, you can improve it by finetuning both Tacotron and the vocoder on a single speaker. To improve the quality of voice cloning in general, there's a lot more work, starting with the list I gave above.

Dear @CorentinJ, thank you for your amazing work and your continued support here. I have a few questions:
a) Would you still apply denoising to LibriTTS? I find that the samples are high quality, and the data itself has already been cleaned.
b) Can I train on both LibriTTS and VCTK? If so, what should I look out for?
c) When training the speaker encoder (SE), I find that there is a difference in the difficulty of the datasets: VCTK, LibriTTS and Mozilla Common Voice are 'easy' for the SE, and it achieves low loss and low EER quickly. However, VoxCeleb{1,2} are much harder.
-> Should I train on each dataset separately, and once the model has 'trained out' on the easier datasets, skip them in favor of more iterations on VoxCeleb?

a) Yes I would. Having manually curated LibriTTS myself, I can definitely say that a lot of speakers are very noisy. Do a little data exploration to convince yourself of that: pick 100 random samples and listen to all of them. There are still many issues with this improved version of LibriSpeech: inconsistent volume, background noise, poor mic quality, mic bumps, ... Regarding denoising alone, here's a sample from LibriTTS and its denoised version:
https://puu.sh/G839T.wav
https://puu.sh/G839Y.wav

b) Yes you can, some gotchas:

  • Ensure that your preprocessed data is resampled to the same sample rate
  • Ensure you normalize volume
  • Beware of balance: compare the size of LibriTTS vs that of VCTK and compare the number of speakers. You might need to prune away some data from LibriTTS
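The volume normalization point above can be handled by scaling every clip from both corpora to a common RMS level. A sketch (the -23 dBFS target is an illustrative choice, not the repo's setting):

```python
import numpy as np

def normalize_volume(wav: np.ndarray, target_dbfs: float = -23.0) -> np.ndarray:
    """Scale `wav` so its RMS sits at `target_dbfs` (dB relative to a full
    scale of 1.0), then clip to stay in [-1, 1]. -23 dBFS is an illustrative
    broadcast-style target; pick one level and apply it to every corpus."""
    rms = np.sqrt(np.mean(wav ** 2) + 1e-12)
    gain = 10 ** (target_dbfs / 20) / rms
    return np.clip(wav * gain, -1.0, 1.0)
```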

c) I don't know if it's worth the effort. The voice encoder is a nice example of "throw more resources at it and it'll keep improving": if you merge your datasets (although again, balance might be an issue given the size of VoxCeleb) and train for long enough, it should perform well anyway.

Thank you so much for your answer. Two follow-up questions:
a) Why would dataset balance be an issue? Assume I have 10 times more samples from LibriTTS than from VCTK: if the input format, sampling rate and preprocessing are the same, why should this imbalance matter (provided the clips are of somewhat the same quality w.r.t. noise)? Same for SP.
b) You mentioned manually curating LibriTTS. Could you elaborate on what you did in a bit more detail? Are there any papers, tools, etc. you can point me to? Did you listen to all audio files? (I cannot imagine this.)

Again, thank you so much for your answers. At my university (Munich, Germany), nobody is doing speech synthesis, so I'm a bit on my own here.

a) It's a matter of what you want. If you want to reach VCTK quality, then LibriTTS samples vastly outnumbering VCTK samples is going to cancel that out, due to sampling being uniform. In a classical multispeaker model with a speaker table (i.e. an embedding layer), it would still make sense to have a 10-to-1 ratio if your goal were only to encode a voice for these speakers in the speaker table.
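To sketch what countering uniform sampling could look like: weight each clip inversely to its dataset's size, so each corpus gets equal total probability. The helper below is mine, not part of the repo; its output could feed e.g. torch.utils.data.WeightedRandomSampler:

```python
import numpy as np

def balanced_weights(dataset_labels):
    """Per-sample weights that give each dataset equal total probability,
    regardless of how many clips it contributes. `dataset_labels` is one
    label per clip, e.g. ["libritts", "libritts", "vctk", ...]."""
    labels = np.asarray(dataset_labels)
    weights = np.empty(len(labels), dtype=float)
    for name in np.unique(labels):
        mask = labels == name
        weights[mask] = 1.0 / mask.sum()  # rarer dataset -> heavier clips
    return weights / weights.sum()
```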

b) I can't elaborate too much, no. Just know that some of the data is of poor quality, and some is great. A bit of data exploration should give you an idea.

You said:

> there are many other better open-source implementations of neural TTS out there, and new ones keep coming every day.

Which ones are better? Can you name them?

> Prior to becoming my colleague, fatchord wrote not only WaveRNN but also a Tacotron 1 implementation (which, by the way, has not been proven inferior to Tacotron 2): https://github.com/fatchord/WaveRNN
>
> NVIDIA has a Tacotron 2 implementation: https://github.com/NVIDIA/tacotron2
>
> Mozilla as well, with more frequent updates & features: https://github.com/mozilla/TTS
>
> I would also check paperswithcode.com and ignore my repo and the ones above if you're looking for something else; perhaps something more recent, as neural TTS is still very much growing. https://paperswithcode.com/task/text-to-speech-synthesis

Also when I said:

> and new ones keep coming every day

I actually meant that new papers keep coming every day, not open source implementations (sadly)

Mozilla TTS has a PR implementing SV2TTS: mozilla/TTS#472

Have not tried it yet, but if/when it outperforms this one, it would be a good project to port the toolbox UI and maybe the preprocessing scripts over to that implementation. (I will not be the one to do it, but putting it out there as an idea.)

If the PR for SV2TTS on Mozilla's TTS had been submitted 2 weeks earlier, I would have abandoned my effort on the pytorch synthesizer. But since I'm already so far along on #472 (and having learned much along the way), I will try and bring it to completion. It should improve the longevity of this repo and make it a lot more maintainable in the long run.

My colleague @fatchord has published his new vocoder paper: https://arxiv.org/abs/2008.02493

Samples: https://resemble-ai.github.io/hooligan_demo/

@CorentinJ I noticed the updates to the README telling potential users to find a different repo. Has anything changed regarding your intentions for this repo? Do you still want the pytorch synthesizer?

If tensorflow is entirely removed from this repo, I will change that message for sure.

I still get a lot of feedback from people who spent hours trying to set things up.

> I kinda wished that the popularity of this repo would have died down, but new people keep coming in at a fairly constant rate.

You wanted the popularity of this repo to go down because you couldn't handle the requests? That's kinda absurd; people's interest is a good thing, and more developers means less work on one person's shoulders. ;)

@CorentinJ

> If tensorflow is entirely removed from this repo, I will change that message for sure.
>
> I still get a lot of feedback from people who spent hours trying to set things up.

In my opinion, it is actually not that hard to set up on Ubuntu.
On Windows... well... good luck. (for now)

I hope this will help reduce complaints :

WIKI

Installation - Ubuntu-20.04

Installation - Windows-10 TODO

I want to implement something like this for voice-to-voice. Basically, I want to record a voice and then use this as a basis for masking N voices, where N >> 1. Some questions:

  1. Regarding "If you're planning to work on a serious project, my strong advice: find another TTS repo.": @CorentinJ, would this comment still apply if I don't need the part that reads and creates audio from a given text?
  2. I understand that the impressive part of this repo is that it can clone a voice given only 5 seconds of audio, but in general does the output improve with training on more (and more diverse) data? If I wanted to have a professional speaker record hours of data to serve as input audio, would the output improve in quality?

@CodingRox82 Hi,
if you are seriously interested in Voice to Voice / Voice changer / Voice Transfer / "insert any other description that involves converting the audio from 1 speaker to another without passing through TTS";

Would you be interested in joining a small group with common interest?
We are currently working on creating a polished dataset.
Our small group has different but overlapping interests, for the good of this repo and of others that can provide voice to voice, bypassing TTS.

If you are interested, leave a comment in #474

@CorentinJ Thanks for providing the statement of direction in #543 (comment)

In that context it's not worth my time continuing to provide technical support as I have the last few months. It was initially helpful to identify common pain points but now it's mainly down to getting rid of tensorflow, and people asking for an exe. To help potential developers I suggest disallowing the use of the issues board for tech support and requests for help with projects since it dilutes the development effort. I've donated a lot of my time trying to build some sense of community, but unfortunately it is not attracting and retaining the type of people who can push this project forward.

Tensorflow has this issue policy, and it could help to implement something similar. I realize this will be unpopular because a lot of individuals want help and tech support, but it needs to be understood that you get what you pay for with open source.

> If you open a GitHub Issue, here is our policy: 1. It must be a bug/performance issue or a feature request or a build issue or a documentation issue (for small doc fixes please send a PR instead). 2. Make sure the Issue Template is filled out. 3. The issue should be related to the repo it is created in.
>
> Here's why we have this policy: We want to focus on the work that benefits the whole community, e.g., fixing bugs and adding features. Individual support should be sought on Stack Overflow or other non-GitHub channels. It helps us to address bugs and feature requests in a timely manner.

Sorry for the late reply @blue-fish . I'm definitely interested in using this. I like your idea of creating a pre-compiled version to give people to test out. I'm going to start tinkering around with this to try to get it to work and if I find the time to learn how to create a distributable precompiled version I'll give it a shot.

@blue-fish Thanks a lot for your valuable help and time. I did come to the same conclusions as you. A lot of the users coming through are highly inexperienced.

I have been wanting to make things simpler just for the sake of reducing the number of technical support requests, but my awkward position makes it hard for me to stay involved.

Let me know if I'm up to date on this-

  • blue-fish finished the effort to implement and train in pytorch in his fork.

  • on review, it was decided that the tensorflow model still had the better overall quality.

  • sometimes with the tensorflow model the stop token prediction fails and results in large gaps in the synthesis.

  • sometimes the pytorch model will quit in the middle of synthesis; something to do with the attention model?

The stop token prediction (whether the model knows when to end the generation) of the tensorflow model is usually good; the long pauses are more of a dataset/data-representation and attention-mechanism issue.

The pytorch model is the one that fails at predicting stop tokens - indeed due to its attention mechanism - which is why it stops during generation.
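For context, stop-token prediction in a Tacotron-style decoder is a per-frame sigmoid checked against a threshold during inference. A toy sketch of that check (threshold, hard cap and names are illustrative, not the repo's code):

```python
import math

def should_stop(stop_logits, step, threshold=0.5, max_steps=1000):
    """Decide whether to end mel generation: stop when the sigmoid of the
    latest stop logit crosses `threshold`, or when a hard cap on decoder
    steps is hit, so a model that never predicts "stop" (the failure mode
    discussed above) cannot loop forever."""
    prob = 1.0 / (1.0 + math.exp(-stop_logits[-1]))
    return prob > threshold or step >= max_steps
```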

Ah okay I had it almost exactly backwards.

So following blue-fish's instructions in #538 to retrain the tensorflow model on LibriTTS/LibriSpeech should resolve the long pauses, and it also won't have the stop token issue?

Recent similar projects:
https://github.com/Tomiinek/Multilingual_Text_to_Speech
https://github.com/espnet/espnet

Can I also clone voices with these repos using a small audio clip of 3-5 minutes? This repo needs a 5-second audio clip, but for Resemble.AI a larger voice sample is better. Resemble now asks for voice verification, something I can't do.

Are there repos that can also use a longer voice sample of, for example, 5 minutes, and that sound better than this repo? If so, which ones give the best results?

I would like to pay the person who can help me make good voice clones from 3-5 minute samples. I really need it. blue-fish, I see you're very active here. Help me? :)

Maybe you could add some maintainers to the repo, create an announcement, and ask for help? It has happened before with other repositories.

That's very good work, congrats.
I don't know if this is the right place to post this, but the toolbox gives an American accent to the cloned voice even though the speaker I want to clone has a British accent. Is it the encoder, the synthesizer, the vocoder, or all three? Is there a way to change this without an NVIDIA GPU to train the models? Or are there already models trained on British accents available?
Also, I noticed the pronunciation is sometimes wrong and it even misses some words entirely. Is there a way to change this? Maybe it's due to punctuation not being taken into account?