How can I train my audio files to use an Indian accent?
ash1407 opened this issue · 9 comments
How can I train my audio files as data for the encoder and vocoder, to use an Indian accent? The Indian accent is different, so the output does not sound like my own voice when I listen to it.
This is not an easy undertaking so before you start, make sure you satisfy the prerequisites. You must be able to answer "yes" to all questions below:
- Does your computer have a NVIDIA GPU? (See the quick check after this list.)
- Do you have coding experience?
- Are you willing to devote at least 20 hours to the task?
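A quick way to answer the first question, assuming PyTorch is already installed as described in the README (this snippet is only an illustration, not part of the toolbox):

```python
import torch

# True means an NVIDIA GPU is visible to PyTorch and CUDA is set up correctly
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```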
I have not gone through the process myself, but I'll try to outline it since we don't have a good explanation. What you need to do is to fine-tune the pretrained synthesizer and vocoder models on a suitable dataset.
1. Find a suitable dataset. Freely available resources include AccentDB (Indian accent) and VCTK (other English accents). For best results on your own voice, record your own dataset, though this will take many hours.
2. Follow the steps in README.md to enable GPU support.
3. Go to the training wiki page and follow the steps for the synthesizer and vocoder training on the LibriSpeech dataset.
   - Review the preprocessing code and understand what it is doing.
   - Understand the format of the files in the <datasets_root>/SV2TTS folder.
4. Preprocess your dataset from step 1 to generate training data for the synthesizer.
   - At a minimum, this requires editing the preprocessing scripts.
   - You will likely need to write your own code to process the data into a suitable format for the toolbox (a rough sketch follows this list).
   - **We do not have a tutorial for this. You are on your own here!**
5. Continue training the pretrained synthesizer model on your dataset until it has converged.
6. Using your new synthesizer model, preprocess your dataset to generate training data for the vocoder.
7. Continue training the pretrained vocoder model on your dataset until the output is satisfactory.
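For step 4, the exact code depends on how your recordings are stored. As a rough sketch only (not the toolbox's own preprocessing), the following rearranges a flat folder of WAV recordings plus a `transcripts.csv` file into a LibriSpeech-style layout, which is the structure the training wiki's steps work from. The folder names, speaker/chapter IDs, and CSV columns are all assumptions you would adapt to your own data:

```python
# Sketch: convert my_recordings/*.wav + transcripts.csv (columns: filename,text)
# into a LibriSpeech-like tree so the existing preprocessing scripts can pick it up.
import csv
from pathlib import Path

import soundfile as sf

SRC = Path("my_recordings")                                    # hypothetical flat folder of WAVs
DST = Path("datasets_root/LibriSpeech/train-custom/1000/100")  # "speaker" 1000, "chapter" 100
DST.mkdir(parents=True, exist_ok=True)

with open(SRC / "transcripts.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

trans_lines = []
for i, row in enumerate(rows):
    utt_id = f"1000-100-{i:04d}"
    audio, sr = sf.read(SRC / row["filename"])
    sf.write(DST / f"{utt_id}.flac", audio, sr)                    # LibriSpeech stores FLAC
    trans_lines.append(f"{utt_id} {row['text'].strip().upper()}")  # transcripts are uppercase

# One transcript file per speaker/chapter folder, as in LibriSpeech
(DST / "1000-100.trans.txt").write_text("\n".join(trans_lines) + "\n", encoding="utf-8")
```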
With luck, your trained models will now generalize to your voice and impart the desired accent. There are no guarantees this will work.
If you succeed, please share your models and I will add them to the list in #400.
I will give it a try. Thanks for the guidance, friend.
@ash1407 Are you still trying? When you get to step 4 (synthesizer preprocessing on new dataset), pull the latest master. The #441 changes should make this step a lot easier.
If using AccentDB, will you finetune a single accent or just throw them all into the mix? It would be interesting to find out if this is enough voices to generalize well for cloning. Also see my latest reply in #437; it is a promising result to see the synthesizer acquire the accent after a small number of steps (with the caveat that I finetuned with data from a single speaker).
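If you go with AccentDB, it may help to check how much audio each accent actually contains before deciding whether to finetune on one accent or mix them. The snippet below is only a sketch; the extraction path and the one-folder-per-accent layout are assumptions about how the archive unpacks, so adjust them to what you actually see on disk:

```python
# Sketch: tally hours of audio per top-level folder (assumed: one folder per accent)
# to judge whether a single accent has enough data to finetune on, or whether to mix.
from pathlib import Path

import soundfile as sf

ROOT = Path("accentdb_extended")  # hypothetical extraction path

for accent_dir in sorted(p for p in ROOT.iterdir() if p.is_dir()):
    seconds = 0.0
    for wav in accent_dir.rglob("*.wav"):
        info = sf.info(wav)
        seconds += info.frames / info.samplerate
    print(f"{accent_dir.name:>12}: {seconds / 3600:.1f} h")
```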
I don't have an NVIDIA GPU. Any idea which GPU I should purchase for machine learning? (I have a budget of 4000 INR.)
So I've got some good news and bad news.
- Bad news first: In #437 (comment) I mention trying to add an accent using the VCTK dataset, and it does not generalize to all speakers. You need to train a synthesizer from scratch to impart an accent with zero-shot cloning.
- Good news: If you only require a single speaker, you can finetune a model in a matter of hours on CPU. (You also need to prepare the dataset, with recordings and text file transcripts, and preprocess them.) Here are my latest results and an example to follow: #437 (comment)
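Before kicking off a CPU finetune, it is worth a quick sanity pass over the single-speaker dataset: every recording should have a matching transcript line, and the total duration should be long enough to be worth training on. This is only a hypothetical helper; the `wavs/` folder and the `transcripts.txt` format below are assumptions, not something the toolbox prescribes:

```python
# Hypothetical sanity check for a single-speaker finetuning set.
# transcripts.txt assumed to hold lines of the form "<wav name without extension>|<text>".
from pathlib import Path

import soundfile as sf

WAV_DIR = Path("my_voice/wavs")
TRANSCRIPTS = Path("my_voice/transcripts.txt")

texts = {}
for line in TRANSCRIPTS.read_text(encoding="utf-8").splitlines():
    if "|" not in line:
        continue
    name, text = line.split("|", 1)
    texts[name.strip()] = text.strip()

total = 0.0
for wav in sorted(WAV_DIR.glob("*.wav")):
    if wav.stem not in texts:
        print(f"missing transcript: {wav.name}")
        continue
    info = sf.info(wav)
    total += info.frames / info.samplerate

print(f"{len(texts)} transcripts, {total / 60:.1f} minutes of audio")
```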
@ash1407 If you're not working on this actively then I'll close the issue for now. Reopen it when you're ready to give it a try.
Has anyone got results for training an Indian accent? Please let me know.
Hi, I have looked through your comments. I need to clone my own voice with my accent so I can produce it from text. Can you share step-by-step directions? I also opened issue #1228.
