TensorSpeech/TensorFlowTTS

Fine-Tuning with a small dataset

OscarVanL opened this issue · 127 comments

Hello!

I'm trying to evaluate ways to achieve TTS for individuals that have lost their ability to speak, the idea is to allow them to regain speech via TTS but using the voice they had prior to losing their voice. This could happen from various causes such as cancer of the larynx, motor neurone disease, etc.

These patients have recorded voice banks, a small dataset of phrases recorded prior to losing their ability to speak.

Conceptually, I wanted to take a pre-trained model and fine-tune it with the individual's voice bank data.

I'd love some guidance.

There are a few constraints:

  1. The patient-specific data bank is not a large dataset, it's approximately 100 recorded phrases.
  2. Latency must be low, we hope for real-time TTS. Some approaches use a pre-trained model followed by vocoders, in our experience, this has been too slow, with latencies of about 5 seconds.
  3. The trained model must work on an Android app (I see there is already an Android example, which has been helpful)

I'd love your guidance on the steps required to achieve this, and any recommendations on which choices would give good results...

  • Which model architectures will tolerate tuning with a small dataset?
  • The patients have British accents, whereas most pre-trained models have American accents. Will this be a problem?

Do you have any tutorials or examples that show how to achieve a customised voice via fine-tuning?

@OscarVanL Hi, great idea :D. Here is some guidance for customizing a voice via fine-tuning:

  • About latency, FastSpeech2 + MB-MelGAN is enough for this case; it can run in real time on mobile devices with good generated voice quality.
  • You can use an LJSpeech pretrained model and fine-tune it on your patient-specific data. Since your dataset is small (100 recorded phrases), many words will be missing, so you just need to fine-tune the speaker-embedding layers and add some FC layers at the end of the FastSpeech2 model (you can also fine-tune the PostNet in FastSpeech2) to let the model transfer the American accent to a British accent. I will make a PR to let the model train only some layers rather than all layers :D (see the sketch below).
  • About MB-MelGAN, you can train on a larger dataset with many speakers to achieve a universal vocoder, so you can use that universal version with your FastSpeech2 without fine-tuning.
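
For illustration, selective training like this can be implemented roughly as below (a minimal sketch; the actual var_train_expr handling in the PR may differ):

    import re

    # Hypothetical helper: keep only the variables whose names match the
    # var_train_expr regex and hand that subset to the optimizer.
    def select_trainable_variables(model, var_train_expr):
        if var_train_expr is None:
            return model.trainable_variables  # train everything
        pattern = re.compile(var_train_expr)
        return [v for v in model.trainable_variables if pattern.search(v.name)]

    # Example: fine-tune only speaker-related layers and the PostNet.
    # trainable_vars = select_trainable_variables(fastspeech2, "speaker|postnet")
    # optimizer.apply_gradients(zip(gradients, trainable_vars))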

@ZDisket can you share some of your experience fine-tuning a voice from female -> male on your small dataset :D.

@OscarVanL @dathudeptrai
FastSpeech2 is definitely the right architecture; it's very tolerant of small datasets (my guess is because it doesn't have to learn to align them). I've had success fine-tuning on even 80 seconds of audio, although that was female -> female, but there shouldn't be a problem with male voices, which I've also had success with.
However, I've had little success when fine-tuning MB-MelGAN, as there is always a lot of loss or background noise (which is why I integrated RNNoise into my frontend), so a universal vocoder is the way to go.

Wow, thank you both for the detailed replies. That's really helpful!

@dathudeptrai Thank you for offering to make a PR to help train selected layers.

@ZDisket It's great to hear your success even with a limited dataset. Fortunately we have much more than 80 seconds of audio even in the worst cases.

Could you explain the idea of a universal vocoder to me? How is it possible to get a customised voice using a universal vocoder without fine tuning?

This is all very new to me, but very exciting.

@OscarVanL Conventional text-to-speech works with a text2mel model, which converts text to spectrograms, and a vocoder, which turns spectrograms into audio. Training a vocoder on many, many different voices can achieve a "universal vocoder" which can adapt to almost any speaker. I know the owner of vo.codes uses a (MelGAN) universal vocoder. You'll still have to fine-tune the text2mel, though.
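
For a concrete picture, the two stages look roughly like this (a hedged sketch based on the repo's README-style inference API; the pretrained-model identifiers and exact signatures are assumptions and may not match the version you have installed):

    import tensorflow as tf
    import soundfile as sf
    from tensorflow_tts.inference import AutoProcessor, TFAutoModel

    # Stage 1 (text2mel): text -> phoneme/character IDs -> mel spectrogram.
    processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

    # Stage 2 (vocoder): mel spectrogram -> waveform.
    mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

    input_ids = processor.text_to_sequence("Hello world.")
    _, mel_after, _, _, _ = fastspeech2.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),   # selects the voice in multi-speaker models
        speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    )
    audio = mb_melgan.inference(mel_after)[0, :, 0]
    sf.write("hello.wav", audio.numpy(), 22050)   # LJSpeech-based models are 22.05 kHz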

Thank you for the explanation.

So my understanding is that I will have to train a FastSpeech2 text2mel model to create patient-specific mel spectrograms. This will involve me taking a LJSpeech pretrained model, then fine-tuning as described by @dathudeptrai with patient voice data.

After this, are there pre-trained MelGAN Universal Vocoders available to download that have already been trained on many voices, or is this something I would need to do myself?

Finally, are Universal Vocoders tied to a specific text2mel architecture (Tacotron, FastSpeech, etc), or can a Universal Vocoder take any mel spectrogram generated by any text2mel architecture?

@OscarVanL

After this, are there pre-trained MelGAN Universal Vocoders available to download that have already been trained on many voices, or is this something I would need to do myself?

There are three MelGANs: regular MelGAN (lowest quality), ditto + STFT loss (somewhat better), and Multi-Band (best quality and faster inference); you can hear the differences on the demo page. There's also ParallelWaveGAN, but it's too slow on CPU to consider.

As for pretrained models, there are none trained natively with this repo on large multi-speaker datasets (I have two trained on about 200 speakers, one 32KHz and the other 48KHz, but they don't work well outside those speakers), but there are notebooks to convert trained models from kan-bayashi's repo: https://github.com/kan-bayashi/ParallelWaveGAN (which has a lot) to this one's format. I forgot where they were, so you'll have to ask @dathudeptrai.

Finally, are Universal Vocoders tied to a specific text2mel architecture (Tacotron, FastSpeech, etc), or can a Universal Vocoder take any mel spectrogram generated by any text2mel architecture?

A mel spectrogram is a mel spectrogram no matter where it comes from, so yes, as long as the text2mel and vocoder's data is processed the same (same normalization method, mel frequency range, etc).
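
As a concrete illustration, these are the kinds of settings that must agree between the two (a minimal sketch with assumed parameter values; check your actual preprocessing configs):

    import librosa
    import numpy as np

    def log_mel(wav, sr=24000, fft_size=2048, hop_size=300, num_mels=80, fmin=80, fmax=7600):
        # Both the text2mel and the vocoder must be trained on mels computed with the
        # same sr/fft_size/hop_size/num_mels/fmin/fmax and the same log/normalisation.
        mel_basis = librosa.filters.mel(sr=sr, n_fft=fft_size, n_mels=num_mels, fmin=fmin, fmax=fmax)
        spc = np.abs(librosa.stft(wav, n_fft=fft_size, hop_length=hop_size))
        return np.log10(np.maximum(1e-10, np.dot(mel_basis, spc))).T  # shape: (frames, num_mels)

    # wav, _ = librosa.load("sample.wav", sr=24000)
    # mel = log_mel(wav)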

Thank you once again for helping with my noob questions! I'll definitely check out that resource with trained models.

that's interesting subject,
Is there any example of how to fine-tune MB-MelGAN with a pretrained model? The README only says to load a pretrained model and train from scratch with other languages. Can you explain more?
thanks

@OscarVanL I just made a PR for custom trainable layers here (#299).

@Zak-SA You can try to train a universal vocoder, or load the weights from the pretrained model list and then train as normal (follow the README).

Amazing, thank you to both of you for going above and beyond to help!

A few more questions, as I didn't see any documentation on preparing the dataset; I'm looking to prepare some data for fine-tuning.

Do I need to strip punctuation from the text? E.g.: ()`';"-

Are there any other similar cases I should consider when preparing the transcriptions?

Does the audio filetype matter? I have 44100Hz Signed 16-bit PCM WAVs. (Edit: These files produced no errors during preprocessing/normalisation, but they should be mono, not stereo)

Some early observations going through the steps in examples/mfa_extraction/README.md and examples/fastspeech2_libritts/README.md with my own dataset...

  • Your dataset should be in mono, or else during one of these steps the script will fail.

  • Your dataset should not use dashes in the name. My dataset was named as audio-1.wav, audio-2.wav. In fix_mismatch.py this will cause the script to fail.

  • The audio will automatically be down-sampled from 44100Hz to the required 24000Hz (if you prefer to resample yourself, see the sketch after this list).

  • 16-bit PCMs are fine.

  • Audio clips should not exceed 15 seconds in duration, or you will run out of memory when training the model.
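
If you would rather do the conversion yourself before preprocessing, here is a small sketch using librosa and soundfile (the folder name is just an example):

    from pathlib import Path
    import librosa
    import soundfile as sf

    # Resample 44.1 kHz (possibly stereo) WAVs to 24 kHz mono 16-bit PCM in place.
    for wav_path in Path("my_dataset").glob("*.wav"):
        audio, _ = librosa.load(str(wav_path), sr=24000, mono=True)
        sf.write(str(wav_path), audio, 24000, subtype="PCM_16")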

Hi,

I've begun fine-tuning with the guidance given by @dathudeptrai :)

I've taken the LJSpeech pretrained model "fastspeech2.v1" to fine-tune.

I took the fastspeech2.v1.yaml config (designed for the LJSpeech dataset) and made only one change: I set var_train_expr: embeddings from the PR dathudeptrai made. I was unsure what other hyperparameters to change.

Here you can see the TensorBoard results for training the embedding layers...
image

Using the fastspeech2_inference notebook, followed by the multiband_melgan_inference notebook using the libritts_24k.h5 Universal vocoder I got these results...

At 5000 steps: audio, spectrogram

At 15000 steps: audio, spectrogram

At 80000 steps: audio, spectrogram

Obviously, this sounds bad because I have only trained embedding layers.

I would now like to add some FC layers at the end, as you suggested, but am not sure how I do this.

Based on my tensorboard results, how many steps do you think I should tune the embedding layers before I stop and begin to train the FC layers?

Do you advise making any changes to the hyperparameters in fastspeech2.v1.yaml?

@OscarVanL Can you try to train the whole network (var_train_expr: null) and report the TensorBoard here? Then I can give you the right way to go :D.

@dathudeptrai Here's my tensorboard with 120k steps with var_train_expr: null.
image

OK, I can see what the problem with your dataset is :D. I want you to try training the model with var_train_expr: "speaker|embeddings|f0_predictor|energy_predictor|duration_predictor|f0_embeddings|energy_embeddings|mel_before|postnet|decoder/layer_._0", which means you should fine-tune:

  1. all speaker-specific layers (because your dataset has a different speaker than the pre-trained model).
  2. phoneme embeddings (note that your pre-trained model used characters rather than phonemes, so you should train the phoneme embeddings from scratch).
  3. F0/energy/duration predictors, because these are speaker characteristics.
  4. mel_before/postnet and the first layer of the decoder should be retrained as well.

I do not know if it will work or not because the pretrained model you are using is character-based; if you find a phoneme-based pretrained model then you do not need to fine-tune the phoneme embeddings. @ZDisket do you have any FS2 phoneme pretrained model?

Ok I will try that now 👍 Thanks!

After that, maybe you should try hop_size 240 for 24k audio and try again. MFA uses a 10 ms frame shift to calculate durations, so the hop_size should be 240 to match exactly the durations extracted from MFA; if we use 300 or 256 then we have to round the durations, and the rounded durations are not precise :D.
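
For reference, the arithmetic behind that number (a trivial check, assuming MFA's 10 ms frame shift):

    sample_rate = 24000        # Hz
    frame_shift = 0.010        # seconds, MFA's frame shift
    hop_size = int(sample_rate * frame_shift)
    print(hop_size)            # 240 samples per frame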

I wanted to ask a question about mfa duration...

My recordings are 44100Hz. For txt_grid_parser --sample_rate, do I use 44100 or 24000? The later preprocessing stage downsamples to 24000, but txt_grid_parser is run before downsampling.

I think it is 44100, but we may need to ask @machineko.

Either method should work; you just need to change the sample rate used for the calculation in preprocessing afterwards (downsampling first should work better, but the results shouldn't be noticeably different, as a small difference in durations shouldn't affect FS2, according to the paper).

1 vote for downsampling first :))) @OscarVanL

I agree. I think downsampling first will avoid any confusion or mistakes.

@dathudeptrai I have two phoneme LJSpeeches, 22KHz and (upsampled) 24KHz with LibriTTS preprocessing settings like in kan-bayashi repo. But the phoneme IDs might differ

OK, I have downsampled to 24000Hz, redone all of the mfa extraction, preprocessing, normalisation, and changed hop_size to 240. I am training the layers you suggested. I will update you with a new tensorboard tomorrow :) Thank you for all your comments.

@dathudeptrai Here's my TensorBoard for that last attempt.
image

The model overfits too much. In this case, I think you should pretrain your model on the LibriTTS dataset so that you do not need to retrain the embedding layers. It seems your validation data contains many words/phonemes the model has not seen in the training data (you can check this), which is why the validation loss increases while the training loss decreases.

@dathudeptrai

Thank you. I will try to record some more phrases using more common words.

Currently I am using 15 minutes of recordings I made while reading a book; maybe the words used are not diverse enough. (The book is scientific, so it uses some rare words, which may not match the training data well.)
I will try reading one of the same books used in the LibriTTS dataset, and some common phrases.

When you say to train embeddings with LibriTTS, is that to solve the problem you mentioned before?

phoneme embeddings (note that your pre-trained model used characters rather than phonemes, so you should train the phoneme embeddings from scratch).

So, I would first pretrain the LJSpeech fastspeech2.v1 model on LibriTTS to correct the phoneme embeddings, and then fine-tune with my dataset.

Yes, the key to fine-tuning TTS on a small dataset is that you should first train good phoneme/character embedding layers. Then when you fine-tune, the model just needs to learn the speaker's accent. The keyword for your problem is voice cloning; you can refer to this paper (https://arxiv.org/pdf/1806.04558.pdf). Whatever your approach (fine-tuning TTS or voice cloning), you still need to pretrain your model on a large multi-speaker dataset (such as LibriTTS).

You can also mix your dataset with other speakers and it should learn fine, so you don't need to fine-tune the model afterwards 😁

I don't understand how this would work. I am trying to make a TTS model that is specific to an individual's voice (voice cloning). If I train only on a mixture of voices, how will this become specific to one voice? Surely there will always need to be some fine tuning?

You can train it all in one :)). We have speaker embeddings :D. After training, you just need to pass the speaker ID into the model :D

So just to clarify that this should be my next attempt:

  1. Take pre-trained LJSpeech model (fastspeech2.v1) from examples.

  2. Add the dataset for the voice I wish to clone to LibriTTS dataset as another speaker.

  3. Continue training on top of the pre-trained LJSpeech model, training all layers, using --config fastspeech2libritts.yaml and --dataset_config libritts_preprocess.yaml

  4. When doing inference, pass the speaker ID for the voice I want to clone to the inference function.

Or, should I train from scratch and skip step 1? How long does this take? I have an RTX 2070 8GB.

Furthermore, in fastspeech2libritts.yaml, do I need to change n_speakers? I see that in my LibriTTS train-clean-100 download there are 247 speakers (248 after adding in my voice to clone).

@OscarVanL You can train on the full dataset, but I found it a lot easier and faster on a smaller subset with longer audio samples per speaker.

If you want to follow the 20-speaker route (20 speakers with more than 20 min of recorded audio each) -> https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2_libritts/libri_experiment/prepare_libri.ipynb

For 248 speakers on a 2070 you'll probably need about 3+ days of training (~25 speakers [20 LibriTTS and 5 of mine] on an RTX 2080 Ti took me about 14h for good coverage, still not optimal).

It's always better to start from a pretrained model; it's just easier to train.

I can also train your voicebank dataset + LibriTTS for 1/2 days on my RTX 2080 Ti if you send me prepared scripts for training and the dataset, so I don't need to do anything more :)

@machineko
Thanks for the tips; that makes a lot of sense as to why n_speakers is preconfigured at 20.

So you suggest taking the LibriTTS dataset, picking 20 speakers each with >20 mins of audio, then adding my dataset (approx. 25 mins), to give a total of 21 speakers, and training on that.

When I pick these 20 speakers, does it matter what gender they are? For example, if I am trying to clone a Female voice, should I pick only female speakers? Or is a mix OK?

When you train models with your 25 speakers, do you begin training with the repository's pre-trained LJSpeech model, or another model? How good is the result? Does inference sound similar to your 5 speakers' voices?

That is a very kind offer to use your hardware, but I am OK with waiting longer :)

I start from pretrained LJSpeech. Results were really good, but my dataset quality is also good enough for training even without the LibriTTS dataset.

A mix of voices should be OK; only female should work fine too :P

That's good to know. Thank you :)

When you say your dataset quality is good enough even without libri, how much speech do you have per speaker?

40 min to 6h per speaker

Hi, I would like to achieve a similar result as is discussed above (fine tuning with a small data set). I am struggling to follow the FastSpeech2 End to End tutorial though.

Is there a Colab notebook for this process? In particular the first section - creating the dataloader.

Or perhaps if anyone has performed the process of fine-tuning, they could share the commands used from their terminal history?

Thanks in advance.

@vocajon
Hi,

It is a little unclear at first. Personally, I didn't need to create a dataloader (this confused me too). This is what I did:

The first set of instructions are here. Follow step 0 to prepare the LibriTTS dataset structure. If you want to add your own speaker on top of the LibriTTS speakers (as I do), copy the structure in the output dataset. Add your own speakers as extra folders in the dataset path, and copy the .wav, .txt structure.

Then follow these instructions for MFA and duration extraction, using the libritts config just as in the instructions.

Continue from the first set of instructions again to setup the Docker container, preprocess, normalize.

The final step of those instructions says to run train_libri.sh; this is just a script that starts the training and should be modified to your needs. The arguments for starting training are better explained here. For fine-tuning you're probably going to need to use the --pretrained argument.

Hope that helps you get started.

@OscarVanL thank you very much. Just what I was looking for. Will try following these instructions.
BTW, what have your results been like? Have you had success with real life datasets from your voice bank?

It's still a work in progress, I'm not using any patient data yet, only experimenting with clips I record of my own voice. I have no results yet, I am about to start training with the approach I mentioned a couple of days ago in these comments.

@machineko
When I start with the pretrained LJSpeech model-150000.h5 using the fastspeech2libritts.yaml file along with Libri speakers, I get these warnings:

WARNING:tensorflow:Skipping loading of weights for layer embeddings due to mismatch in number of weights (5 vs 2).
2020-10-26 19:45:51,624 (hdf5_format:765) WARNING: Skipping loading of weights for layer embeddings due to mismatch in number of weights (5 vs 2).
WARNING:tensorflow:Skipping loading of weights for layer decoder due to mismatch in number of weights (68 vs 65).
2020-10-26 19:45:51,707 (hdf5_format:765) WARNING: Skipping loading of weights for layer decoder due to mismatch in number of weights (68 vs 65).
WARNING:tensorflow:Skipping loading of weights for layer f0_predictor due to mismatch in number of weights (13 vs 10).
2020-10-26 19:45:51,719 (hdf5_format:765) WARNING: Skipping loading of weights for layer f0_predictor due to mismatch in number of weights (13 vs 10).
WARNING:tensorflow:Skipping loading of weights for layer energy_predictor due to mismatch in number of weights (13 vs 10).
2020-10-26 19:45:51,723 (hdf5_format:765) WARNING: Skipping loading of weights for layer energy_predictor due to mismatch in number of weights (13 vs 10).
WARNING:tensorflow:Skipping loading of weights for layer duration_predictor due to mismatch in number of weights (13 vs 10).
2020-10-26 19:45:51,726 (hdf5_format:765) WARNING: Skipping loading of weights for layer duration_predictor due to mismatch in number of weights (13 vs 10).
2020-10-26 19:45:51,755 (train_fastspeech2:411) INFO: Successfully loaded pretrained weight from ./pretrained/model-150000.h5.

I then run out of memory.

I found out that changing the batch_size in fastspeech2libritts.yaml from 32 to 16 stopped the out of memory issue, but am still concerned about the skipping of loading weights. Is this something you experienced?

I don't know about that; I was loading the weights manually per layer. The embeddings shouldn't be loaded, but the rest should work fine, I think.
@dathudeptrai

Also, one trick for transfer learning is to load the weights for the bigger layers (like the embedding), randomly initialise the part of the layer that doesn't exist in the pretrained model, and then retrain it all with a lower learning rate.
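
A hedged sketch of that trick (not the repo's loader; it assumes the layer names and weight ordering line up between the two models):

    import numpy as np

    def load_partial_weights(model, pretrained_model):
        for layer in model.layers:
            try:
                src = pretrained_model.get_layer(layer.name)
            except ValueError:
                continue  # no counterpart in the pretrained model
            merged = []
            for w_new, w_old in zip(layer.get_weights(), src.get_weights()):
                if w_new.ndim != w_old.ndim:
                    merged.append(w_new)          # incompatible tensors: keep the random init
                    continue
                out = np.array(w_new)             # start from the fresh random init
                common = tuple(slice(0, min(a, b)) for a, b in zip(w_new.shape, w_old.shape))
                out[common] = w_old[common]       # overwrite the part that exists in the pretrained model
                merged.append(out)
            layer.set_weights(merged)

    # Then fine-tune everything with a lower learning rate.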

I start from pretrained LJSpeech. Results were really good, but my dataset quality is also good enough for training even without the LibriTTS dataset.

A mix of voices should be OK; only female should work fine too :P

Hi there, this information is really helpful. I was just wondering how many iterations you train for with your own dataset on top of the pretrained LJSpeech?

Here's my tensorboard after training for 18 hours.

image

As mentioned before, I started with the LJSpeech pretrained model, then trained all layers using 20 speakers from LibriTTS with at least 20 minutes of speech, and adding 32 minutes of the speaker to clone. (Total: 8 hours of speech)

The settings used are fastspeech2libritts.yaml, with n_speakers set to 21 and batch_size set to 16 (as I run out of memory otherwise).

Any ideas on how I can improve the training?

@OscarVanL how about adding more data? BTW, I just committed a change to use the official swish, so memory consumption should be reduced.

I've increased the speakers from 20 to 40 from LibriTTS, to give 14 hours of speech and will try again.

Unfortunately, even with the latest repo it still runs out of memory with batch_size 32, so I am going to keep this at 16, unless you have other suggestions :)

If you ignore all samples whose mel_length >= 850 then you can train with batch_size 32, but I always train with batch_size 16 :D. I will also add gradient accumulation soon so you can train with a very large batch_size :D. BTW, do not forget that MFA uses 10 ms frames, so you should choose the right hop_size to maximize performance (closer to 240 is better when the original data's sampling rate is 24K); a validation duration loss around 0.04->0.08 is OK.
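
For illustration, a small sketch of that length filter (the dump folder layout and the frames-first mel shape here are assumptions; adjust them to your preprocessing output):

    import numpy as np
    from pathlib import Path

    mel_dir = Path("dump/train/norm-feats")    # assumed dump layout
    kept, dropped = [], []
    for mel_path in sorted(mel_dir.glob("*.npy")):
        mel = np.load(mel_path)                # assumed shape: (frames, num_mels)
        (kept if mel.shape[0] < 850 else dropped).append(mel_path.stem)
    print(f"kept {len(kept)} utterances, dropped {len(dropped)} with >= 850 frames")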

I will continue to use batch_size 16 if you also use it without problems :)
I have been using hop_size 240 as you suggested before. Thanks for explaining the advantage.

@OscarVanL Does the generated audio sound fine? A slightly higher loss != a worse model in the case of FS2.

Do you mean the sounds in outdir/predictions/****_wav? They sound bad.

Here's some predictions from 180,000 steps from the model in the last tensorboard I sent:
9_after.wav
9_before.wav
9_gt.wav
One thing I noticed is that even the ground truth sounds poor; is this just because a basic vocoder is used to generate these samples?

Yes, you should compare the GT vs the after/before samples; you still need to train the vocoder for better quality. Also, you sent the 'before' sample twice ('before' is the mel prediction before passing through the PostNet, but in most cases the results are almost the same for both) and no ground-truth sample.

Also, if FS2 doesn't work you can try training Taco2.

I think you should use GL (Griffin-Lim) first as a sanity check rather than the vocoder :D. https://github.com/TensorSpeech/TensorFlowTTS/blob/master/notebooks/griffin_lim_tensorflow.ipynb

He is using GL 😃

@machineko hmm, even with the ground truth, somehow his GL generated bad audio? I think it should be better :3

He didn't send GL from the GT data yet (he sent the 'before' GL two times) 😄

Also, you sent 2x 'before'

Oops, I updated the links :)

@OscarVanL how about adding more data

Here's the tensorboard after adding more data. (40 speakers LibriTTS, 32 minutes of my speaker = 14 hours speech.)

image

It looks like the mel loss is better this time, but not duration/energy/f0.

These are from 215k steps:

b'8468_286673_000020_000003'_gt.wav

b'8468_286673_000020_000003'_before.wav

b'8468_286673_000020_000003'_after.wav

This is the best model so far; you can just about understand the words, but it is still robotic and unclear.

Do you think it would be advantageous to continue to train more steps?

Maybe you need to continue adding more data :v. Energy and F0 always overfit and that's OK :))); a duration loss below 0.1 is OK. The best model should be in the range of 80k -> 150k steps :D. You can also add the LJSpeech data :D.

OK thanks! Do you know how many hours of speech people have had success with in the past when training LibriTTS so I get a better idea of what works?

In the Taco2 blog/paper they transfer-learn from LJSpeech using 30 min of speech and get good results; the optimum was about 1.5-2h if I remember correctly, but of course more data => better results.

GL will always sound robotic; try a universal vocoder.

Yes I was about to try the model using the MB-MelGAN universal vocoder, I'm fighting anaconda at the moment :c

Interesting about taco2. The application for this model is on a mobile app with low latency, so FastSpeech2 appealed because of its fast synthesis and ability to run on low-end hardware. Am I right in my assumption Taco2 is more demanding for inference or has a higher latency than FS2?

OK I'm going through the fastspeech2_inference Notebook...

But after the 'Save to pb' section I get this error:
error

If I change to the libritts mapper, I then get this error:
phoneme error

Am I supposed to be using the LibriTTS or LJSpeech processor?

You are using train mode, so all text should be passed as phonemes.
This notebook is/was very old; just create phonemes using g2p_en and pass them through the model.

Remove all empty strings from the g2p_en output, as they are not in your mapper for LibriTTS. (You join it, adding another layer of empty strings, and you end up with a string containing double spaces.)

Something like this should work ->

input_phonemes = " ".join([i for i in input_phoneme_list if i != " "])

But trivial problems like this you should just debug yourself; you can use => https://github.com/jupyterlab/debugger or PyCharm/VS Code.

Yeah thank you, I spotted the mistake in the end (which is why I deleted my reply).

For some reason, the shape of the generated mel is very short.

processing phonemes

building FS2 & loading weights

inference

Am I doing the inference correctly, or has this changed since the notebook was written?

EDIT: This was just a bad model :)

Something is wrong; I can't check the code right now as I'm not at home (and I'm not the author of the notebook). I'll tag @dathudeptrai and let's wait for him :)

While waiting for a response, you can check whether you loaded the weights properly by saving the layer weights from training to an npy file and then checking the weights in the notebook.

image

It looks like adding more data helped; orange is with 14h of data, blue with 19.75h.

:))) let's continue adding more data :))))

image
Better again with more data; it looks like f0 is overfitting less. Red line = 24 hrs of speech (81 speakers).

I will continue to add more data until the improvements stop :) I will try 120 speakers (32 hours) next.

Hi, with 120 speakers (32 hours) I am seeing higher than expected f0 loss.
image

Is there any reason for this? It does not seem to be improving. Maybe the dataset is too large now?

The eval dataset is about 5% of the full dataset if you don't change this value, so be aware of some randomness in the results on the eval graphs :)

Hi there, I'm trying to follow this thread to do my own training on a small dataset (~40 mins) but I am getting lost and have a few questions.

  1. For the universal vocoder, should I just download the MB-Melgan available under examples and use it, or should I continue training it on my small dataset or a multi-speaker dataset first?
  2. What is MFA extraction and when should I do it?
  3. The samples in my dataset are not a consistent sample rate, should I make them all the same (24000hz)?
  4. During inference, what is the processor, and what is the difference between the ljspeech mapper and libritts mapper?

I probably have more questions but I think if I know these things it'll get me started on the right track. This thread has been very helpful so far in helping me learn the most effective way to train on a small dataset so thank you.

Hi Gavin, I'm also new to this (hence why this thread has made it to 74 replies), I'll summarise what I can:

  1. My understanding is that a Universal vocoder is fine (and the MB-MelGAN one is ready-to-use), but you can expect to get better results if you fine-tune it on your speaker.
  2. MFA (Montreal Forced Aligner) is a feature-extraction stage. Your .wav files are aligned to the .txt transcriptions to make a pronunciation dictionary for the phonemes used in pronouncing each word. This is how the model generalises to unseen words.
  3. One of the preprocessing scripts for LibriTTS should resample the audio, but I was suggested to do it myself. There's a good command-line utility called SoX which can help with resampling in bulk.
  4. The processor is what takes some text, e.g. "hello", and converts it into the relevant symbols for the model you have trained. Some models use character symbols (h,e,l,l,o,...), others use phoneme symbols (HH, AH0, L, OW1), so there are different processors for different models, depending on which input symbols they take (a small sketch follows this list).
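
A minimal sketch of the difference (the symbol inventory and ID mapping here are simplified assumptions, not the repo's actual mappers):

    from g2p_en import G2p

    text = "hello"

    # Character-based processor: one symbol per letter.
    char_symbols = list(text)                                 # ['h', 'e', 'l', 'l', 'o']

    # Phoneme-based processor: ARPAbet symbols, e.g. via g2p_en.
    phoneme_symbols = [p for p in G2p()(text) if p != " "]    # ['HH', 'AH0', 'L', 'OW1']

    # Either way, a mapper turns the symbols into integer IDs for the model.
    symbol_to_id = {s: i for i, s in enumerate(sorted(set(phoneme_symbols)))}
    input_ids = [symbol_to_id[s] for s in phoneme_symbols]
    print(char_symbols, phoneme_symbols, input_ids)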

I see, thank you for the answers. I am going to have a go at training with 52 speakers + my dataset, and see how I go.

One more question: when you add your dataset to LibriTTS, do you add the speaker name and some speaker ID (e.g. 000) to the speaker.txt file, or is that not necessary? (i.e. is it enough to just format the dataset in the same structure as the other speakers?)

Good question. I first run the examples/fastspeech2_libritts/libri_experiment/prepare_libri Jupyter notebook to prepare LibriTTS into the correct folder structure.

After this, open the output folder, create a new folder with a speaker ID, I used "1".

Add your .wav and .txt files inside this folder. It's also important that your files follow the same naming scheme, starting with the speaker ID, so I named mine like so:

LibriTTS-Formatted/1/
1_bookName_001.wav
1_bookName_001.txt
1_bookName_002.wav
1_bookName_002.txt
..

You don't have to change speakers.txt; the speaker metadata is inferred from the dataset folder structure for LibriTTS.
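
If it's useful, here is a small sketch of copying a voicebank into that layout (the paths and the book name are just placeholders):

    from pathlib import Path
    import shutil

    src = Path("my_voicebank")            # pairs like phrase_001.wav / phrase_001.txt
    dst = Path("LibriTTS-Formatted/1")    # "1" is the new speaker ID
    dst.mkdir(parents=True, exist_ok=True)

    for i, wav in enumerate(sorted(src.glob("*.wav")), start=1):
        txt = wav.with_suffix(".txt")
        # File names must start with the speaker ID (and avoid dashes, see earlier in the thread).
        shutil.copy(wav, dst / f"1_bookName_{i:03d}.wav")
        shutil.copy(txt, dst / f"1_bookName_{i:03d}.txt")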

Hi, I've noticed a problem across multiple models I've trained. The speech sounds tolerable, but then at the end of the sentence, it trails off into garbage.

Here are a few FS2 models saying three sentences of different lengths:

  1. "Hello world this is a test of the voice text to speech"
  2. "This is a test"
  3. "The end of the text always goes bad"

This is using the MB-MelGAN universal vocoder :)

Model 8 (24 hours, 80000 steps) (1) (2) (3) (Tensorboard)

Model 9 (32 hours, 80000 steps) (1) (2) (3) (Tensorboard)

Model 10 (29.5 hours, 80000 steps) (1) (2) (3) (Tensorboard)

If I pass in different speaker IDs, some sound better, but they all have this problem with the end of the speech going wrong.

Any suggestions on what I can do to fix this? Could it be related to my training audio clips? I am using LibriTTS clips and my own data between 2-15 seconds.

@OscarVanL Did you trim the dataset using the multispeaker preprocessing? (It removes both the SIL and END tokens from the end of the sentence; if so, you also need to remove the END and SIL tokens from the end of the sentence in your inference example.)

prepro

@dathudeptrai Thanks for the link. Based on that I should train at hop_size: 300 instead of 240, and set fft_size: 2048 to match the multiband_melgan.v1_24k universal vocoder.

Should hop_size in the preprocess config and the model hyperparameters both be set to 300? I noticed I have been training with hop_size mismatched as 256 and 240 all this time.

@machineko
For the preprocessing, I am using the /preprocess/libritts_preprocess.yaml config, which has trim_mfa: true and trim_silence: true. I will change to ZDisket's librittsv2_preprocess.yaml config instead.

At inference, my sentences do not have any END or SIL symbols at the end of the sentence.
image
Adding the symbols to the sentences did not help.

@OscarVanL Yes, hop_size should be the same in the model parameters as in the config params (I was training with hop_size 256, but it really shouldn't make a difference).

Awesome, I'll try re-training with these new params :)

@OscarVanL OK, one more thing: try removing every SIL token from input_ids and pass it through again (it will for sure fix a few performance problems :D). You can fix it yourself, as it wasn't pushed with the pull request for some reason, by changing this line in the processor here => clean_g2p

It does make it sound better (but very fast), I remember reading your comment in #243 that you added the SIL just for testing purposes :)

Speed can be adjusted using the speed_ratios parameter.

The new params improved the model performance significantly (green line).
image

However, speech still has a lot of electronic noise/buzzing.

Speech: "Oak is strong and also gives shade. Cats and dogs each hate the other. The pipe began to rust while new. Open the crate but don't break the glass."

Do you have any suggestions for improving this?

I was considering switching my LibriTTS subset to train-clean-360, this way each of my speakers can have 25 minutes of speech, whereas with train-clean-100 some have as little as 12 minutes.

Can you share the mel-spectrogram figure of the above sentence?

@OscarVanL The mel looks good; I think you need to fine-tune the MB-MelGAN, it should improve things a lot.

@dathudeptrai

To go back to the original purpose of this thread, I aimed to clone a patient voice. Currently, if I pass in the speaker_id for my patient voice (trained amongst LibriTTS speakers), it does not sound like the patient, because it still has an American accent.

Your suggestion was to take a pre-trained FS2 model (which I have now trained), then fine-tune it to transfer the British accent (See your comments here and here). I also need to fine-tune Mb-Melgan.

When I tried this before, it didn't work because the LJSpeech model was trained on characters, not phonemes. Now I have a model trained on phonemes, so would like to try it again, but am unsure which layers I should fine-tune to transfer the accent. Should I do the same ones you suggested before?

When fine-tuning Mb-Melgan, do you have any suggestions? Should I fine-tune on the whole dataset, or just my patient's dataset?

Thank you once again! 😄

@OscarVanL The only layer that you should keep (not fine-tune) is the phoneme embedding; it represents the content of the input text. The other layers you can fine-tune. I believe that if you have general phoneme embeddings, you can easily transfer the model to your target voice.

About MB-MelGAN, you just need to put your patient's dataset into the clean dataset and train them together.

Ok I will try that now Thanks!

After that, maybe you should try hop_size 240 for 24k audio and try again. MFA uses a 10 ms frame shift to calculate durations, so the hop_size should be 240 to match exactly the durations extracted from MFA; if we use 300 or 256 then we have to round the durations, and the rounded durations are not precise :D.

About this, I remember the universal vocoder is trained with hopsize 300 https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/multiband_melgan

Should the FastSpeech2 model match this hop size if we want to use this vocoder? @dathudeptrai

@dathudeptrai

Hi, I was looking at the list of layers to ensure I include everything except the phoneme embeddings.

When you say to not fine-tune "phoneme embeddings", do you mean specifically to avoid the tf_fast_speech2/embeddings/charactor_embeddings/weight:0 layer, or all of the embeddings layers?

My plan was to use this:

var_train_expr: "speaker|embeddings/speaker_embeddings|embeddings/speaker_fc|encoder|decoder|mel_before|postnet|f0_predictor|energy_predictor|duration_predictor|f0_embeddings|energy_embeddings"

to match everything except tf_fast_speech2/embeddings/charactor_embeddings/weight:0

Layer names
tf_fast_speech2/embeddings/charactor_embeddings/weight:0
tf_fast_speech2/embeddings/speaker_embeddings/embeddings:0
tf_fast_speech2/embeddings/speaker_fc/kernel:0
tf_fast_speech2/embeddings/speaker_fc/bias:0
tf_fast_speech2/encoder/layer_._0/attention/self/query/kernel:0
tf_fast_speech2/encoder/layer_._0/attention/self/query/bias:0
tf_fast_speech2/encoder/layer_._0/attention/self/key/kernel:0
tf_fast_speech2/encoder/layer_._0/attention/self/key/bias:0
tf_fast_speech2/encoder/layer_._0/attention/self/value/kernel:0
tf_fast_speech2/encoder/layer_._0/attention/self/value/bias:0
tf_fast_speech2/encoder/layer_._0/attention/output/dense/kernel:0
tf_fast_speech2/encoder/layer_._0/attention/output/dense/bias:0
tf_fast_speech2/encoder/layer_._0/attention/output/LayerNorm/gamma:0
tf_fast_speech2/encoder/layer_._0/attention/output/LayerNorm/beta:0
tf_fast_speech2/encoder/layer_._0/intermediate/conv1d_1/kernel:0
tf_fast_speech2/encoder/layer_._0/intermediate/conv1d_1/bias:0
tf_fast_speech2/encoder/layer_._0/intermediate/conv1d_2/kernel:0
tf_fast_speech2/encoder/layer_._0/intermediate/conv1d_2/bias:0
tf_fast_speech2/encoder/layer_._0/output/LayerNorm/gamma:0
tf_fast_speech2/encoder/layer_._0/output/LayerNorm/beta:0
tf_fast_speech2/encoder/layer_._1/attention/self/query/kernel:0
tf_fast_speech2/encoder/layer_._1/attention/self/query/bias:0
tf_fast_speech2/encoder/layer_._1/attention/self/key/kernel:0
tf_fast_speech2/encoder/layer_._1/attention/self/key/bias:0
tf_fast_speech2/encoder/layer_._1/attention/self/value/kernel:0
tf_fast_speech2/encoder/layer_._1/attention/self/value/bias:0
tf_fast_speech2/encoder/layer_._1/attention/output/dense/kernel:0
tf_fast_speech2/encoder/layer_._1/attention/output/dense/bias:0
tf_fast_speech2/encoder/layer_._1/attention/output/LayerNorm/gamma:0
tf_fast_speech2/encoder/layer_._1/attention/output/LayerNorm/beta:0
tf_fast_speech2/encoder/layer_._1/intermediate/conv1d_1/kernel:0
tf_fast_speech2/encoder/layer_._1/intermediate/conv1d_1/bias:0
tf_fast_speech2/encoder/layer_._1/intermediate/conv1d_2/kernel:0
tf_fast_speech2/encoder/layer_._1/intermediate/conv1d_2/bias:0
tf_fast_speech2/encoder/layer_._1/output/LayerNorm/gamma:0
tf_fast_speech2/encoder/layer_._1/output/LayerNorm/beta:0
tf_fast_speech2/encoder/layer_._2/attention/self/query/kernel:0
tf_fast_speech2/encoder/layer_._2/attention/self/query/bias:0
tf_fast_speech2/encoder/layer_._2/attention/self/key/kernel:0
tf_fast_speech2/encoder/layer_._2/attention/self/key/bias:0
tf_fast_speech2/encoder/layer_._2/attention/self/value/kernel:0
tf_fast_speech2/encoder/layer_._2/attention/self/value/bias:0
tf_fast_speech2/encoder/layer_._2/attention/output/dense/kernel:0
tf_fast_speech2/encoder/layer_._2/attention/output/dense/bias:0
tf_fast_speech2/encoder/layer_._2/attention/output/LayerNorm/gamma:0
tf_fast_speech2/encoder/layer_._2/attention/output/LayerNorm/beta:0
tf_fast_speech2/encoder/layer_._2/intermediate/conv1d_1/kernel:0
tf_fast_speech2/encoder/layer_._2/intermediate/conv1d_1/bias:0
tf_fast_speech2/encoder/layer_._2/intermediate/conv1d_2/kernel:0
tf_fast_speech2/encoder/layer_._2/intermediate/conv1d_2/bias:0
tf_fast_speech2/encoder/layer_._2/output/LayerNorm/gamma:0
tf_fast_speech2/encoder/layer_._2/output/LayerNorm/beta:0
tf_fast_speech2/encoder/layer_._3/attention/self/query/kernel:0
tf_fast_speech2/encoder/layer_._3/attention/self/query/bias:0
tf_fast_speech2/encoder/layer_._3/attention/self/key/kernel:0
tf_fast_speech2/encoder/layer_._3/attention/self/key/bias:0
tf_fast_speech2/encoder/layer_._3/attention/self/value/kernel:0
tf_fast_speech2/encoder/layer_._3/attention/self/value/bias:0
tf_fast_speech2/encoder/layer_._3/attention/output/dense/kernel:0
tf_fast_speech2/encoder/layer_._3/attention/output/dense/bias:0
tf_fast_speech2/encoder/layer_._3/attention/output/LayerNorm/gamma:0
tf_fast_speech2/encoder/layer_._3/attention/output/LayerNorm/beta:0
tf_fast_speech2/encoder/layer_._3/intermediate/conv1d_1/kernel:0
tf_fast_speech2/encoder/layer_._3/intermediate/conv1d_1/bias:0
tf_fast_speech2/encoder/layer_._3/intermediate/conv1d_2/kernel:0
tf_fast_speech2/encoder/layer_._3/intermediate/conv1d_2/bias:0
tf_fast_speech2/encoder/layer_._3/output/LayerNorm/gamma:0
tf_fast_speech2/encoder/layer_._3/output/LayerNorm/beta:0
tf_fast_speech2/decoder/layer_._0/attention/self/query/kernel:0
tf_fast_speech2/decoder/layer_._0/attention/self/query/bias:0
tf_fast_speech2/decoder/layer_._0/attention/self/key/kernel:0
tf_fast_speech2/decoder/layer_._0/attention/self/key/bias:0
tf_fast_speech2/decoder/layer_._0/attention/self/value/kernel:0
tf_fast_speech2/decoder/layer_._0/attention/self/value/bias:0
tf_fast_speech2/decoder/layer_._0/attention/output/dense/kernel:0
tf_fast_speech2/decoder/layer_._0/attention/output/dense/bias:0
tf_fast_speech2/decoder/layer_._0/attention/output/LayerNorm/gamma:0
tf_fast_speech2/decoder/layer_._0/attention/output/LayerNorm/beta:0
tf_fast_speech2/decoder/layer_._0/intermediate/conv1d_1/kernel:0
tf_fast_speech2/decoder/layer_._0/intermediate/conv1d_1/bias:0
tf_fast_speech2/decoder/layer_._0/intermediate/conv1d_2/kernel:0
tf_fast_speech2/decoder/layer_._0/intermediate/conv1d_2/bias:0
tf_fast_speech2/decoder/layer_._0/output/LayerNorm/gamma:0
tf_fast_speech2/decoder/layer_._0/output/LayerNorm/beta:0
tf_fast_speech2/decoder/layer_._1/attention/self/query/kernel:0
tf_fast_speech2/decoder/layer_._1/attention/self/query/bias:0
tf_fast_speech2/decoder/layer_._1/attention/self/key/kernel:0
tf_fast_speech2/decoder/layer_._1/attention/self/key/bias:0
tf_fast_speech2/decoder/layer_._1/attention/self/value/kernel:0
tf_fast_speech2/decoder/layer_._1/attention/self/value/bias:0
tf_fast_speech2/decoder/layer_._1/attention/output/dense/kernel:0
tf_fast_speech2/decoder/layer_._1/attention/output/dense/bias:0
tf_fast_speech2/decoder/layer_._1/attention/output/LayerNorm/gamma:0
tf_fast_speech2/decoder/layer_._1/attention/output/LayerNorm/beta:0
tf_fast_speech2/decoder/layer_._1/intermediate/conv1d_1/kernel:0
tf_fast_speech2/decoder/layer_._1/intermediate/conv1d_1/bias:0
tf_fast_speech2/decoder/layer_._1/intermediate/conv1d_2/kernel:0
tf_fast_speech2/decoder/layer_._1/intermediate/conv1d_2/bias:0
tf_fast_speech2/decoder/layer_._1/output/LayerNorm/gamma:0
tf_fast_speech2/decoder/layer_._1/output/LayerNorm/beta:0
tf_fast_speech2/decoder/layer_._2/attention/self/query/kernel:0
tf_fast_speech2/decoder/layer_._2/attention/self/query/bias:0
tf_fast_speech2/decoder/layer_._2/attention/self/key/kernel:0
tf_fast_speech2/decoder/layer_._2/attention/self/key/bias:0
tf_fast_speech2/decoder/layer_._2/attention/self/value/kernel:0
tf_fast_speech2/decoder/layer_._2/attention/self/value/bias:0
tf_fast_speech2/decoder/layer_._2/attention/output/dense/kernel:0
tf_fast_speech2/decoder/layer_._2/attention/output/dense/bias:0
tf_fast_speech2/decoder/layer_._2/attention/output/LayerNorm/gamma:0
tf_fast_speech2/decoder/layer_._2/attention/output/LayerNorm/beta:0
tf_fast_speech2/decoder/layer_._2/intermediate/conv1d_1/kernel:0
tf_fast_speech2/decoder/layer_._2/intermediate/conv1d_1/bias:0
tf_fast_speech2/decoder/layer_._2/intermediate/conv1d_2/kernel:0
tf_fast_speech2/decoder/layer_._2/intermediate/conv1d_2/bias:0
tf_fast_speech2/decoder/layer_._2/output/LayerNorm/gamma:0
tf_fast_speech2/decoder/layer_._2/output/LayerNorm/beta:0
tf_fast_speech2/decoder/layer_._3/attention/self/query/kernel:0
tf_fast_speech2/decoder/layer_._3/attention/self/query/bias:0
tf_fast_speech2/decoder/layer_._3/attention/self/key/kernel:0
tf_fast_speech2/decoder/layer_._3/attention/self/key/bias:0
tf_fast_speech2/decoder/layer_._3/attention/self/value/kernel:0
tf_fast_speech2/decoder/layer_._3/attention/self/value/bias:0
tf_fast_speech2/decoder/layer_._3/attention/output/dense/kernel:0
tf_fast_speech2/decoder/layer_._3/attention/output/dense/bias:0
tf_fast_speech2/decoder/layer_._3/attention/output/LayerNorm/gamma:0
tf_fast_speech2/decoder/layer_._3/attention/output/LayerNorm/beta:0
tf_fast_speech2/decoder/layer_._3/intermediate/conv1d_1/kernel:0
tf_fast_speech2/decoder/layer_._3/intermediate/conv1d_1/bias:0
tf_fast_speech2/decoder/layer_._3/intermediate/conv1d_2/kernel:0
tf_fast_speech2/decoder/layer_._3/intermediate/conv1d_2/bias:0
tf_fast_speech2/decoder/layer_._3/output/LayerNorm/gamma:0
tf_fast_speech2/decoder/layer_._3/output/LayerNorm/beta:0
tf_fast_speech2/decoder/speaker_embeddings/embeddings:0
tf_fast_speech2/decoder/speaker_fc/kernel:0
tf_fast_speech2/decoder/speaker_fc/bias:0
tf_fast_speech2/mel_before/kernel:0
tf_fast_speech2/mel_before/bias:0
tf_fast_speech2/postnet/conv_._0/kernel:0
tf_fast_speech2/postnet/conv_._0/bias:0
tf_fast_speech2/postnet/batch_norm_._0/gamma:0
tf_fast_speech2/postnet/batch_norm_._0/beta:0
tf_fast_speech2/postnet/conv_._1/kernel:0
tf_fast_speech2/postnet/conv_._1/bias:0
tf_fast_speech2/postnet/batch_norm_._1/gamma:0
tf_fast_speech2/postnet/batch_norm_._1/beta:0
tf_fast_speech2/postnet/conv_._2/kernel:0
tf_fast_speech2/postnet/conv_._2/bias:0
tf_fast_speech2/postnet/batch_norm_._2/gamma:0
tf_fast_speech2/postnet/batch_norm_._2/beta:0
tf_fast_speech2/postnet/conv_._3/kernel:0
tf_fast_speech2/postnet/conv_._3/bias:0
tf_fast_speech2/postnet/batch_norm_._3/gamma:0
tf_fast_speech2/postnet/batch_norm_._3/beta:0
tf_fast_speech2/postnet/conv_._4/kernel:0
tf_fast_speech2/postnet/conv_._4/bias:0
tf_fast_speech2/postnet/batch_norm_._4/gamma:0
tf_fast_speech2/postnet/batch_norm_._4/beta:0
conv_._0/kernel:0
conv_._0/bias:0
LayerNorm_._0/gamma:0
LayerNorm_._0/beta:0
conv_._1/kernel:0
conv_._1/bias:0
LayerNorm_._1/gamma:0
LayerNorm_._1/beta:0
tf_fast_speech2/f0_predictor/dense_1/kernel:0
tf_fast_speech2/f0_predictor/dense_1/bias:0
tf_fast_speech2/f0_predictor/speaker_embeddings/embeddings:0
tf_fast_speech2/f0_predictor/speaker_fc/kernel:0
tf_fast_speech2/f0_predictor/speaker_fc/bias:0
conv_._0/kernel:0
conv_._0/bias:0
LayerNorm_._0/gamma:0
LayerNorm_._0/beta:0
conv_._1/kernel:0
conv_._1/bias:0
LayerNorm_._1/gamma:0
LayerNorm_._1/beta:0
tf_fast_speech2/energy_predictor/dense_2/kernel:0
tf_fast_speech2/energy_predictor/dense_2/bias:0
tf_fast_speech2/energy_predictor/speaker_embeddings/embeddings:0
tf_fast_speech2/energy_predictor/speaker_fc/kernel:0
tf_fast_speech2/energy_predictor/speaker_fc/bias:0
conv_._0/kernel:0
conv_._0/bias:0
LayerNorm_._0/gamma:0
LayerNorm_._0/beta:0
conv_._1/kernel:0
conv_._1/bias:0
LayerNorm_._1/gamma:0
LayerNorm_._1/beta:0
tf_fast_speech2/duration_predictor/dense_3/kernel:0
tf_fast_speech2/duration_predictor/dense_3/bias:0
tf_fast_speech2/duration_predictor/speaker_embeddings/embeddings:0
tf_fast_speech2/duration_predictor/speaker_fc/kernel:0
tf_fast_speech2/duration_predictor/speaker_fc/bias:0
tf_fast_speech2/f0_embeddings/kernel:0
tf_fast_speech2/f0_embeddings/bias:0
tf_fast_speech2/energy_embeddings/kernel:0
tf_fast_speech2/energy_embeddings/bias:0
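
As a quick sanity check (assuming the PR applies var_train_expr as a regex search over variable names, which is my assumption), the expression above should leave only the charactor_embeddings frozen:

    import re

    var_train_expr = ("speaker|embeddings/speaker_embeddings|embeddings/speaker_fc|encoder|decoder|"
                      "mel_before|postnet|f0_predictor|energy_predictor|duration_predictor|"
                      "f0_embeddings|energy_embeddings")
    pattern = re.compile(var_train_expr)

    # A few representative names from the list above
    # (in practice: [v.name for v in model.trainable_variables]).
    names = [
        "tf_fast_speech2/embeddings/charactor_embeddings/weight:0",
        "tf_fast_speech2/embeddings/speaker_embeddings/embeddings:0",
        "tf_fast_speech2/encoder/layer_._0/attention/self/query/kernel:0",
        "tf_fast_speech2/postnet/conv_._0/kernel:0",
    ]
    for n in names:
        print("train" if pattern.search(n) else "freeze", n)
    # Only the charactor_embeddings line should print "freeze".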

@ronggong

About this, I remember the universal vocoder is trained with hopsize 300: https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/multiband_melgan

Should the FastSpeech2 model match this hop size if we want to use this vocoder?

At first I trained FS2 with hop_size 240 as suggested, but because I was using my model with the pretrained universal vocoder (hop_size 300), it sounded really bad. After I re-trained FS2 with hop_size 300 it sounds much better (even if that is not optimal, as dathudeptrai says).

It does make it sound better (but very fast), I remember reading your comment in #243 that you added the SIL just for testing purposes :)

If you remove SIL from the text, then there is no pause between words, right? Also, in the MFA alignment preparation, the punctuation seems to be converted to SIL. So removing SIL means removing the pause between sentences. @machineko

@ronggong The processor adds SIL in between every word, so it would be transcribed as: Hello @SIL World @SIL. This made the speech sound very unnatural. You can see this in the picture in this comment: symbol 74 (SIL) is inserted between each word.

Stripping out the SIL between words improved the speech a lot. But you should still keep the SIL at punctuation.
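
For anyone following along, a hedged sketch of that phoneme cleanup (the "SIL" symbol name and the punctuation set are assumptions; check the mapper you are actually using):

    from g2p_en import G2p

    g2p = G2p()

    def text_to_phonemes(text):
        # g2p_en outputs phonemes with " " tokens between words and punctuation kept as-is.
        phones = []
        for p in g2p(text):
            if p == " ":
                continue              # no SIL between words
            if p in ",.!?;:":
                phones.append("SIL")  # keep a pause at punctuation
            else:
                phones.append(p)
        return phones

    print(text_to_phonemes("Hello world, this is a test."))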

As I wrote earlier, you need to add punctuation to the processor and map it to the SIL token, or just add SIL to the text yourself without using the processor method, as it wasn't intended to be used this way. I don't know why, but neither @dathudeptrai nor I remembered to change it before merging the pull request [as the comment says, it should be changed for inference] :)
I'll fix it this weekend and update this topic.

1-2 SIL tokens even for a long sentence work well enough, but I'll think about maybe adding some weighted average based on the training dataset to the mapper :)

I think that's why I could not get it right with the Python processor; thanks for the info. @machineko @OscarVanL

It's fixed in the new master.

I fine-tuned on my speaker dataset (32 mins) as described here.

The resultant voice does sound more like the speaker, and definitely sounds like it has a British accent 😄, but the quality is reduced.

  • The grey line is trained on 100 LibriTTS speakers. I took this at the 110k checkpoint for fine-tuning.
  • The orange line is fine-tuning.
  • The blue line is fine-tuning (same settings as orange), but I reduced the eval_interval_steps and save_interval_steps so I can test smaller step intervals

image

The train loss suggests the model overfits the small amount of data. Do you have any suggestions for avoiding this overfitting when fine-tuning on the small amount of data? How many steps should I expect to need?

Here are some inference clips.

"Oak is strong and also gives shade. Cats and dogs each hate the other. The pipe began to rust while new. Open the crate but don't break the glass."

1000 steps. wav, spectrogram

2000 steps. wav, spectrogram

3000 steps. wav, spectrogram

4000 steps. wav, spectrogram

And then it begins to overfit...

20000 steps. wav, spectrogram