Feature Request Thread
dathudeptrai opened this issue · 36 comments
Don't hesitate to tell me what features you want in this repo :)))
@dathudeptrai What do you think of voice cloning?
I would like to see better componentization. There are similar blocks (groups of layers) implemented multiple times, like positional encoding, speaker encoding or the postnet. Others rely on configuration specific to one particular network, like the self-attention block used in FastSpeech. With a little rework to make those blocks more generic, it would be easier to create new network types. The same goes for losses, e.g. training for hifigan contains a lot of code duplicated from mb-melgan. Moreover, most of the training and inference scripts look quite similar, and I believe they could be refactored too so that, once again, the final solution is composed from more generic components.
And BTW, I really appreciate your work and think you did a great job! :)
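To make the componentization point concrete, here is a minimal sketch (the class name and layout are hypothetical, not taken from the repo) of the kind of shared block meant above: a single sinusoidal positional-encoding layer that every model could import instead of re-implementing it.

```python
import numpy as np
import tensorflow as tf


class SinusoidalPositionalEncoding(tf.keras.layers.Layer):
    """One shared positional-encoding layer that any encoder/decoder could reuse."""

    def __init__(self, hidden_size, max_length=2048, **kwargs):
        super().__init__(**kwargs)
        position = np.arange(max_length)[:, None]
        div_term = np.exp(np.arange(0, hidden_size, 2) * -(np.log(10000.0) / hidden_size))
        table = np.zeros((max_length, hidden_size), dtype=np.float32)
        table[:, 0::2] = np.sin(position * div_term)
        table[:, 1::2] = np.cos(position * div_term)
        self.table = tf.constant(table[None, ...])  # [1, max_length, hidden_size]

    def call(self, x):
        # x: [batch, time, hidden_size]
        return x + self.table[:, : tf.shape(x)[1], :]
```

FastSpeech, Tacotron-style decoders and any future model could then share this one layer, and the same idea applies to the speaker-embedding and postnet blocks.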
training for hifigan contains a lot of code duplicated from mb-melgan
Hmm, in this case, users just need to read and understand hifigan without reading mb-melgan.
@unparalleled-ysj What do you mean by voice cloning? You mean zero-shot?
For example, given a short segment of the target speaker's voice, the model can synthesize speech in that speaker's timbre without being retrained, e.g. by using voiceprint technology to extract a speaker embedding and training a multi-speaker TTS model on it.
@unparalleled-ysj That's what I was thinking about. Relevantly, @dathudeptrai I saw https://github.com/dipjyoti92/SC-WaveRNN, could SC-MB-MelGAN be possible?
@unparalleled-ysj @ZDisket That is also what I'm doing. I'm trying to train a multi-speaker FastSpeech2 model, replacing the current hard-coded speaker ID with a bottleneck feature extracted by a voiceprint model. That continuous, soft-coded bottleneck feature represents a speaker-related space, so if an unknown voice is similar to a voice in the training space, voice cloning might be realized. But judging from the results of current open-source projects, it is a difficult problem and certainly not as simple as I described. Do you have any good ideas?
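For reference, one common way to wire in such a bottleneck feature is to project the external embedding and add it to the encoder states in place of the speaker-ID lookup table. The sketch below is only an illustration under that assumption (layer name and additive conditioning are my choices, not the code being discussed):

```python
import tensorflow as tf


class ExternalSpeakerConditioning(tf.keras.layers.Layer):
    """Project a pre-computed voiceprint (d-vector) and add it to encoder states."""

    def __init__(self, hidden_size, **kwargs):
        super().__init__(**kwargs)
        self.project = tf.keras.layers.Dense(hidden_size)

    def call(self, encoder_states, speaker_embedding):
        # encoder_states:    [batch, time, hidden_size]
        # speaker_embedding: [batch, emb_dim], e.g. from a speaker-verification model
        spk = self.project(speaker_embedding)[:, None, :]  # [batch, 1, hidden_size]
        return encoder_states + spk                        # broadcast over time
```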
One possible option for better support for multiple speakers or styles would be to add a Variational Auto-Encoder (VAE) which automatically extracts this voice/style "fingerprint".
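Roughly, that would be a small reference encoder with a VAE bottleneck over a mel spectrogram. The following is only a sketch of the idea (layer sizes and names are made up):

```python
import tensorflow as tf


class StyleVAE(tf.keras.Model):
    """Reference encoder that compresses a mel spectrogram into a latent style vector."""

    def __init__(self, latent_dim=16, **kwargs):
        super().__init__(**kwargs)
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Conv1D(128, 3, strides=2, activation="relu"),
            tf.keras.layers.Conv1D(128, 3, strides=2, activation="relu"),
            tf.keras.layers.GlobalAveragePooling1D(),
        ])
        self.to_mean = tf.keras.layers.Dense(latent_dim)
        self.to_logvar = tf.keras.layers.Dense(latent_dim)

    def call(self, reference_mel):
        # reference_mel: [batch, frames, n_mels]
        h = self.encoder(reference_mel)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        # Reparameterization trick; a KL term on (mean, logvar) goes into the loss.
        z = mean + tf.exp(0.5 * logvar) * tf.random.normal(tf.shape(mean))
        return z, mean, logvar


# z, mean, logvar = StyleVAE()(tf.zeros([1, 200, 80]))
```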
LightSpeech https://arxiv.org/abs/2102.04040
@abylouw early version of LightSpeech here https://github.com/nmfisher/TensorFlowTTS/tree/lightspeech
Training pretty well on a Mandarin dataset so far (~30k steps) but haven't validated formally against LJSpeech (to be honest, I don't think I'll get time, so would prefer someone else to help out).
This is just the final architecture mentioned in the paper (so I haven't implemented any NAS).
Also the paper only mentioned the final per-layer SeparableConvolution kernel sizes, not the number of attention heads, so I've emailed one of the authors to ask if he can provide that too.
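For anyone curious what the substitution looks like, here is a simplified sketch of the convolution sub-block with SeparableConv1D swapped in (the real FFT block also has self-attention, dropout and residual connections, which are omitted here):

```python
import tensorflow as tf


def lightspeech_conv_block(hidden_size, kernel_size):
    """FastSpeech-style conv sub-block with SeparableConv1D in place of Conv1D.

    Per-layer kernel sizes come from the LightSpeech paper's final architecture;
    the number of attention heads per layer is not published.
    """
    return tf.keras.Sequential([
        tf.keras.layers.SeparableConv1D(hidden_size, kernel_size, padding="same",
                                        activation=tf.nn.relu),
        tf.keras.layers.LayerNormalization(),
        tf.keras.layers.SeparableConv1D(hidden_size, kernel_size, padding="same"),
        tf.keras.layers.LayerNormalization(),
    ])
```

Depthwise-separable convolutions factor a k-wide convolution into a depthwise pass plus a pointwise 1x1 projection, which is where most of the parameter savings come from.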
Some samples @ 170k (decoded with pre-trained MB-MelGan):
https://github.com/nmfisher/lightspeech_samples/tree/main/v1_170k
Noticeably worse quality than FastSpeech 2 at the same number of training steps, and it's falling apart on longer sequences.
great! :D How about the number of parameters in LightSpeech?
My early version of LightSpeech comes in at roughly 6M parameters; by comparison, FastSpeech 2 (v1) is considerably larger.
But given the paper claims 1.8M parameters for LightSpeech (vs 27M for FastSpeech 2), my implementation obviously still isn't 100% accurate. Feedback from the authors will help clarify the number of attention heads (and also the hidden size of each head).
Also I think the paper didn't implement PostNet, so removing that layer immediately eliminates ~4.3M parameters.
@dathudeptrai @nmfisher I also tried to reduce the model size of FastSpeech2 (not touching the PostNet module), shrinking parameters in this order of impact: encoder dim > 1D CNN > attention = number of stacks. Reducing the encoder dim is the most effective way to shrink the model. Starting from the fastspeech2.baker.v2.yaml config, the model size dropped from 64M to 28M, and PostNet's share of the total model size grew from 27% to 62%. Interestingly, on the Baker dataset the quality does not get worse after deleting PostNet at inference time, so the final model size is only 10M. Based on these experiments, the model size may have the potential to be reduced further.
yeah, PostNet is only there for faster convergence; we can drop it after training.
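For anyone who wants to try this, here is a sketch of skipping PostNet at inference time. Model names and argument names follow the repo's README examples and may differ between versions; the key point is simply to hand the pre-PostNet spectrogram to the vocoder.

```python
import tensorflow as tf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

input_ids = processor.text_to_sequence("Bigger is not always better.")
mel_before, mel_after, duration, f0, energy = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# Feed the *pre*-PostNet mel to the vocoder; a model exported without PostNet
# then drops those weights entirely (reportedly with no quality loss on Baker).
audio = mb_melgan.inference(mel_before)[0, :, 0]
```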
@nmfisher 6M params is small enough. Did you get a good result with LightSpeech? How fast is it?
I'm sorry that I haven't studied LightSpeech in detail, and I have a question: what are the detailed differences between a small-size FastSpeech and LightSpeech? @nmfisher
@luan78zaoha LightSpeech uses SeparableConvolution :D.
@dathudeptrai I used TF-Lite for inference on an x86 Linux platform. The RTF of the 45M and 10M models was 0.018 and 0.01 respectively, i.e. roughly 55.6x and 98x faster than real time.
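For context, an RTF figure like that can be measured with the TFLite interpreter roughly as below. The model path, hop size and sample rate are placeholders, the dummy inputs stand in for real phoneme IDs, and the first output is assumed to be the mel spectrogram:

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="fastspeech2.tflite")
interpreter.allocate_tensors()

# Fill every input with dummy data of the right dtype/shape (use real inputs
# for a meaningful benchmark).
for detail in interpreter.get_input_details():
    shape = [d if d > 0 else 1 for d in detail["shape"]]
    interpreter.set_tensor(detail["index"], np.ones(shape, dtype=detail["dtype"]))

start = time.perf_counter()
interpreter.invoke()
elapsed = time.perf_counter() - start

mel = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
hop_size, sample_rate = 256, 22050            # adjust to the feature config used
audio_seconds = mel.shape[1] * hop_size / sample_rate
print(f"RTF = {elapsed / audio_seconds:.3f}")  # e.g. 0.018 ~ 55x faster than real time
```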
Let's wait for @luan78zaoha to report the LightSpeech RTF :D.
As @dathudeptrai mentioned, LightSpeech uses SeparableConvolution in place of regular Convolution, but then also passes various FastSpeech2 configurations through neural architecture search to determine the best configuration of kernel sizes/attention heads/attention dimensions. Basically they use NAS to find the smallest configuration that performs as well as FastSpeech2.
@dathudeptrai @xuefeng Can you help me implement HiFi-GAN with FastSpeech2 on Android? I have tried to implement it using the pretrained models from https://github.com/tulasiram58827/TTS_TFLite/tree/main/models and changing the line
Not really a request but just wondering about the use of Librosa.
I have been playing around with https://github.com/google-research/google-research/tree/master/kws_streaming, which uses internal methods for MFCC.
The one it uses is the python.ops one, but tf.signal was also quite a performance boost over using librosa.
Is there any reason to prefer librosa over, say, tf.signal.stft and tf.signal.linear_to_mel_weight_matrix, as they seem extremely performant?
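For comparison, the tf.signal path looks roughly like this; the frame/hop/mel parameters below are illustrative, not the repo's preprocessing config:

```python
import tensorflow as tf


def mel_spectrogram(audio, sample_rate=22050, frame_length=1024, frame_step=256,
                    fft_length=1024, num_mel_bins=80, fmin=80.0, fmax=7600.0):
    """Log-mel features computed entirely with tf.signal ops."""
    stft = tf.signal.stft(audio, frame_length=frame_length,
                          frame_step=frame_step, fft_length=fft_length)
    magnitude = tf.abs(stft)                        # [frames, fft_length // 2 + 1]
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=fft_length // 2 + 1,
        sample_rate=sample_rate,
        lower_edge_hertz=fmin,
        upper_edge_hertz=fmax,
    )
    mel = tf.matmul(magnitude, mel_matrix)
    return tf.math.log(mel + 1e-6)                  # [frames, num_mel_bins]


# Example: one second of silence -> [frames, 80] log-mel features.
features = mel_spectrogram(tf.zeros([22050], dtype=tf.float32))
```

Because everything is a graph op, this also runs on GPU and can be exported inside a SavedModel, which is where most of the speedup over librosa comes from.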
I have no doubt this project would work wonders on Voice cloning.
With the FastSpeech TFLite model, is it possible to convert it to run on an Edge TPU?
If so, any examples of how?
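The usual prerequisite is post-training full-integer quantization with a representative dataset, after which the resulting .tflite file is typically passed through the edgetpu_compiler CLI. A hedged sketch follows; the saved-model path and the calibration inputs are placeholders, and the yielded list must match the model's actual inputs:

```python
import numpy as np
import tensorflow as tf


def representative_dataset():
    # Dummy phoneme-ID sequences; real calibration should use real inputs,
    # one array per model input, in the model's input order.
    for _ in range(100):
        yield [np.random.randint(1, 100, size=(1, 64), dtype=np.int32)]


converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

Whether the Edge TPU compiler then maps all ops to the TPU (rather than falling back to CPU) depends on the ops in the exported graph.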
Will Tacotron2 support full integer quantization in TFLite?
Full integer quantization of the current model fails with "pybind11::init(): factory function returned nullptr." It's likely because the model has multiple subgraphs.
@dathudeptrai Can you help with implementing a forced-alignment attention loss for Tacotron2, like in this paper? I've managed to turn MFA durations into alignments and put them in the dataloader, but replacing the regular guided attention loss with it only makes attention learning worse, both when finetuning and when training from scratch, according to eval results after 1k steps, whereas in the paper the PAG variant should be winning.
@ZDisket let me read the paper first :D.
@dathudeptrai Since that post I discovered that an MAE loss between the generated and forced attention does work to guide it, but it's so strong that it ends up hurting performance. That could be fixed with a low enough multiplier like 0.01, although I haven't tested it extensively, as I abandoned it in favor of training a universal vocoder with a trick.
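For the record, the low-weight MAE guidance described above amounts to something like the sketch below; the 0.01 multiplier comes from the comment, while the tensor shapes and names are assumptions:

```python
import tensorflow as tf


def forced_alignment_loss(alignments, forced_alignments, weight=0.01):
    """MAE between predicted and MFA-derived attention matrices.

    alignments / forced_alignments: [batch, decoder_steps, encoder_steps],
    where the forced matrix is built by expanding MFA durations into a hard
    alignment. The small weight keeps it from dominating the mel loss.
    """
    mae = tf.reduce_mean(tf.abs(alignments - forced_alignments))
    return weight * mae


# total_loss = mel_loss + stop_token_loss + forced_alignment_loss(align, forced_align)
```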
This looks really interesting:
@tts-nlp That looks like an implementation of Algorithm 1. For the second and third, they mention a shift time transform: in most implementations, the shift time transform is obtained by applying a convolution after building a DFT matrix or Fourier basis matrix, and OLA (overlap-add) is applied to obtain the inverse transform.
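As an illustration of that description (not the paper's actual code), an STFT can be expressed as a 1-D convolution with a Fourier basis and crudely inverted with overlap-add; window handling and normalization are deliberately omitted, so this is a sketch rather than a drop-in STFT replacement:

```python
import numpy as np
import tensorflow as tf


def fourier_basis(n_fft):
    basis = np.fft.fft(np.eye(n_fft))                             # DFT matrix
    kernels = np.concatenate([basis.real, basis.imag], axis=0)    # [2*n_fft, n_fft]
    return tf.constant(kernels.T[:, None, :], dtype=tf.float32)   # [n_fft, 1, 2*n_fft]


def conv_stft(audio, n_fft=1024, hop=256):
    # audio: [batch, samples] -> real/imag coefficients per frame via conv1d.
    x = audio[:, :, None]                                          # [batch, samples, 1]
    return tf.nn.conv1d(x, fourier_basis(n_fft), stride=hop, padding="SAME")


def ola_istft(frames, n_fft=1024, hop=256):
    # frames: [batch, n_frames, 2*n_fft] -> project back to samples, then OLA.
    basis = np.fft.fft(np.eye(n_fft))
    analysis = np.concatenate([basis.real, basis.imag], axis=0)    # [2*n_fft, n_fft]
    synthesis = tf.constant(np.linalg.pinv(analysis), dtype=tf.float32)
    recon = tf.matmul(frames, synthesis, transpose_b=True)         # [batch, n_frames, n_fft]
    return tf.signal.overlap_and_add(recon, frame_step=hop)


# frames = conv_stft(tf.zeros([1, 22050])); audio = ola_istft(frames)
```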
Hey, I've seen a project about voice cloning recently.
This looks really interesting:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Anybody working on VQTTS?
I have tried FastSpeech2 voice cloning based on AISHELL-3 and other data, 200 speakers in total, but it didn't work well. Maybe I couldn't train a good speaker embedding model, so I then used a pretrained wenet/wespeaker model (Chinese) to extract the speaker embedding vector, but that also works badly. Has anyone tried it?
Also, the TensorFlowTTS project is not very active; it has not been updated for more than a year.
I've just been looking at wenet but haven't really made an appraisal yet; so far it seems 'very kaldi' :)