Pre-train model results
ivancarapinha opened this issue · 3 comments
Hello,
I trained the pre-train model with the following specs:
- mel_mean_std and spec_mean_std were used for feature normalization, and phonemes were obtained by running the script extract_features.py;
- 99 speakers were used, and train/evaluation/test sets were created according to the paper (10 utterances per speaker in the evaluation set, 20 utterances per speaker in the test set, and the rest in the training set);
- learning rate decay of 0.95 every 1000 steps (see the sketch after this list);
- batch size = 32.
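For clarity, this is a minimal PyTorch sketch of the decay schedule I used; the Linear module and the 0.001 initial learning rate are placeholders/assumptions, not the repository's actual training code:

```python
# Minimal sketch of the schedule above: multiply the learning rate by 0.95
# every 1000 training steps (placeholder module and initial lr, not the
# repository's actual training code).
import torch

model = torch.nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.95)

for step in range(200000):
    # ... forward pass and loss.backward() omitted ...
    optimizer.step()
    scheduler.step()   # lr is multiplied by 0.95 once every 1000 steps
```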
I obtained intelligible but poor results in terms of voice conversion and overall quality. Also, I noticed that the generated VC speech seems slower than the original source utterances. In addition, many of the generated samples (typically containing 2-4 seconds of speech) have long stretches of silence, sometimes more than 20 seconds. I attach some samples (after 200k training steps) together with the source utterances below. What could explain these problems?
samples_checkpoint_200000.zip
Additionally, I would like to ask if the following issues could be some of the reasons for these bad results:
- Since I ran the extract_features.py file, mel_mean_std and spec_mean_std were obtained from all 109 speakers in the dataset, but I use only 99 speakers. Should I compute mel_mean_std and spec_mean_std for only the 99 speakers I use? Furthermore, should mel_mean_std and spec_mean_std be obtained only from data in the training set?
- Also, I plotted the speaker embeddings for some utterances in the training set (10 speakers, 12 utterances per speaker), and although the clusters seem quite good, they are not linearly separable in terms of speaker gender (male/female), as the paper suggests. In the plot below, triangles represent female speakers and circles represent male speakers.
Thank you very much
Hi,
For the mean_std files: theoretically, they should be estimated using only the training data. However, I don't think it will lead to bad results if you use all of the data.
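For example, a minimal sketch of estimating mel_mean_std from the training-set features only; the file layout and .npy format here are assumptions, not this repository's exact preprocessing:

```python
# Minimal sketch: per-dimension mean/std over training-set mel features only.
# File layout and .npy format are assumptions, not this repo's exact pipeline.
import glob
import numpy as np

train_files = glob.glob("training_set/*/mels/*.npy")               # hypothetical paths
frames = np.concatenate([np.load(f) for f in train_files], axis=0)  # (total_frames, n_mels)

mel_mean_std = np.stack([frames.mean(axis=0), frames.std(axis=0)])
np.save("mel_mean_std.npy", mel_mean_std)
```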
For the speaker embeddings, I believe they look good enough.
I suppose the reason is that the learning rate decays too quickly and the model does not get trained well enough. You can try keeping the learning rate at 0.001 for the first 70 epochs and then decaying it.
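For instance, a minimal sketch of that kind of schedule; the 0.95-per-epoch decay factor after epoch 70 and the placeholder module are assumptions:

```python
# Minimal sketch: keep lr at 0.001 for the first 70 epochs, then decay it.
# The 0.95-per-epoch decay factor and the placeholder module are assumptions.
import torch

model = torch.nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def lr_factor(epoch):
    # constant lr for the first 70 epochs, exponential decay afterwards
    return 1.0 if epoch < 70 else 0.95 ** (epoch - 70)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(200):
    # ... one epoch of training ...
    scheduler.step()
```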
For the starting pauses, you can trim the leading/trailing silence when preparing the training data.
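For example, silence can be trimmed with librosa; the 25 dB threshold and file names below are assumptions, not values from this repo:

```python
# Minimal sketch: trim leading/trailing silence from a training wav with librosa.
# The 25 dB threshold and file names are assumptions, not values from this repo.
import librosa
import soundfile as sf

wav, sr = librosa.load("p225_001.wav", sr=16000)
trimmed, _ = librosa.effects.trim(wav, top_db=25)
sf.write("p225_001_trimmed.wav", trimmed, sr)
```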
Hello again @jxzhanggg,
Thank you for your reply. I followed your suggestions, and although I noticed a subtle improvement in intelligibility and alignment (there are fewer silences now and the speaking rate sounds a bit more natural), the voice quality did not seem to change at all, as you can verify in these samples from checkpoints 56k and 98k.
VC_samples.zip
Do you think the learning rate schedule is the issue here? By the way, I am getting warnings when I run the program, but all of them are deprecation warnings related to the versions of PyTorch, TensorFlow, and NumPy, so I don't think that is the problem. I also checked the mel-spectrograms generated at the inference stage and they look fine, so I really don't know why the voice conversion performs so poorly with the code I am running.
Could you please specify exactly what steps you took during pre-training to achieve your results?
Thank you
Hello (once again) :)
I think I discovered the problem with the generated .wav files. It turns out that librosa.load automatically resamples the .wav file to 22.05 kHz (its default sampling rate), while in inference.py the sampling rate is defined as 16 kHz:
nonparaSeq2seqVC_code/pre-train/inference.py
Line 105 in e2fe195
This mismatch was causing severe distortion in the generated audio files, so we should choose sr=22050 in this case. I suggest you update this piece of code, as it could save time and stress for other users who might face the same problem.
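For anyone hitting the same issue, this is roughly the mismatch; the write call below is only an illustration, not the exact code at that line:

```python
# Illustrative sketch of the sampling-rate mismatch described above
# (not the exact code from inference.py line 105).
import librosa
import soundfile as sf

wav, sr = librosa.load("source.wav")   # librosa resamples to 22050 Hz by default
# Writing the 22.05 kHz samples with a hard-coded 16 kHz rate slows the audio
# down and distorts it:
# sf.write("converted.wav", wav, 16000)
# Keep the rates consistent instead ...
sf.write("converted.wav", wav, sr)     # sr == 22050 here
# ... or load at the rate the rest of the pipeline expects:
# wav, sr = librosa.load("source.wav", sr=16000)
```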
Cheers