jxzhanggg/nonparaSeq2seqVC_code

Pre-train model results

ivancarapinha opened this issue · 3 comments

Hello,

I trained the pre-train model with the following specs:

  • mel_mean_std, spec_mean_std for feature normalization, and phonemes were obtained by running the script extract_features.py;
  • 99 speakers were used, and train/evaluation/test sets were created according to the paper (10 utterances per speaker in eval_set, 20 utterances per speaker in the test set and the rest in the training set);
  • learning rate decay of 0.95 every 1000 steps (sketched just after this list);
  • batch size = 32.
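
For reference, the decay schedule above corresponds to something like this rough PyTorch sketch (the model is just a stand-in, not the repo's actual training loop; only the scheduler settings matter):

import torch

model = torch.nn.Linear(80, 80)  # stand-in module, only the scheduler settings matter
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.95)

for step in range(200000):
    optimizer.step()      # placeholder for the real per-batch update
    scheduler.step()      # lr is multiplied by 0.95 once every 1000 steps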

I obtained intelligible but generally poor results in terms of voice conversion and audio quality. I also noticed that the generated VC speech seems slower than the original source utterances. In addition, many of the generated samples (typically 2-4 seconds of speech) contain long stretches of silence, sometimes more than 20 seconds. I have attached some samples (200k steps of training) and the corresponding source utterances below. What could explain these problems?
samples_checkpoint_200000.zip

Additionally, I would like to ask if the following issues could be some of the reasons for these bad results:

  • Since I ran the extract_features.py script on the whole data set, mel_mean_std and spec_mean_std were computed from all 109 speakers, but I only use 99 of them. Should I compute mel_mean_std and spec_mean_std from just those 99 speakers? Furthermore, should mel_mean_std and spec_mean_std be computed only from data in the training set (see the sketch after this list)?
  • Also, I plotted the speaker embeddings for some utterances in the training set (10 speakers, 12 utterances per speaker), and although the clusters look quite good, they are not linearly separable by speaker gender (male/female), as the paper suggests they should be. In the plot below, triangles represent female speakers and circles represent male speakers.
    speaker_embeddings_plot_checkpoint_200000
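
Regarding the first point, this is the kind of computation I mean, restricted to the training split (file names and paths below are hypothetical, not the repo's actual layout):

import numpy as np

# hypothetical list of mel-spectrogram files from the training split only,
# each stored as an [n_frames, 80] array
train_mel_files = ["mels/p225_003.npy", "mels/p226_001.npy"]

frames = np.concatenate([np.load(f) for f in train_mel_files], axis=0)
mel_mean = frames.mean(axis=0)   # per-dimension mean over training frames
mel_std = frames.std(axis=0)     # per-dimension std over training frames
np.save("mel_mean_std.npy", np.stack([mel_mean, mel_std]))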

Thank you very much

Hi,
For the mean_std files: theoretically, they should be estimated using only the training data. However, I don't think estimating them on all the data will lead to bad results.
As for the speaker embeddings, your plot looks good enough to me.
I suspect the actual problem is that the learning rate decays too fast, so the model doesn't get well trained. You can try keeping the learning rate at 0.001 for the first 70 epochs and then decaying it.
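Something like the following rough PyTorch sketch (not the repo's actual training code; the model is a stand-in and the per-epoch decay factor is only an example):

import torch

model = torch.nn.Linear(80, 80)  # stand-in module, only the schedule matters here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# keep the lr at 1e-3 for the first 70 epochs, then decay it (0.95/epoch is just an example)
lr_lambda = lambda epoch: 1.0 if epoch < 70 else 0.95 ** (epoch - 70)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(200):
    # ... run one training epoch here ...
    optimizer.step()      # placeholder for the real per-batch updates
    scheduler.step()      # advance the schedule once per epoch
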
As for the starting pauses, you can trim the leading/trailing silence when preparing the training data.
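For example, with librosa (a minimal sketch; the file name, the 16 kHz rate, and the top_db threshold are assumptions to adjust to your setup):

import librosa
import soundfile as sf

y, sr = librosa.load("p225_001.wav", sr=16000)      # hypothetical training file
y_trimmed, _ = librosa.effects.trim(y, top_db=25)   # drop leading/trailing silence
sf.write("p225_001_trimmed.wav", y_trimmed, sr)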

Hello again @jxzhanggg,
Thank you for your reply. I followed your suggestions, and although I noticed a subtle improvement in intelligibility and alignment (there are fewer silences now and the speaking rate sounds a bit more natural), the voice quality did not seem to change at all, as you can verify in these samples from checkpoints 56k and 98k.
VC_samples.zip

Do you think the learning-rate schedule is the issue here? By the way, I do get warnings when running the code, but they are all deprecation warnings related to the PyTorch, TensorFlow, and NumPy versions, so I don't think they are the problem. I also checked the mel-spectrograms generated at the inference stage and they look fine, so I really don't know why voice conversion performs so poorly with the code I run.

Could you please specify exactly what steps you took during pre-train to achieve your results?
Thank you

Hello (once again) :)

I think I found the problem with the generated .wav files. It turns out that librosa.load automatically resamples the .wav files to 22.05 kHz by default, while in inference.py, in the lines:

filters = librosa.filters.mel(sr=16000, n_fft=2048, n_mels=80)

scipy.io.wavfile.write(wav_path, 16000, y)

the sampling rate is defined as 16 kHz. This mismatch was causing severe distortion in the generated audio files, so we should use sr=22050 in this case. I suggest updating this piece of code, as it could save time and stress for other users who run into the same problem.
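
Concretely, something along these lines keeps the whole inference path at a single rate (a sketch, not the exact code from inference.py; file names are illustrative):

import librosa
import scipy.io.wavfile

SR = 22050  # use one rate everywhere; librosa.load resamples to 22050 by default

y, sr = librosa.load("source.wav", sr=SR)  # or sr=None to keep the file's native rate
filters = librosa.filters.mel(sr=SR, n_fft=2048, n_mels=80)

# ... run the model / vocoder here to obtain the output waveform y_out ...
y_out = y  # placeholder so the sketch stays self-contained
scipy.io.wavfile.write("converted.wav", SR, y_out)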

Cheers