jxzhanggg/nonparaSeq2seqVC_code

Speaker and linguistic embedding visualizations do not look as good as in the paper

huukim136 opened this issue · 10 comments

Hi @jxzhanggg ,

I trained your model and the converted speech samples sound promising (I attached some of them below).
Then I tried to visualize the speaker and linguistic embeddings. However, they do not look as cleanly overlapped as in the paper, and there are still some outliers lying where they should not be (you can see this in the figures below).
(figure: speaker and linguistic embedding visualizations)
So I'm wondering whether this is due to poorly chosen parameters for the t-SNE visualization (e.g. perplexity, number of iterations, learning_rate) or something else.

Could you give me some comments on that?
Thank you!

samples.zip

Hi, it looks good! For reference, I used a perplexity of 12 and 1000 iterations for t-SNE.
I think the quality of the clustering is affected both by the t-SNE hyper-parameters and by
randomness in the model training process. Try lowering the learning rate towards the end of training; I think the network will converge better.
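Something along these lines should reproduce the plot; this is just a minimal scikit-learn sketch, and the embeddings/speaker_ids arrays are placeholders rather than code from this repo:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: (N, D) embeddings collected from the model and (N,) integer
# speaker labels. These file names are not part of the repository.
embeddings = np.load("speaker_embeddings.npy")
speaker_ids = np.load("speaker_ids.npy")

# Perplexity 12 and 1000 iterations, as mentioned above
# (the n_iter argument is called max_iter in recent scikit-learn versions).
tsne = TSNE(n_components=2, perplexity=12, n_iter=1000, random_state=0)
points = tsne.fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], c=speaker_ids, cmap="tab20", s=5)
plt.title("t-SNE of speaker embeddings")
plt.savefig("speaker_embedding_tsne.png")
```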

Thank you so much!

Hi, it seems you have reproduced the results. What other preprocessing did you do? @huukim136

What other preprocessing did you do?

Hi @youngsuenXMLY ,
I did nothing except normalize the mel features, as the author recommended.
Also, remember to reduce the learning rate gradually as you train the model, and you'll get good results.

Hello @huukim136, @jxzhanggg.
At what pace should the learning rate be reduced? Also, when you say "normalizing the mel features", are you referring to the normalization of mel-spectrograms in the file extract_features.py, by setting norm=1?

mel_spectrogram = librosa.feature.melspectrogram(S=spec,

Thank you very much

@ivancarapinha the normalization is (x - x_mean) / x_std_var, where x_mean is the global mean and x_std_var is the global standard deviation.
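Concretely, something like this (a sketch only; the file layout and names are placeholders, assuming per-utterance mel features saved as NumPy arrays):

```python
import numpy as np
from glob import glob

# Placeholder path pattern: per-utterance mel-spectrograms of shape (T_i, n_mels).
mels = [np.load(f) for f in glob("mels/*.npy")]

frames = np.concatenate(mels, axis=0)   # stack all frames: (sum of T_i, n_mels)
mel_mean = frames.mean(axis=0)          # global mean per mel bin
mel_std = frames.std(axis=0)            # global standard deviation per mel bin

normalized = [(m - mel_mean) / mel_std for m in mels]
```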
For the learning rate, I reduce the lr by a factor alpha = 0.95: lr = lr * alpha whenever training_steps % 1000 == 0.
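In PyTorch that schedule can be written roughly as follows; the model and optimizer here are toy placeholders, not the repo's training code:

```python
import torch

# Toy model/optimizer just to keep the snippet self-contained.
model = torch.nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Decay the learning rate by alpha = 0.95 once every 1000 steps,
# i.e. lr = lr * 0.95 whenever training_steps % 1000 == 0.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.95)

for training_steps in range(10000):
    # ... forward pass, loss.backward(), optimizer.step() would go here ...
    scheduler.step()
```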

@youngsuenXMLY, what data did you use to compute the global mean and global standard deviation? Did you use all mel-spectrograms / spectrograms from the 99 speakers, or only the ones in the training set? Is it necessary to trim leading and trailing silence?

Thank you.

@ivancarapinha

  1. I use all the data (from the 99 speakers) to compute the global mean and standard deviation.
  2. I use librosa to trim the silent parts.
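The trimming itself is just librosa.effects.trim; for example (the file name, sample rate, and top_db threshold here are illustrative assumptions, not values from the repo):

```python
import librosa

# Strip leading and trailing silence before extracting mel features.
wav, sr = librosa.load("p225_001.wav", sr=16000)   # placeholder VCTK-style file name
trimmed, _ = librosa.effects.trim(wav, top_db=25)  # top_db is an assumed threshold
```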

Hi @huukim136, I am trying to visualize the speaker and linguistic embeddings as you did. For the linguistic embedding, we want to use text_hidden and audio_seq2seq_hidden as input; is that what you used for your second figure? But each sentence has a different number of phonemes, so the outputs have different sizes, and the t-SNE algorithm seems to require inputs of a uniform size. Did you do some kind of normalization for this as well?

Thank you!

the outputs have different sizes

Yes, exactly: in the second figure I use audio_seq2seq_hidden as input. For example, if audio_seq2seq_hidden has the shape (L, 512), I take the mean over all L steps to obtain a single (1, 512) vector.
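In other words, something like this (dummy data just to illustrate the pooling; the variable names are not from the repo):

```python
import numpy as np

# audio_seq2seq_hidden for one utterance: (L, 512), where L varies per sentence.
hidden = np.random.randn(37, 512).astype(np.float32)  # dummy values for illustration

# Average over the L time steps so that every utterance becomes one (1, 512)
# row, giving t-SNE a uniform-size input across utterances.
utterance_vec = hidden.mean(axis=0, keepdims=True)     # shape (1, 512)
```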