Is the GRU really needed to predict mu_t?
hbredin opened this issue · 7 comments
I spent some time trying to figure out what the GRU really does.
My understanding is that it is used to estimate the running mean (mu_t in the paper) of each cluster.
I can see the benefit of an RNN for this (it can learn to ignore noisy samples), but I am wondering whether you had the chance to compare it to an actual running mean.
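For reference, the baseline I have in mind is just the cumulative mean of the embeddings seen so far for a cluster. A minimal sketch (toy data, function name is mine):

```python
import numpy as np

def running_mean(xs):
    """Plain cumulative running mean: m_t = (1/t) * sum of x_1..x_t."""
    xs = np.asarray(xs, dtype=float)
    counts = np.arange(1, len(xs) + 1)[:, None]
    return np.cumsum(xs, axis=0) / counts

# Toy 2-D "embeddings" for one speaker
x = np.array([[1.0, 0.0], [3.0, 0.0], [2.0, 3.0]])
m = running_mean(x)  # m[0] = [1, 0]; m[1] = [2, 0]; m[2] = [2, 1]
```

No learned parameters at all, which is exactly why it would make a clean point of comparison against the GRU.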
@AnzCol I think what @hbredin means is: what if we simply define m_t = x_t, will it still work? Did we have such experiments (my impression is no)?
Personally I don't think it's going to work well.
My understanding is that (@AnzCol please correct me if I'm wrong) the training process forces m_t for each speaker to better fall into a normal distribution, but this is not guaranteed for the distribution of x_t. The power of the GRU here is to transform the distribution of speaker embeddings into a more clusterable one, by learning from the training dataset.
@hbredin Does this explanation make sense to you?
> @AnzCol I think what @hbredin means is: what if we simply define m_t = x_t, will it still work? Did we have such experiments (my impression is no)?
This is what I meant, indeed.
> My understanding is that (@AnzCol please correct me if I'm wrong) the training process forces m_t for each speaker to better fall into a normal distribution, but this is not guaranteed for the distribution of x_t. The power of the GRU here is to transform the distribution of speaker embeddings into a more clusterable one, by learning from the training dataset.
Except you are still using the raw x_t in Equation 11, so the distribution of speaker embeddings is not changed. Or did I miss something?
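To make my point concrete, here is a toy sketch (my own naming, and assuming the emission in Equation 11 has the form N(x_t | mu_t, sigma^2 I)) scoring a new raw embedding against two candidate means, the previous embedding (the m_t = x_t ablation) and the running mean of the history:

```python
import numpy as np

def gaussian_loglik(x, mu, sigma2=1.0):
    """log N(x | mu, sigma2 * I), the assumed form of the emission."""
    d = x.shape[-1]
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi * sigma2) + diff @ diff / sigma2)

# Toy sequence of 2-D embeddings for one speaker.
xs = np.array([[1.0, 0.0], [3.0, 0.0], [2.0, 3.0]])

for t in range(1, len(xs)):
    last = xs[t - 1]             # candidate mean (a): m_t = x_t ablation
    rmean = xs[:t].mean(axis=0)  # candidate mean (b): running mean
    print(t, gaussian_loglik(xs[t], last), gaussian_loglik(xs[t], rmean))
```

Either way, the likelihood is evaluated on the raw x_t, which is why I don't see how the GRU could reshape the embedding distribution itself.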
> @hbredin Does this explanation make sense to you?
Not quite sure -- I think I have to think a bit more about this...
I would really like to see an ablation study with m_t = x_t :-)