BUTSpeechFIT/VBx

twoGMMcalib_lin's theory

xiaosujie opened this issue · 3 comments

Hello, I want to know why this method is used to compute the threshold for AHC. Is there any theory behind it? Thank you.

threshold = -0.5 * (np.log(weights**2 / var) - means**2 / var).dot([1, -1]) / (means / var).dot([1, -1])
Hello, I tried to understand this formula, but I failed. What is the meaning of this formula? Looking forward to your reply. Thanks.

This is related to the problem of score calibration in speaker recognition using a linear Gaussian backend.

In the twoGMMcalib_lin function:
Each of the input scores that we get from the pairwise similarity matrix corresponds to a pair of embeddings that come from the same speaker or from two different speakers.
We assume that the same-speaker scores will be generally higher than the different-speaker scores.
Let’s therefore consider that these scores correspond to two classes (same-speaker or different speakers).
We assume that the distributions of both classes are Gaussians with the same variance (shared variance) and different means.
Since we don’t know which class each score belongs to, we train a GMM with two Gaussian components with shared variance in an unsupervised way, and we take the resulting Gaussian distributions as the distributions of the two classes. We also learn the Gaussian component weights, which we treat as the priors of the two classes.
We assume that the Gaussian distribution with the higher mean corresponds to the same-speaker class.
Now we can think of it as a classification problem in which we want to classify the scores into the two classes. To do so, we need to find the threshold that best separates the two classes, so that all scores above the threshold correspond to pairs of embeddings that are more likely to come from the same speaker.
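To make the unsupervised training step concrete, here is a minimal EM sketch for fitting such a two-Gaussian, shared-variance model to the scores (a sketch of the idea only; the function name and the `scores` input are illustrative, and the actual code in the repository may differ in details):

```python
import numpy as np
from scipy.special import softmax

def fit_two_gmm_shared_var(scores, niters=20):
    """EM for a 1-D two-component GMM with a single shared variance.

    `scores` would be the flattened pairwise similarity scores.
    """
    weights = np.array([0.5, 0.5])                       # class priors p(C1), p(C2)
    means = np.mean(scores) + np.std(scores) * np.array([-1, 1])
    var = np.var(scores)                                 # shared variance
    for _ in range(niters):
        # E-step: log-likelihood of each score under each weighted component
        lls = np.log(weights) - 0.5 * np.log(var) \
              - 0.5 * (scores[:, np.newaxis] - means) ** 2 / var
        gammas = softmax(lls, axis=1)                    # responsibilities
        # M-step: update priors, means, and the shared variance
        counts = gammas.sum(axis=0)
        weights = counts / counts.sum()
        means = scores.dot(gammas) / counts
        var = ((scores ** 2).dot(gammas) / counts - means ** 2).dot(weights)
    return weights, means, var
```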

Therefore, we need to find the threshold at which the posterior probabilities of the two classes are equal.
Let C1 and C2 be the classes and s the score; we look for:

p(C1|s)=p(C2|s)
p(s|C1)p(C1)=p(s|C2)p(C2)

where p(C1) and p(C2) are given by the weights of the GMM.
Let w1 and w2 be the weights of the GMM, m1 and m2 the means of the Gaussian components, and v the shared variance. In the log domain, we are looking for:

log p(s|C1)+log p(C1)=log p(s|C2)+log p(C2)
log p(s|C1)+log p(C1)-log p(s|C2)-log p(C2)=0
-0.5((s-m1)^2)/v+log w1 + 0.5((s-m2)^2)/v-log w2=0

where we have already removed the normalizers, which cancel in the equation.

-0.5(s^2-2 s m1+m1^2)/v + 0.5(s^2 - 2 s m2+ m2^2)/v + log w1 - log w2=0
-0.5(-2 s m1+m1^2)/v + 0.5(- 2 s m2+ m2^2)/v + log w1 - log w2=0
s(m1-m2)/v - 0.5 (m1^2 - m2^2)/v + log w1 - log w2 =0

s = (0.5(m1^2 - m2^2)/v - log w1 + log w2) / ((m1-m2)/v)
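As a quick sanity check of this formula, we can plug in some made-up parameters (w1, w2, m1, m2 and v below are hypothetical, not values from real data) and verify that at the resulting threshold both weighted class likelihoods coincide:

```python
import numpy as np

w1, w2 = 0.3, 0.7      # hypothetical priors p(C1), p(C2)
m1, m2 = 2.0, -1.0     # hypothetical means (m1 > m2: same-speaker class)
v = 1.5                # hypothetical shared variance

s = (0.5 * (m1**2 - m2**2) / v - np.log(w1) + np.log(w2)) / ((m1 - m2) / v)

# At the threshold, the weighted likelihoods of both classes should match
# (the Gaussian normalizer is shared and cancels, so it is omitted here):
lhs = w1 * np.exp(-0.5 * (s - m1) ** 2 / v)
rhs = w2 * np.exp(-0.5 * (s - m2) ** 2 / v)
assert np.isclose(lhs, rhs)
```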

which (after some manipulation) is the same as the code:

threshold = -0.5 * (np.log(weights**2 / var) - means**2 / var).dot([1, -1]) / (means / var).dot([1, -1])

except that the var in “weights**2 / var” does not appear in our derivation, because it cancels between the two components (it can be considered redundant in the code).
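One way to convince yourself of this is to evaluate the code expression and the derived expression numerically with arbitrary (hypothetical) parameters and check that they agree:

```python
import numpy as np

weights = np.array([0.3, 0.7])   # hypothetical fitted weights
means = np.array([2.0, -1.0])    # hypothetical fitted means
var = 1.5                        # hypothetical shared variance

# Expression as written in the code:
code = -0.5 * (np.log(weights**2 / var) - means**2 / var).dot([1, -1]) \
       / (means / var).dot([1, -1])

# Expression from the derivation above:
derived = (0.5 * (means[0]**2 - means[1]**2) / var
           - np.log(weights[0]) + np.log(weights[1])) / ((means[0] - means[1]) / var)

assert np.isclose(code, derived)  # the extra /var inside the log cancels out
```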

OK, thanks for the details. I get it. Thank you very much.