BUTSpeechFIT/VBx

Attempting to fix maxSpeakers to 2 sometimes yields a forward-backward error

AdolfVonKleist opened this issue · 5 comments

I have a rather long audio file, about 34 minutes, which I have run with run_example.sh and my own .lab file for this recording. For most recordings this works great. But for one I hit the following error when fixing the maxSpeakers variable to 2 (it works if I don't specify it, but then it identifies 3 speakers, and I know there are only 2). So far I have been unable to find the cause in the logs.

[-57.92396578 -71.68007889   0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.        ] [-0.69314718 -0.69314718]
Traceback (most recent call last):
  File "VBx/vbhmm.py", line 147, in <module>
    loopProb=args.loopP, Fa=args.Fa, Fb=args.Fb)
  File "VBx/VB_diarization.py", line 130, in VB_diarization
    gamma, tll, lf, lb = forward_backward(lls.repeat(minDur,axis=1), tr, ip) #, np.arange(1,maxSpeakers+1)*minDur-1)
  File "VBx/VB_diarization.py", line 285, in forward_backward
    lfw[0] = lls[0] + np.log(ip)
ValueError: operands could not be broadcast together with shapes (12,) (2,)

Never mind, I see that if I know the speaker count a priori, I can just replace the call to AHC with a random initialization of the labels1st array:

import numpy as np

def RandomInit(sim_mx, maxSpeakers):
    # Only the size of sim_mx is used: one label per row (per x-vector).
    # Labels are drawn uniformly from 0 .. maxSpeakers-1 (upper bound exclusive).
    labs = np.random.randint(0, maxSpeakers, size=len(sim_mx))

    return labs

I don't know if this would be considered the 'recommended' approach, but it appears to work quite well in practice.

Yes, as you say they have to match. I guess you were setting maxSpeakers without updating qinit. I am glad you solved it.
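To make the "they have to match" point concrete, here is a minimal sketch, not the VBx code itself. The helper name `one_hot_qinit` and the toy shapes are assumptions for illustration. It builds an initialization matrix with exactly as many columns as speakers, then reproduces the same kind of broadcast failure as in the traceback by adding a 12-column row to a 2-element vector of log initial probabilities.

```python
import numpy as np


def one_hot_qinit(labels, n_speakers):
    """Hypothetical helper: one-hot matrix of shape (len(labels), n_speakers)."""
    labels = np.asarray(labels)
    if labels.min() < 0 or labels.max() >= n_speakers:
        raise ValueError("labels must lie in 0 .. n_speakers-1")
    qinit = np.zeros((len(labels), n_speakers))
    qinit[np.arange(len(labels)), labels] = 1.0
    return qinit


# Consistent case: 2 speakers, 2 columns, 2 initial probabilities.
ip = np.full(2, 0.5)
qinit = one_hot_qinit([0, 1, 1, 0], n_speakers=2)
first_row = qinit[0] + np.log(ip)  # shapes (2,) and (2,) broadcast fine

# Mismatched case: a 12-column row against 2 initial probabilities,
# the same shapes that appear in the traceback above.
lls = np.zeros((4, 12))
try:
    lls[0] + np.log(ip)
    mismatch_message = ""
except ValueError as err:
    mismatch_message = str(err)
```

Checking `qinit.shape[1] == len(ip)` before running inference would turn this into an early, readable error instead of a failure deep inside forward_backward.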

As for the random initialization, it makes sense that it produces good results. For instance, this was explored with the previous version of this model, which used MFCCs as input instead of x-vectors. There is a comparison in Table II here, and you can see that a single random initialization does a decent job. Maybe the difference is smaller with VBx because the speaker models are better than in that version.
In the table there is also the "rand x5" case, which runs 5 different random initializations and picks the one with the best ELBO (the criterion used in the inference), and you can see that the performance can improve. The problem is that running several initializations takes time, and in general it is better to start from something that is already reasonable, like the output of AHC.
We have not explored random initialization much with the model that uses x-vectors, but it should work well and is definitely not "non-recommended" ;)

Thanks for the tips! Interesting about the random init. A naive add-on question: it looks like the random-init is actually much faster than AHC in practice for long recordings. I am trying to understand how I might test-drive this multi-init approach, but I'm not sure I have a full grasp of the implementation.

    # Right after updating q(Z), tll is E{log p(X|Y,Z)} - KL{q(Z)||p(Z)}.
    # L now contains -KL{q(Y)||p(Y)}. Therefore, L+tll is the correct value for ELBO.
    L += tll
    Li.append([L])

It looks like this is being iteratively updated, with each update appended to the Li list. Does this mean that I can treat the final value of L as the de facto ELBO value for a particular random-init run? Then my 'job' is just picking the best ELBO from among the different inits?

Also, it turns out that the initial file I thought to be 'problematic', which triggered this question and was supposed to have 2 speakers, actually had 3 speakers in it to begin with; the AHC did a better job than me in correctly identifying the actual number of speakers.

I started to write how to do it but realized that this can probably be useful for other people running on very long files. I have just pushed an option for random initializations (running 5 of them), see here.
You were right in what you said; the code I pushed basically does something in that direction.
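For readers who want to see the idea in isolation, the following is a minimal sketch of best-of-N random initialization. It is not the code pushed to the repository. The `run_inference` callable is a hypothetical stand-in for a VB inference run and is assumed to return the final assignments together with a list of per-iteration ELBO values (plain floats here, whereas the excerpt above appends `[L]` to `Li`). The toy implementation exists only so the sketch runs.

```python
import numpy as np


def best_of_n_inits(run_inference, n_frames, n_speakers, n_inits=5, seed=0):
    """Run several random initializations and keep the one with the highest final ELBO."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_inits):
        # Uniform random label per frame, in 0 .. n_speakers-1.
        labels = rng.integers(0, n_speakers, size=n_frames)
        assignments, elbo_history = run_inference(labels)
        final_elbo = float(elbo_history[-1])  # last iteration's ELBO
        if best is None or final_elbo > best[1]:
            best = (assignments, final_elbo)
    return best


def toy_inference(labels):
    """Toy stand-in, NOT real VB inference: scores an init by how many labels are 0."""
    score = -float(np.sum(labels != 0))
    return labels, [score - 1.0, score]


best_assignments, best_elbo = best_of_n_inits(toy_inference, n_frames=10, n_speakers=2)
```

Comparing only the final ELBO of each run matches the selection criterion described for the "rand x5" case; the cost is that total run time grows linearly with the number of initializations.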

Funny that the code did better than you! But glad you brought up the issue. Let me know if you hit any problem with the new code, and feel free to adjust the MAX_SPKS variable or the number of random inits to your needs.

Hey great thanks. I think now it is all clear. Please close as you see fit.

Funny that the code did better than you!

Yeah! It was a recording in a language I don't speak, and didn't follow the format it was 'supposed to', but I was still surprised.