google/uis-rnn

Confusion about predicted labels

clabornd opened this issue · 3 comments

My background

Have I read the README.md file?

  • yes/no - if you answered no, please stop filing the issue, and read it first

Have I searched for similar questions from closed issues?

  • yes/no - if you answered no, please do it first

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes/no

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes/no

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes/no

Describe the question

I just wanted to confirm that the output of model.predict(test_sequences, inference_args) should be a completely arbitrary list of labels. My concern is that I expected the labels to look like [0, 0, 1, 1, 2, 3 ... 7, 1, 2, 10]. Specifically, I expected the first speaker to always be assigned the arbitrary label '0', with subsequent speakers introduced as '1', '2', and so on. However, the output always looks something like [2, 7, 2, 2, 2, 2, 5, 7, 4, 4]. Is it okay for me to just interpret this as transitions between the four arbitrary speakers '2', '4', '5', and '7', and ignore that '0', '1', '3', and '6' are missing?

I tried crawling through the beam search implementation, but it's a bit too dense for me.
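
For context, here is roughly how I'm calling predict, following the README-style API; the model path and sequence shape below are just placeholders, not my actual data:

```python
import numpy as np
import uisrnn

# Standard setup from the README: parse args, build model, load trained weights.
model_args, training_args, inference_args = uisrnn.parse_arguments()
model = uisrnn.UISRNN(model_args)
model.load('saved_uisrnn_model.uisrnn')  # placeholder path to a trained model

# One test sequence of d-vectors: shape (num_observations, observation_dim).
test_sequence = np.random.rand(10, model_args.observation_dim)

# Returns a list of integer cluster labels, one per observation.
predicted_labels = model.predict(test_sequence, inference_args)
print(predicted_labels)  # e.g. [2, 7, 2, 2, 2, 2, 5, 7, 4, 4]
```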

@clabornd What's your --test_iteration argument?

If it is larger than 1, it may be that 0, 1, 3, and 6 only appeared in your first iteration.

Ah yes, thank you, I had --test_iteration=2. Setting it to 1 produces the 'expected' output. So the multiple iterations just produce a more stable result? And is it appropriate to interpret the output of [2, 7, 2, 2, 2, 2, 5, 7, 4, 4] as 'this utterance contained 4 speakers that spoke in order X'? Thanks for the quick response.
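
For reference, here is roughly what I'm running now. I'm assuming the --test_iteration flag surfaces as an attribute of the same name on inference_args (just a sketch, I haven't checked this against the parser):

```python
# Force a single decoding pass so labels come from one iteration only.
inference_args.test_iteration = 1
predicted_labels = model.predict(test_sequence, inference_args)
```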

@clabornd

is it appropriate to interpret the output of [2, 7, 2, 2, 2, 2, 5, 7, 4, 4] as 'this utterance contained 4 speakers that spoke in order X'?

Yes, it is correct.
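
For anyone who wants the [0, 0, 1, ...] style labels back, here is a small post-processing sketch (plain Python, not part of the uis-rnn API) that renumbers the arbitrary IDs by order of first appearance:

```python
def relabel_by_first_appearance(labels):
    """Map arbitrary cluster IDs to speaker indices in order of first appearance."""
    mapping = {}
    normalized = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)  # next unused speaker index
        normalized.append(mapping[label])
    return normalized

print(relabel_by_first_appearance([2, 7, 2, 2, 2, 2, 5, 7, 4, 4]))
# [0, 1, 0, 0, 0, 0, 2, 1, 3, 3]
```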