google/uis-rnn

[Question] How to prepare embedding data for training UIS-RNN?

OpenCVnoob opened this issue · 5 comments

Describe the question

Hi, thank you for open-sourcing this!
I have read the 'README.md' file and almost all the issues under this repo, but I'm still puzzled about data pre-processing.

My understanding is that before training the UIS-RNN, a speaker embedding network should be trained in advance on single-speaker, utterance-level features, as described in the GE2E loss paper. After that, frame-level features generated from the raw audio are fed into the embedding network to produce frame-level embeddings, which I can then use to train my UIS-RNN. Is that correct? I'm also wondering whether these frame-level embeddings are the 'continuous d-vector embeddings (as sequences)' you mentioned here.

I am a newcomer to speaker diarization and this question really confuses me, so I'd be very grateful if you could help. Thanks :)

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes

Hi,

Short answer

You should NOT use frame-level embeddings. You should use segment-level embeddings, and the corresponding segment-level speaker labels.
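For concreteness, here is a minimal sketch (my own illustration based on the README, not an official recipe) of what segment-level training data looks like: a 2-dim float array with one row per segment, and a matching list of string speaker labels of the same length. The segment count, embedding dimension, and speaker names below are made-up placeholders.

```python
import numpy as np

# Hypothetical numbers for illustration: 10 segments, 256-dim d-vectors.
num_segments = 10
embedding_dim = 256

# train_sequence: one row per SEGMENT (not per frame), in time order.
train_sequence = np.random.rand(num_segments, embedding_dim)

# train_cluster_id: one string speaker label per segment,
# same length as train_sequence has rows.
train_cluster_id = ['spk_0'] * 4 + ['spk_1'] * 3 + ['spk_0'] * 3

assert train_sequence.shape == (num_segments, embedding_dim)
assert len(train_cluster_id) == num_segments

# With the uisrnn package installed, training would then look like:
#   model_args, training_args, _ = uisrnn.parse_arguments()
#   model = uisrnn.UISRNN(model_args)
#   model.fit(train_sequence, train_cluster_id, training_args)
```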

Why?

  1. Frame-level embeddings are too numerous, making the sequence too long, thus:
    a. Too expensive to train on.
    b. Too much information for the GRU to memorize.
  2. GE2E training is based on sliding windows, and only the last-frame output of the speaker encoder is used in training. So at inference time you should also use window embeddings. We use aggregated segment embeddings instead of window embeddings for UIS-RNN mostly for speed; technically you could also use window embeddings directly.
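As a rough illustration of point 2, one common way to aggregate sliding-window d-vectors into a single segment embedding (an assumption on my part, not something prescribed by this repo) is to average the L2-normalized window embeddings and re-normalize:

```python
import numpy as np

def aggregate_windows(window_embeddings):
    """Average L2-normalized window d-vectors into one segment embedding.

    window_embeddings: array of shape (num_windows, dim), one row per
    sliding window that falls inside the segment.
    """
    # Normalize each window d-vector to unit length before averaging.
    normed = window_embeddings / np.linalg.norm(
        window_embeddings, axis=1, keepdims=True)
    segment = normed.mean(axis=0)
    # Re-normalize the average so the segment embedding is unit-length too.
    return segment / np.linalg.norm(segment)

# Example: 5 windows of hypothetical 256-dim d-vectors from one segment.
windows = np.random.rand(5, 256)
segment_embedding = aggregate_windows(windows)
assert segment_embedding.shape == (256,)
assert np.isclose(np.linalg.norm(segment_embedding), 1.0)
```

Each row of the UIS-RNN training sequence would then be one such segment embedding.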

Thanks for your reply! I got it

@OpenCVnoob Hello, I've run into the same problem as you. Have you solved it?
Can you tell me how you dealt with this issue?

@OpenCVnoob I ran the project with my own datasets, and the printed result is bad.

@Aurora11111
I am new to speaker diarization. I used the deep-speaker code for embedding extraction. I have a 1-minute audio clip with 3 speakers; I fed it to the deep-speaker model and got the embeddings, but I don't know how to use them for UIS-RNN integration. Could you please help me?