[Question] How to prepare embedding data for training UIS-RNN?

Question


OpenCVnoob opened this issue 6 years ago · 5 comments

Describe the question

Hi, thank you for open-sourcing this!
I have read the 'README.md' file and almost all the issues under this repo, but I'm still puzzled about data pre-processing.

My understanding is that before training UIS-RNN, a speaker embedding network should first be trained on single-speaker utterance-level features, as described in the GE2E loss paper. After that, frame-level features generated from the raw data are fed to the embedding network to produce frame-level embeddings, which I can then use to train my UIS-RNN. Is that right? I'm wondering whether these frame-level embeddings are the 'continuous d-vector embeddings (as sequences)' you mentioned here.

I am a newcomer to speaker diarization and this question has really confused me, so I'd be very grateful if you could help. Thanks :)

My background

Have I read the README.md file?

yes

Have I searched for similar questions from closed issues?

yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

yes

Answer 1 · 2019-02-25T16:10:09.000Z

Hi,

Short answer

You should NOT use frame-level embeddings. You should use segment-level embeddings, and the corresponding segment-level speaker labels.

Why?

1. There are too many frame-level embeddings, which makes the sequence too long, so it is:
   a. Too expensive to train.
   b. Too much information for the GRU to memorize.
2. GE2E training is based on windows. Only the last-frame output of the speaker encoder is used in training, so at inference you should also use only window embeddings. We use aggregated segment embeddings instead of window embeddings for UIS-RNN mostly for speed. Technically, you can also use window embeddings directly.
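
To illustrate what "aggregated segment embeddings" could look like in practice, here is a minimal sketch, not the authors' actual pipeline. It assumes you already have one embedding and one speaker label per sliding window, and that a segment is simply a fixed number of consecutive windows. The function name `aggregate_windows`, the parameter `windows_per_segment`, the L2 normalization, and the majority-vote labeling are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def aggregate_windows(window_embeddings, window_labels, windows_per_segment=3):
    """Average consecutive window embeddings into segment-level embeddings.

    window_embeddings: 2-D float array, one row per sliding window.
    window_labels: sequence of speaker labels, one per window.
    windows_per_segment: hypothetical fixed segment size (an assumption).
    """
    window_embeddings = np.asarray(window_embeddings, dtype=float)
    segment_embeddings = []
    segment_labels = []
    for start in range(0, len(window_embeddings), windows_per_segment):
        chunk = window_embeddings[start:start + windows_per_segment]
        chunk_labels = list(window_labels[start:start + windows_per_segment])
        # L2-normalize each window embedding, then average over the segment.
        chunk = chunk / np.linalg.norm(chunk, axis=1, keepdims=True)
        mean = chunk.mean(axis=0)
        # Re-normalize the averaged vector (assumes it is not all zeros).
        segment_embeddings.append(mean / np.linalg.norm(mean))
        # Assumption: label the segment with its most frequent window label.
        segment_labels.append(Counter(chunk_labels).most_common(1)[0][0])
    return np.array(segment_embeddings), np.array(segment_labels)


# Toy usage: 7 random window embeddings of dimension 4, two speakers.
rng = np.random.default_rng(0)
windows = rng.normal(size=(7, 4))
labels = ["A", "A", "A", "B", "B", "B", "B"]
segments, segment_ids = aggregate_windows(windows, labels)
```

The resulting sequence has one row per segment rather than one per frame or window, which is what keeps training affordable. In real data a segment should contain a single speaker, so segment boundaries would normally be cut at speaker changes instead of at a fixed window count.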

Answer 2 · 2019-02-25T16:27:40.000Z

Thanks for your reply! I got it

Answer 3 · 2019-02-28T08:28:11.000Z

@OpenCVnoob Hello, I have run into the same problem as you. Have you solved it?
Can you tell me how to deal with this issue?

Answer 4 · 2019-03-01T12:25:17.000Z

Oh, sorry, I didn't notice this until now. I am still trying to find a good way to segment audio into single-speaker segments. Besides, there is no suitable dataset available to me, so I'm not sure when I will solve this issue.

Answer 5 · 2019-03-21T03:18:23.000Z

@OpenCVnoob I ran the project with my own dataset, and the printed results are bad.

Answer 6 · 2019-03-28T11:34:55.000Z

@Aurora11111
I am new to speaker diarization. I used the Deep Speaker code for embedding extraction. I have a 1-minute audio clip with 3 speakers. I fed it to the Deep Speaker model and got the embeddings, but I don't know how to use them with UIS-RNN. Could you please help me?