BUTSpeechFIT/VBx

Problem of different embedding extractors

axuan731 opened this issue · 7 comments

Dear BUT team members:
Thank you very much for your contribution to speaker diarization. I have encountered some very strange problems with VBx clustering. I trained two models, ResNet34 and ResNet101, using the Wespeaker toolkit. Both models were trained with the same strategy and show no problems in terms of EER. When I extracted speaker embeddings on the AMI test set and used spectral clustering, ResNet101 performed better. Everything works fine up to this point.
When I trained PLDA for both models separately on VoxCeleb2, the ResNet34 model achieved a very low DER (2.61%) on the AMI test set (beamformed), but ResNet101 had a very high DER on the same data (DER > 20%). I used the same parameters for both: window length 1.5 s, window shift 0.6 s, Fa=0.3, Fb=17, loopP=0.99, Th=-0.015. I tried to improve ResNet101's performance by tuning these parameters, but never got good results. In addition, I previously tried the ResNet152 model provided by Wespeaker and again could not get better results than with ResNet34. I would greatly appreciate your assistance! ^_^
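For context, the sub-segmentation I refer to is the usual sliding-window one, roughly like this simplified sketch (not my actual script; the segment boundaries are just placeholders):

```python
# Simplified sketch: cut each VAD segment into 1.5 s sub-segments
# with a 0.6 s shift before extracting the embeddings.
def subsegment(vad_segments, win=1.5, shift=0.6):
    """vad_segments: list of (start, end) times in seconds."""
    subs = []
    for start, end in vad_segments:
        t = start
        while t + win < end:
            subs.append((t, t + win))
            t += shift
        subs.append((t, end))  # last, possibly shorter, window
    return subs

# e.g. subsegment([(0.0, 4.0)]) -> (0.0, 1.5), (0.6, 2.1), ..., (3.0, 4.0)
```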

Hi @axuan731
we are a bit puzzled by this as well since it seems you followed the same steps for both models. My guess is that perhaps there is some mismatch somewhere caused by an inadvertent error (such as using the PLDA model of one extractor with the embeddings of the other).
When you evaluated your x-vector extractors in terms of EER, did you use cosine similarity to compare the embeddings? If so, I would suggest also using the PLDA models you trained with the corresponding extractors. This way, you should be able to validate that the PLDA models are correct. If the EER is reasonable when using the PLDA, the next step could be to try running AHC based on the PLDA scores (as we used to do before replacing it with cosine similarity).
I hope these validation steps will help you find the problem.
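To illustrate the two checks (just a sketch, not our exact code; how you compute the PLDA or cosine scores is up to you):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# 1) Backend sanity check on verification trials:
#    `scores` are PLDA (or cosine) scores, `labels` are 1 for target trials, 0 otherwise.
def eer(scores, labels):
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# 2) AHC on the pairwise backend scores of the per-segment embeddings:
#    `sim` is an (N, N) score matrix, `threshold` a similarity threshold.
def ahc_labels(sim, threshold):
    sim = (sim + sim.T) / 2                      # symmetrize
    dist = sim.max() - sim                       # higher score -> smaller distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=sim.max() - threshold, criterion="distance")
```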

@axuan731 did you retrain the LDA and mean post-processing models?

Hello,
Thank you very much for your response, and I apologize for the delay in replying. First, I have confirmed that I am pairing each extractor with its own PLDA, i.e., I am not, for example, using ResNet101 as the embedding extractor with a PLDA trained on ResNet34 embeddings. Second, I used PLDA for scoring: the EER for ResNet34 on the VoxCeleb1 test set is 1.383%, while ResNet101's EER is 0.973%. This confirms that my PLDA training is not the issue. Lastly, I tested ResNet101 with PLDA-based AHC on the AMI test set, and the results show that its performance is still very poor. In addition, I have made further attempts, such as checking for overfitting in ResNet101: I tested the EER of the different embedding extractors on a Chinese speaker verification dataset, and ResNet101 performs well there. However, on VoxConverse the DER for ResNet101 is still higher than that of ResNet34.
I suspect that if the embedding extractor is not the issue, it might be related to the PLDA training data. For example, deeper networks might struggle to handle the differences between individual datasets. (The training script you provided uses datasets in different languages, which may help with this.)
If you have any further thoughts or suggestions, please feel free to let me know. Thank you very much!

Hello,

Thank you for your response. I have carefully examined the training script provided at "https://github.com/phonexiaresearch/VBx-training-recipe/blob/main/run.sh#L170". In stage 6, embeddings are extracted for each speaker's utterances, so that is where I replaced the corresponding ONNX model. Stage 7 does not require any replacement, so I believe my training procedure is correct. Moreover, the fact that my ResNet34 works well suggests that the LDA was trained correctly.
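Just to make sure we mean the same thing, by the mean/LDA post-processing I mean the usual chain applied to the raw embeddings before PLDA, roughly as in this simplified sketch (placeholder arrays; the actual files and exact ordering come from stage 7 of the recipe):

```python
import numpy as np

# Simplified sketch of the embedding post-processing: center with the
# training-set mean, project with LDA, then length-normalize.
def transform(xvecs, mean, lda):
    """xvecs: (N, D) raw embeddings, mean: (D,), lda: (D, lda_dim)."""
    x = (xvecs - mean) @ lda                               # centering + LDA
    return x / np.linalg.norm(x, axis=1, keepdims=True)    # length-norm
```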

I am sorry to read that none of those suggestions helped.
Just to recap, so far:

  1. The ResNet101-based extractor performs well regarding EER, using either cosine similarity or the PLDA backend
  2. Using the ResNet101-based extractor with PLDA-based AHC for diarization performs poorly
  3. VBx with the ResNet101-based extractor performs poorly

You followed exactly the same steps for the ResNet34 and ResNet101 models, so I guess the PLDA and the other transformations' parameters were estimated as they should be.
Have you tried using the ResNet101-based extractor with cosine-based AHC? This could help us know whether the problem is with the backend in the context of diarization or with the extraction of the embeddings.
The main difference with respect to using the extractor for verification is that in diarization we deal with short segments that partially overlap with each other and that are defined by the VAD. Perhaps there is some mismatch somewhere.
Another thing you can try is to compare the missed speech and false alarm components of the DER between ResNet34 and ResNet101 (no matter whether with AHC or VBx). They should be exactly the same, and the differences should only be in terms of speaker confusion. If they are not the same, it means the segments are being treated differently.
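If it helps, the per-component breakdown can be obtained, for instance, with pyannote.metrics (a sketch; in practice you would load the reference and system RTTMs instead of building the annotations by hand):

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy reference and hypothesis; replace with annotations loaded from RTTMs.
reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk1"
reference[Segment(10.0, 20.0)] = "spk2"

hypothesis = Annotation()
hypothesis[Segment(0.5, 10.0)] = "A"
hypothesis[Segment(10.0, 19.0)] = "B"

metric = DiarizationErrorRate()
detail = metric(reference, hypothesis, detailed=True)
# 'missed detection' and 'false alarm' (in seconds) should be identical for the
# ResNet34 and ResNet101 runs; only 'confusion' should differ.
print({k: round(v, 3) for k, v in detail.items()})
```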

Dear fnlandini,

Thank you very much for providing these suggestions. After a thorough review, I finally identified the issue in my code: the FBank features I used to train my models differ significantly from the Kaldi-style FBank features, as discussed in pytorch/audio#400. When I used matching FBank features, all the issues were resolved. Once again, thank you for your response and assistance. You can close this issue now.
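For anyone who runs into the same problem: the fix was to extract Kaldi-style FBank features on the PyTorch side as well. A minimal sketch with torchaudio (the parameter values below are common Kaldi defaults, not necessarily the exact configuration I used):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Kaldi-compatible filterbanks via torchaudio; the options below (dither,
# window type, snip_edges, number of mel bins) are the ones that most often
# differ from a plain torchaudio.transforms.MelSpectrogram pipeline.
waveform, sr = torchaudio.load("utt.wav")   # placeholder file name
fbank = kaldi.fbank(
    waveform,
    num_mel_bins=80,        # must match what the extractor was trained on
    frame_length=25.0,      # ms
    frame_shift=10.0,       # ms
    dither=0.0,
    window_type="povey",
    snip_edges=True,
    sample_frequency=sr,
)
```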

Great that you found it!