liyunlongaaa/NSD-MS2S

diarization

dutchsing009 opened this issue · 7 comments

Would this algorithm be efficient enough to diarize a video like that one, or is it overkill, given that there should be no overlapping speakers at least 99% of the time? If it is suitable, how should I start? A little guide would be amazing.

I think diarizing this kind of animation is relatively easy, and you can use the code in this repo to do it. You could also try multi-modal speaker diarization: this kind of animation comes with obvious lip movement that is easy to capture, so it may give better results than audio alone. That is my suggestion.

Thank you so much, I will try your code and let you know. Do you have any suggestions for multi-modal speaker diarization?
For example, which repo would best fit my case, in your opinion?
Thanks in advance.

Unfortunately, as far as I know, open-source multimodal diarization doesn't work very well right now. Also, multimodal diarization is only relatively easy if you have the corresponding training data; without that data, it is not easy either.

But I can point you to the latest multimodal diarization SOTA: https://arxiv.org/html/2401.08052v2

Here is our team's previous work: https://github.com/mispchallenge/misp2022_baseline/tree/main/track1_AVSD. Although it doesn't work that well, it is one of the few open-source multimodal diarization projects I know of.

That's OK, thanks for all this info. By the way, I talked to the author of https://arxiv.org/pdf/2312.05730.pdf, and he said the most similar one to it is https://github.com/showlab/AVA-AVD. Anyway, I will use your code for starters. One last thing: is there anything I need to modify in your code for my use case? What do you think?

First, prepare the training set as described in the README: find some open-source English diarization data online, extract the fbank features of the audio, and then follow the training instructions in the README. If you do not have a background in audio signal processing and run into trouble, please let me know what you encounter.
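For readers unfamiliar with the fbank step, here is a minimal NumPy sketch of log-Mel filterbank extraction. It is only an illustration of the general technique: the function names, the 16 kHz sample rate, the 25 ms / 10 ms framing, the Hamming window, and the 40 filters are all assumptions made for this example, not values taken from this repo. Check the README for the feature configuration the training scripts actually expect.

```python
import numpy as np


def hz_to_mel(hz):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Build triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_fbank(signal, sample_rate=16000, n_mels=40,
              frame_len_ms=25.0, frame_shift_ms=10.0, n_fft=512):
    """Return log-Mel filterbank features with shape (n_frames, n_mels).

    All default values are illustrative assumptions, not this repo's settings.
    """
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(sample_rate * frame_len_ms / 1000.0)
    frame_shift = int(sample_rate * frame_shift_ms / 1000.0)
    if frame_len > n_fft:
        raise ValueError("n_fft must be at least the frame length in samples")
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Slice the waveform into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Power spectrum per frame, then project onto the mel filters.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T

    # A small floor avoids log(0) on silent frames.
    return np.log(mel_energy + 1e-10)


if __name__ == "__main__":
    # One second of a synthetic 1 kHz tone stands in for real audio here.
    t = np.arange(16000) / 16000.0
    feats = log_fbank(np.sin(2.0 * np.pi * 1000.0 * t))
    print(feats.shape)
```

In practice you would load each recording as a mono waveform at the expected sample rate, run it through a function like `log_fbank`, and save the resulting matrix in whatever layout the README's data preparation step asks for. An established toolkit's fbank implementation is usually a better choice than hand-written code for real training.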