Language distribution
hbredin opened this issue · 3 comments
Thanks for this dataset: a very valuable contribution to the speaker diarization community!
The paper (Section 3, Stage 1) says:
"To improve the language diversity, we change the website location or use Google Translate to translate those English keywords into different languages such as Chinese, Thai, Korean, Japanese, German, Portuguese, and Arabic."
So it sounds like MSDWild is one of the few publicly available speaker diarization datasets that is actually multilingual. Do you have an estimate of the distribution of languages in MSDWild that you could share?
Hello. My apologies for the late reply. When we initially collected the data, we did some statistical analysis of its language distribution. This table presents the distribution statistics: some of the labels were extracted from the videos' meta information, while others were annotated by our tagging team. Nonetheless, we believe that the current trends across the dataset remain consistent.
Thanks a lot! Better late than never 👍
Thank you for your question. If you have any further questions, please feel free to open a new issue.