auspicious3000/contentvec

How to generate {train, valid}.km files?

Closed this issue · 5 comments

freds0 commented

Hi everyone, congratulations on your work on this model! I'm getting ready to train the ContentVec model on a larger dataset. I've already written a script to generate the spk2info.dict file, which I intend to share. Do you have any tips on how to generate the {train, valid}.km and dict.km.txt files? I'm studying the code in contentvec_dataset.py, but that part is really hard to follow.

freds0 commented

@auspicious3000 thanks a lot for the help! One last question: are there recommended values for the parameters ${nshard}, ${rank}, and ${n_clusters}, or can I choose them arbitrarily?

auspicious3000 commented

${nshard} and ${rank} are not model parameters. ${n_clusters} is an important parameter of the model. Reading the original HuBERT paper will help you understand its importance.
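
To make the distinction concrete, here is a minimal sketch. It assumes a fairseq HuBERT-style preprocessing recipe, which this thread does not confirm: ${nshard} and ${rank} only select which slice of the utterance list one job processes, while ${n_clusters} fixes the size of the label vocabulary that ends up in the .km files. All function names are hypothetical, the centroids are hand-picked toy values (a real pipeline would fit k-means on extracted features), and the file formats shown (one line of space-separated cluster ids per utterance, and "<id> 1" lines in dict.km.txt) are assumptions based on that recipe.

```python
def shard_range(total, nshard, rank):
    """Return [start, end) for the rank-th of nshard roughly equal slices."""
    assert 0 <= rank < nshard
    start = round(total / nshard * rank)
    end = round(total / nshard * (rank + 1))
    return start, end


def nearest_centroid(frame, centroids):
    """Index of the centroid closest to one feature frame (squared L2)."""
    dists = [sum((f - c) ** 2 for f, c in zip(frame, cen)) for cen in centroids]
    return dists.index(min(dists))


def km_line(frames, centroids):
    """One label line: space-separated cluster ids, one id per frame."""
    return " ".join(str(nearest_centroid(fr, centroids)) for fr in frames)


def dict_lines(n_clusters):
    """Dictionary lines: '<cluster id> <dummy count>' for every id (assumed format)."""
    return [f"{i} 1" for i in range(n_clusters)]


# Toy data: 5 utterances of 2-D "features" and 3 hand-picked centroids.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]  # n_clusters = 3
utterances = [
    [(0.1, 0.0), (0.9, 1.2)],
    [(4.8, 5.1)],
    [(1.1, 0.9)],
    [(0.0, 0.2), (5.2, 4.9)],
    [(0.8, 0.8)],
]

# nshard/rank change only how the work is split, not the labels themselves.
nshard = 2
for rank in range(nshard):
    start, end = shard_range(len(utterances), nshard, rank)
    for utt in utterances[start:end]:
        print(f"shard {rank}:", km_line(utt, centroids))

print("\n".join(dict_lines(len(centroids))))
```

Under these assumptions, any ${nshard} works as long as every ${rank} from 0 to ${nshard} - 1 is processed and the shards are concatenated in order, whereas changing ${n_clusters} changes the prediction targets the model is trained on.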

freds0 commented

Thank you very much @auspicious3000 !

gu76h commented

Could you tell me how to generate the spk2info.dict file? I have generated the {train, valid}.km files, but I don't know how to produce spk2info.dict. I can't train a new model because my spk2info.dict file is wrong.