kakaobrain/word2word

Training a new language pair from scratch

loretoparisi opened this issue · 2 comments

Thanks for this amazing project. My main question is about training.
I have noticed that among the Indian languages, Marathi (mr) is missing, and I wonder how to train the model from scratch given an input parallel corpus such as OpenSubtitles, TED Talks, or the Tatoeba parallel sentences corpus.
In my case, for Indian languages (and a transliteration task), my reference is IndicTrans: https://github.com/libindic/indic-trans
A minor question is about the w2w, PMI, and CPE formulas. A great solution! It reminds me of Peter Norvig's probabilistic spell checker, where conditional probabilities are taken into account, while your CPE score is what makes the difference. Just my wonder :)
Thank you.
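For what it's worth, the PMI scoring mentioned above can be illustrated on a toy parallel corpus. This is a minimal sketch of sentence-level PMI estimation, not the repo's actual implementation; the tiny en-es sentence pairs are made up for illustration:

```python
import math
from collections import Counter

# Hypothetical toy parallel corpus (en-es), for illustration only
pairs = [
    ("the cat sleeps", "el gato duerme"),
    ("the dog sleeps", "el perro duerme"),
    ("the cat eats", "el gato come"),
]

src_counts, tgt_counts, co_counts = Counter(), Counter(), Counter()
n = len(pairs)
for src, tgt in pairs:
    s_toks, t_toks = set(src.split()), set(tgt.split())
    # Count each word once per sentence (sentence-level co-occurrence)
    for x in s_toks:
        src_counts[x] += 1
    for y in t_toks:
        tgt_counts[y] += 1
    for x in s_toks:
        for y in t_toks:
            co_counts[(x, y)] += 1

def pmi(x, y):
    # PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), with probabilities
    # estimated as sentence-level relative frequencies
    p_xy = co_counts[(x, y)] / n
    return math.log(p_xy / ((src_counts[x] / n) * (tgt_counts[y] / n)))

print(pmi("cat", "gato"))  # positive: "cat" and "gato" are associated
print(pmi("the", "el"))    # 0.0: both appear in every sentence
```

Note how the ubiquitous pair "the"/"el" gets PMI 0 while "cat"/"gato" scores positive, which is exactly the kind of distinction raw co-occurrence counts miss.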

Regarding the main question, check out https://github.com/Kyubyong/word2word/blob/master/make.py.
You can train a model from scratch as long as you have parallel data.
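Conceptually, the core of training on a parallel corpus is collecting sentence-level co-occurrence counts and ranking candidate translations per source word. The following is a simplified sketch using plain conditional probability p(y|x) on made-up tokenized sentence pairs; word2word's CPE score refines this kind of ranking (as I understand it, to demote mere collocates), so this is not the repo's exact method:

```python
from collections import Counter, defaultdict

# Hypothetical tokenized parallel sentences (e.g. from OpenSubtitles
# or Tatoeba); real training runs make.py on a full corpus.
parallel = [
    (["I", "see", "you"], ["mi", "vin", "vidas"]),
    (["I", "see", "him"], ["mi", "lin", "vidas"]),
]

co = defaultdict(Counter)  # co[x][y] = sentence-level co-occurrence count
x_count = Counter()        # number of sentences containing x
for src, tgt in parallel:
    for x in set(src):
        x_count[x] += 1
        for y in set(tgt):
            co[x][y] += 1

def top_translations(x, k=3):
    # Rank target words by conditional probability p(y | x) =
    # count(x, y) / count(x); a stand-in for the repo's scoring.
    return [(y, c / x_count[x]) for y, c in co[x].most_common(k)]

print(top_translations("see"))
```

Even on two sentences this surfaces plausible candidates ("vidas" scores 1.0 for "see"), though it also ranks the collocate "mi" at 1.0, which is the ambiguity the CPE scoring in the paper is designed to address.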

@kimdwkimdw thanks a lot, I had not noticed that; closing, then!