belerico/word2vec

Little typo "traingular" -> "triangular"

Opened this issue · 3 comments

Hi @belerico,
I am looking at your implementation of W2V and I think it's great. I am just opening an issue as I have noticed this small typo here

"traingular_decay",

Hi, sorry for the late response: you're right :) Now it should be correct.
If you have any additional suggestions feel free to tell me.

Hi @belerico,

Thanks a lot for picking this up.

I think this is a great PyTorch implementation - the best I've found out there. One may ask: "why reinvent the wheel when there's already Gensim?". I think the answer is that this implementation is completely hackable!
Therefore, to make it more accessible, as next steps I would suggest the following:

  • Create documentation

  • Use the README.md to present the library

  • Provide benchmarks (on CPU and a single GPU, as most users will likely have access to that hardware) against the Gensim implementation

  • Profile your code to check where there is room for performance gains. Most probably the data preparation part will be the most time-consuming. If so, I would suggest trying to JIT-compile some parts of the code, like what they did with torchtext (example here) - a rough sketch of the idea follows below
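For reference, this is roughly what I have in mind with TorchScript. It's only a sketch: the `skipgram_pairs` helper and its signature are made up for illustration and are not your actual API, but scripting a hot loop like this is in the spirit of what torchtext does for its pipelines:

```python
from typing import List

import torch

@torch.jit.script
def skipgram_pairs(tokens: torch.Tensor, window: int) -> torch.Tensor:
    # Build (center, context) index pairs within a fixed window.
    pairs = torch.jit.annotate(List[torch.Tensor], [])
    n = tokens.size(0)
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                pairs.append(torch.stack([tokens[i], tokens[j]]))
    return torch.stack(pairs)

print(skipgram_pairs(torch.arange(10), 2).shape)  # torch.Size([34, 2])
```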

Hi Pietro,
first of all thanks for the good review and suggestions!
Secondly, I'll try to answer all of your questions:

One may ask: "why reinvent the wheel when there's already Gensim?". I think the answer is that this implementation is completely hackable!

Yes, accessibility first, and second because it was part of the work for my master's thesis.

Create documentation and use the README.md to present the library

Absolutely, maybe one can also think of creating a docs page on readthedocs.org.

Provide benchmarks (on CPU and a single GPU, as most users will likely have access to that hardware) against the Gensim implementation
Profile your code to check where there is room for performance gains. Most probably the data preparation part will be the most time-consuming. If so, I would suggest trying to JIT-compile some parts of the code, like what they did with torchtext

Yep, I can do that! As of now this implementation is way slower than the Gensim one, and it also doesn't perform as well w.r.t. Spearman correlation on the standard datasets (WS-353, M-TURK, ...).

The slowness is due to the fact that this is pure Python code: one could think of providing a hybrid Cython/Python implementation, but before that I would try torchtext, as you suggested.

As for the performance part, the problem is that the C implementation, which Gensim is built upon, is iterative: it learns a vector one word at a time, while mine is not. If the same word appears, say, N times in a batch, then its representation is updated only once instead of N times as in Gensim's w2v (and the C one).
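To make the difference concrete, here is a toy sketch (not this repo's actual training loop): with a toy quadratic loss, N sequential SGD steps on the same vector end up somewhere different than one collapsed batched step, because each sequential step sees the vector already updated by the previous occurrences.

```python
import torch

# Toy illustration: a word occurring N times in one batch.
# Gensim/C-style training applies N sequential SGD steps, each seeing the
# already-updated vector; a batched update collapses the N occurrences into
# a single step computed from the original vector.
N, lr = 3, 0.1
def grad(v: torch.Tensor) -> torch.Tensor:
    return 2.0 * v  # gradient of a toy quadratic loss ||v||^2

v_iter = torch.ones(4)
for _ in range(N):          # iterative: v <- v - lr * grad(v), N times
    v_iter = v_iter - lr * grad(v_iter)

v_batch = torch.ones(4)     # batched: one step for all N occurrences
v_batch = v_batch - lr * grad(v_batch)

print(v_iter[0].item())   # 0.512  (= 0.8^3)
print(v_batch[0].item())  # 0.8
```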