bnosac/word2vec

Expand functionality to different word embedding files

Opened this issue · 2 comments

Although there is a read.wordvectors function that can read in a plain text file with vectors, the predict.word2vec function only works on 'model' objects, which cannot be created from these word vector files.

Would it be possible to have the predict.word2vec function work directly on an embedding matrix? That way it could be used with all types of word vector models, e.g. ones trained with fastText.
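To make the question concrete, here is a minimal base-R sketch of turning a plain-text word-vector file (one word per line, followed by its numbers) into the kind of embedding matrix being asked about. The file contents are made up for illustration; in practice read.wordvectors from this package does this parsing for you.

```r
# Toy plain-text word vectors (hypothetical data, word2vec .txt style)
txt <- c("king 0.1 0.3 0.5",
         "queen 0.2 0.3 0.4",
         "apple 0.9 0.1 0.0")
f <- tempfile(fileext = ".txt")
writeLines(txt, f)

# Parse into a numeric matrix with the words as rownames
parts <- strsplit(readLines(f), " ")
words <- vapply(parts, `[[`, character(1), 1)
emb   <- t(vapply(parts, function(x) as.numeric(x[-1]), numeric(3)))
rownames(emb) <- words
emb
```

The resulting matrix (words as rownames, dimensions as columns) is what one would like to feed to a prediction/similarity function directly.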

predict.word2vec does exactly the same as the function word2vec_similarity, which you can apply to 2 embedding matrices or vectors.

  • That works well on embeddings trained with this package, as training is optimised for that similarity measure
  • but it might not be what you want if your embeddings were trained in another framework.

That being said, apply word2vec_similarity and see if it works for your embeddings.
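As a sanity check on embeddings from another framework, the cosine variant that word2vec_similarity(type = "cosine") computes can be reproduced in base R on any embedding matrix. The toy 3-dimensional vectors below are made up purely for illustration:

```r
# Hypothetical toy embeddings; rows are words, columns are dimensions
emb <- rbind(king  = c(0.10, 0.30, 0.50),
             queen = c(0.12, 0.28, 0.45),
             apple = c(0.90, 0.10, 0.05))

# Base-R sketch of cosine similarity between two embedding matrices:
# normalize rows to unit length, then take all pairwise dot products
cosine_sim <- function(x, y) {
  x <- x / sqrt(rowSums(x^2))
  y <- y / sqrt(rowSums(y^2))
  tcrossprod(x, y)   # n_x by n_y matrix of similarities
}

sim <- cosine_sim(emb["king", , drop = FALSE], emb)
sort(sim[1, ], decreasing = TRUE)
```

With sensible embeddings, semantically close words (here "queen") should rank above unrelated ones ("apple") for the query word.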

Note that if you need embedding models with subwords, you might as well use sentencepiece_download_model from the sentencepiece R package. This downloads a sentencepiece tokenizer alongside an embedding model trained on Wikipedia, and the result is compatible with this R package.