bnosac/word2vec

avoid file

Closed this issue · 6 comments

Allow building models directly from a character vector instead of loading the text from a file,
as going through a file is sometimes annoying with small data, especially with UTF-8.

Hey! I ran into this issue yesterday and just created a pull request to fix it.
You can install my modified version of word2vec using devtools::install_github('randef1ned/word2vec').

With your change, word2vec basically no longer works when one provides the path to a file.
The objective of this issue was to no longer work with files but pass the text directly from R to C++ instead of writing the text from R to a file and reading the file into C++ afterwards.

That literally cannot happen.
Original code: the word2vec() function identifies input argument x as a file if its length is 1.
Updated code: the word2vec() function identifies input argument x as a file if it has a length of 1 and is a valid file.
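The updated rule can be sketched as follows. This is only an illustration written in C++; the actual check in the pull request lives in the R code, and the helper name is_file_input is hypothetical, not something from the package.

```cpp
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper, not part of the word2vec package.
// Treat x as a file path only if it has exactly one element
// AND that element names an existing regular file.
bool is_file_input(const std::vector<std::string>& x) {
    if (x.size() != 1) {
        return false;  // several strings: must be text, not a path
    }
    std::error_code ec;  // non-throwing overload: errors count as "not a file"
    return std::filesystem::is_regular_file(std::filesystem::path(x[0]), ec);
}
```

Under this rule a single sentence passed as x is treated as text unless a file with exactly that name happens to exist.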

Yes, the ultimate goal was to pass the variable directly to the C++ function; however, that couldn't be done without changing the C++ code.

Yes, that is the objective of this issue: change the C++ code so that, instead of writing the text from R to a file and reading the file into C++ afterwards, we pass the texts directly to C++. That indeed requires changes to the C++ code. Are you up to the challenge?
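To make the idea concrete, here is a minimal sketch of a C++ routine that consumes sentences already held in memory rather than reading them back from a file. It is not the package's code: the names Sentence and count_vocabulary are hypothetical, and only the vocabulary-counting step is shown, assuming the texts arrive already tokenised.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: one tokenised sentence, passed in from the caller.
using Sentence = std::vector<std::string>;

// Hypothetical function, not the package's API.
// Counts word frequencies straight from in-memory sentences and
// drops every word that occurs fewer than min_count times.
std::unordered_map<std::string, std::size_t>
count_vocabulary(const std::vector<Sentence>& sentences, std::size_t min_count) {
    std::unordered_map<std::string, std::size_t> counts;
    for (const auto& sentence : sentences) {
        for (const auto& word : sentence) {
            ++counts[word];
        }
    }
    // Prune rare words; erase() returns the iterator to the next element.
    for (auto it = counts.begin(); it != counts.end();) {
        if (it->second < min_count) {
            it = counts.erase(it);
        } else {
            ++it;
        }
    }
    return counts;
}
```

No temporary file is involved here, so there is no encoding round trip between R and C++; the caller only has to hand over the tokens.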

I skimmed through the C++ code inside this package and found that it actually uses file pointers (or something like that) to speed things up. I have doubts about the efficiency of passing the texts directly to the main function.

My schedule is busy: I have been working at least 12 hours a day recently. If I find some spare time, I will look into it and try my best to tackle this challenge.

Since word2vec 0.4.0, you can build a word2vec model from a list of tokenised sentences. Examples are in the README and in the documentation (?word2vec).