avoid file

Question

Closed this issue 9 months ago · 6 comments

Allow building models directly from a character vector instead of loading the text from a file,
since going through a file is sometimes annoying with small data, especially with UTF-8 text.

Answer 1 · 2023-07-05T02:42:59.000Z

Hey! I ran into this issue yesterday and have just created a pull request to fix it.
You can install my modified version of word2vec using devtools::install_github('randef1ned/word2vec').

Answer 2 · 2023-07-05T08:16:44.000Z

Your code basically no longer works if one uses word2vec and provides the path to a file.
The objective of this issue was to stop working with files altogether: pass the text directly from R to C++ instead of writing the text from R to a file and reading that file into C++ afterwards.

Answer 3 · 2023-07-05T11:48:36.000Z

That literally cannot happen.
Original code: the word2vec() function identifies input argument x as a file if its length is 1.
Updated code: the word2vec() function identifies input argument x as a file if it has a length of 1 and is a valid file.

Yes, the ultimate goal was to pass the variable directly to the C++ function; however, that couldn't be done without changing the C++ code.
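
To make the difference between the two rules concrete, here is a minimal sketch. It is written in C++ purely for illustration; the real check lives in the R wrapper, and the function names below are hypothetical, not taken from the package or the pull request.

```cpp
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

// Original rule (as described above): any input of length 1 is
// assumed to be a path to a training file.
bool is_file_input_original(const std::vector<std::string>& x) {
    return x.size() == 1;
}

// Updated rule (as described above): the input is treated as a file
// only if it has length 1 AND that element names an existing regular file.
bool is_file_input_updated(const std::vector<std::string>& x) {
    if (x.size() != 1) {
        return false;
    }
    std::error_code ec;  // use the non-throwing overload
    return std::filesystem::is_regular_file(x[0], ec);
}
```

Under the original rule a single sentence would be mistaken for a file path; under the updated rule it is treated as text unless a file with that exact name exists.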

Answer 4 · 2023-07-05T21:35:02.000Z

Yes, that is the objective of this issue: change the C++ code so that, instead of writing the text from R to a file and reading the file into C++ afterwards, we pass the texts directly to C++. That does indeed require changes to the C++ code. Are you up to the challenge?

Answer 5 · 2023-07-08T02:56:56.000Z

I skimmed through the C++ code inside this package and found that it actually uses file pointers (or something like that) to improve performance. I doubt that passing the texts directly to the main function would be as efficient.

My schedule is busy: I have been working at least 12 hours a day recently. If I find some spare time, I will look into it and try my best to tackle this challenge.

Answer 6 · 2023-10-05T14:35:06.000Z

Since word2vec 0.4.0, you can build a word2vec model from a list of tokenised sentences. Examples are in the README and in the documentation (?word2vec).