bmschmidt/wordVectors

R Session Aborted

Closed this issue · 6 comments

When I run train_word2vec() R crashes immediately. The file to be imported is 200 novels run through prep_word2vec() which results in a 120 MB .txt file. I've tried on Mac 10.9 and 10.10 as well as R 3.1 and 3.2. Same result in all cases. I'm guessing you've run this on much larger data. Any ideas?

Not sure if this will help, but it may be worth training with the original C implementation (https://code.google.com/p/word2vec/) and then loading the trained model in R. There were a few improvements and bug fixes to word2vec along the way.

Well that was interesting. It turns out it was something to do with the filename -- too long?!? Renaming made it function fine. Oh, R, I love you.

Ah! I think it may be related to a limit in word2vec C code (that R wraps)

You're likely to get a "__fortify_fail+0x37" or buffer overflow error if the train or output file name is longer than 100 characters (I can't remember whether that includes the absolute path).

Also, 100+ character long words in your vocab will get truncated.

This limit is set here: https://github.com/bmschmidt/wordVectors/blob/master/src/word2vec.h#L22 in case you need to change it.

There's a limit case I wouldn't have found. Thanks for the issue, and for figuring it out.

I've just pushed some updates that include upping this limit to 1024 characters, which seems long enough. (I'll post elsewhere, but they also up the number of iterations and include some other fixes that may make models better).

I have the same problem: the R session aborts abnormally when I execute the "train_word2vec" function.
I also tried shortening the file and path names, and tried the cookbook data set, but the same error occurred.

I also updated the limit to 1024 characters here: (https://github.com/bmschmidt/wordVectors/blob/master/src/word2vec.h#L22)
and executed the word2vec.R file (instead of loading library(wordvectors) and executing the "train_word2vec" function). That produces a different error:

Error in .C("CWrapper_word2vec", train_file = as.character(train_file), :
C symbol name "CWrapper_word2vec" not in load table

Do you have any ideas?
Please help!

This is a different error from the one in the parent issue. File names up to 1024 characters are supported now, so the change you're making will, at the least, use a lot more memory, and it shouldn't have any beneficial effect.

If you post the error you get with the unmodified code, I can take a look. What operating system are you on?