bleqdyce/word2vec

Patch for distance.c: minor off-by-one error

GoogleCodeExporter opened this issue · 0 comments

This is a minor nitpick, but when populating the vocab array, distance.c 
only begins skipping characters after index max_w (having read 51 characters), 
when it should stop advancing after index max_w - 1. Consequently, the string 
terminator for a long string is written into the space reserved for the next 
string, where it is overwritten when that string is read in, so the two 
strings are mashed together.
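
To make the failure mode concrete, here is a minimal sketch of the slot-filling logic. It is not the code from distance.c: it reads from an in-memory string instead of a file, and the names fill_vocab, entry_length, MAX_W and the limit parameter are hypothetical, chosen only for illustration. The slot width is shrunk to 8 so the effect is easy to see. The caller is assumed to allocate one spare byte beyond words * MAX_W so that the overflowing write stays in bounds even for the last entry.

```c
#include <stddef.h>
#include <string.h>

#define MAX_W 8  /* hypothetical small slot width; the issue mentions 50 */

/* Copies space-separated words from `input` into fixed-width slots of
 * MAX_W bytes each. `limit` is the largest value the in-slot index `a`
 * may reach:
 *   limit == MAX_W      reproduces the behavior described in the issue
 *   limit == MAX_W - 1  keeps the terminator inside the word's own slot
 * `vocab` must hold at least words * MAX_W + 1 bytes (assumption of
 * this sketch, so the off-by-one write never leaves the buffer). */
void fill_vocab(const char *input, char *vocab, int words, int limit) {
    const char *p = input;
    for (int b = 0; b < words; b++) {
        int a = 0;
        while (*p != '\0' && *p != ' ') {
            vocab[b * MAX_W + a] = *p++;
            /* Once `a` hits the limit, later characters keep
             * overwriting the same byte, i.e. they are skipped. */
            if (a < limit) a++;
        }
        if (*p == ' ') p++;            /* consume the separator */
        vocab[b * MAX_W + a] = '\0';   /* lands in the next slot if a == MAX_W */
    }
}

/* Length of entry `b` as a reader of the table would see it. */
size_t entry_length(const char *vocab, int b) {
    return strlen(vocab + (size_t)b * MAX_W);
}
```

Calling fill_vocab("abcdefghijkl xyz", vocab, 2, MAX_W) leaves the first entry reading as "abcdefghxyz", because the terminator written at offset MAX_W is replaced by the first character of the second word. With a limit of MAX_W - 1 the first entry is truncated to "abcdefg" and the second entry remains a separate "xyz".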

For example, when searching for Cash_Flow in the current (as of 2015-06-15) 
GoogleNews-vectors-negative300.bin, two results overflow the printf format 
field, which is padded for strings up to length 50. Indeed, these two strings 
do not appear in the vocabulary; they are constructed when two vocabulary 
entries (a long one followed by a normal one) are mashed together as described 
above. After applying the attached patch, the printf formatting looks fine, as 
only the first 50 characters of the long entries are printed.
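
The reason a mashed entry disturbs the output can be sketched as follows, assuming the results are printed with a width-only conversion such as %50s (an assumption about the format string, not a quote from distance.c). A field width in printf is a minimum: shorter strings are padded, but longer strings are printed in full. The helper name format_padded is hypothetical.

```c
#include <stdio.h>

/* Formats `word` right-aligned in a field of at least `width`
 * characters and returns the number of characters produced.
 * The width only pads; it never truncates a longer string. */
int format_padded(char *out, size_t out_size, const char *word, int width) {
    return snprintf(out, out_size, "%*s", width, word);
}
```

With a width of 5, the word "abc" is padded to five characters, while "abcdefgh" is emitted as all eight characters and pushes the following columns out of alignment. This is why entries longer than the slot width show up as misaligned rows.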

Original issue reported on code.google.com by daniel.j...@gmail.com on 17 Jun 2015 at 12:24
