wlin12/wang2vec

word2vec -negative-classes Segmentation fault


I'm trying to train a model using part-of-speech tags as word classes. When I supply even a very small word-class file (~1000 lines), word2vec crashes with a segmentation fault. The same setup (a training file of 100 lines) without the -negative-classes argument finishes just fine.
Can anybody suggest how to debug this?
txt100.txt
nc100.txt
Exact command: ./word2vec -train txt100.txt -output model10.txt -hs 0 -size 20 -window 3 -type 3 -threads 1 -negative-classes nc100.txt
P.S. text data and pos tags are taken from the Brown corpus.

I'm also getting a Segmentation fault. I'm trying to train cwindow vectors with a file of size 10^9 bytes (the English wikipedia dump).

./word2vec -train /path/to/trainingfile -output /save/path/file -type 2 -size 20 -window 1 -negative 0 -nce 10 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 15 -cap 1

Starting training using file /path/to/trainingfile
Vocab size: 218317
Words in train file: 123353508
Segmentation fault

I've also tried using -cap 0, 4 threads, and -negative rather than -nce.

I have the same problem as @felicialiu.

My backtrace is

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7578700 (LWP 10260)]
0x0000000000409418 in TrainModelThread () at word2vec.c:928
928 for (c = 0; c < layer1_size; c++) syn1[c + l2 + window_offset] += g * syn0[c + l1];
(gdb) bt
#0 0x0000000000409418 in TrainModelThread () at word2vec.c:928
#1 0x00007ffff7943aa1 in start_thread (arg=0x7fffd7578700) at pthread_create.c:301
#2 0x00007ffff7690aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Hi @jetsnguns

I looked at nc100.txt, and the problem seems to be that some words are the only member of their class, like:

BE be

So if we wish to predict the word be, there are no other negative samples to choose from. I would just remove these lines.

Wang Ling
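
The cleanup suggested above could be sketched in Python as follows. This is only a rough sketch, not part of wang2vec: it assumes each line of the class file has the form "CLASS word" (as in the "BE be" example) and that the lines to drop are those whose class contains just one distinct word. The function names drop_singleton_classes and filter_class_file are made up for illustration.

```python
def drop_singleton_classes(lines):
    """Keep only "CLASS word" lines whose class has at least two distinct words.

    Assumption: every useful line holds exactly two whitespace-separated
    fields, the class label first and the word second. Other lines are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))

    # Collect the distinct words seen for each class label.
    words_per_class = {}
    for cls, word in pairs:
        words_per_class.setdefault(cls, set()).add(word)

    # A class with a single word offers no other negative samples, so drop it.
    return [f"{cls} {word}" for cls, word in pairs if len(words_per_class[cls]) > 1]


def filter_class_file(src_path, dst_path):
    """Read a class file, drop singleton classes, write the result to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        kept = drop_singleton_classes(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in kept:
            dst.write(line + "\n")
```

For example, filter_class_file("nc100.txt", "nc100.filtered.txt") would write a copy of the class file without the single-word classes (the output file name is arbitrary), which can then be passed to -negative-classes.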

Hi @papower1 ,

Thanks for sending me the backtrace. There does seem to be a bug on that line: I was using the wrong set of parameters in the update for hierarchical softmax. I corrected it and checked that it works on text8, so can you pull and try again?

Wang Ling

Hi @felicialiu,

I tried running your command with text8, and it seems to work.
Do you get to the point where it prints the progress info? If not, the program might be running out of memory for the parameters.

Wang Ling

@wlin12 By the way, when I disable 'hs', it works fine. Hope that helps.