Word2Vec in Torch Yoon Kim yhk255@nyu.edu Only has the skip-gram architecture with negative sampling. See https://code.google.com/p/word2vec/ for more details. Note: This is considerably slower than the word2vec toolkit and gensim implementations. Input file is a text file where each line represents one sentence (see corpus.txt for an example) Arguments are mostly self-explanatory (see main.lua for default arguments) -corpus : text file with the corpus -window : max window size -dim : dimensionality of word embeddings -alpha : exponent to smooth out unigram distribution -table_size : unigram table size. if you have plenty of RAM, bring this up to 10^8 -neg_samples : number of negative samples for each valid word-context pair -minfreq : minimum frequency to be included in the vocab -lr : starting learning rate -min_lr : minimum learning rate--lr will linearly decay to this value -epochs : number of epochs to run -stream : whether to stream text data from HD or store in memory (1 = stream, 0 = not) -gpu : whether to use gpu (1 = use gpu, 0 = not) For example: CPU: th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 1 -gpu 0 GPU: th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 0 -gpu 1