/word2vecpy

Python implementation of CBOW and skip-gram word vector models, and hierarchical softmax and negative sampling learning algorithms

Primary LanguagePython

DISCLAIMER: This is a very old, rather slow, mostly untested, and completely unmaintained implementation of word2vec for an old course project (i.e., I do not respond to questions/issues). Feel free to fork/clone and modify, but use at your own risk!

A Python implementation of the Continuous Bag of Words (CBOW) and skip-gram neural network architectures, and the hierarchical softmax and negative sampling learning algorithms for efficient learning of word vectors (Mikolov, et al., 2013a, b, c; http://code.google.com/p/word2vec/).

Usage

To train word vectors:

word2vec.py [-h] -train FI -model FO [-cbow CBOW] [-negative NEG]
            [-dim DIM] [-alpha ALPHA] [-window WIN]
            [-min-count MIN_COUNT] [-processes NUM_PROCESSES]
            [-binary BINARY]

required arguments:
  -train FI                 Training file
  -model FO                 Output model file

optional arguments:
  -h, --help                show this help message and exit
  -cbow CBOW                1 for CBOW, 0 for skip-gram
  -negative NEG             Number of negative examples (>0) for negative sampling, 
                            0 for hierarchical softmax
  -dim DIM                  Dimensionality of word embeddings
  -alpha ALPHA              Starting learning rate
  -window WIN               Max window length
  -min-count MIN_COUNT      Min count for words used to learn <unk>
  -processes NUM_PROCESSES  Number of processes
  -binary BINARY            1 for output model in binary format, 0 otherwise

Each sentence in the training file is expected to be newline separated.

Implementation Details

Written in Python 2.7.6 and NumPy 1.9.1.

Evaluation

Accuracy (%) on the word analogy task compared against the original C implementation (in parentheses). Trained on a preprocessed version of the first 108 bytes of the English Wikipedia dump on March 3, 2006 (http://mattmahoney.net/dc/textdata.html).

Model Total Semantic Syntactic
CBOW HS 6.76 (6.90) 4.86 (3.61) 7.93 (8.93)
CBOW NS 4.52 (6.72) 3.94 (3.74) 4.88 (8.56)
Skip-gram HS 14.76 (14.59) 11.40 (10.40) 16.83 (17.18)
Skip-gram NS 8.43 (7.72) 4.91 (4.62) 10.62 (9.63)

References

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. http://arxiv.org/pdf/1301.3781.pdf

Mikolov, T., Yih, W., & Zweig, G. (2013c). Linguistic Regularities in Continuous Space Word Representations. HLT-NAACL. http://msr-waypoint.com/en-us/um/people/gzweig/Pubs/NAACL2013Regularities.pdf