
A toy test of Word2Vec

Primary language: Python. License: MIT.

Word Vector Test

This repository contains my toy tests of several pretrained word2vec models, aimed at building intuition about how they behave. A more detailed test can be found here. All commentary on the tests is in the code, which is self-explanatory. I use the gensim library. You can also find a Torch7 implementation for reading binary word vector files here.


The Google News model does well on the similarity test, but poorly on several intrinsic analogy tests, such as past tense and superlative adjectives.
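To make the two kinds of test concrete, here is a minimal pure-Python sketch of a similarity score and an analogy query. The toy vectors and the helper names `cosine` and `analogy` are invented for illustration and are not taken from this repository's code. On a real pretrained model, gensim's `KeyedVectors.most_similar(positive=..., negative=...)` serves the same purpose, though it may normalize vectors differently from this simplified version.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by ranking words against b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as analogy tests usually do.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hypothetical 3-dimensional toy vectors, made up for this example only.
# The second coordinate plays the role of a "past tense" direction.
TOY_VECTORS = {
    "walk":   [1.0, 0.0, 0.0],
    "walked": [1.0, 1.0, 0.0],
    "jump":   [0.0, 0.0, 1.0],
    "jumped": [0.0, 1.0, 1.0],
    "table":  [-1.0, 0.0, 0.5],
}

# Similarity test: how close are two words?
similarity = cosine(TOY_VECTORS["walk"], TOY_VECTORS["walked"])

# Analogy test: walk is to walked as jump is to ?
answer = analogy(TOY_VECTORS, "walk", "walked", "jump")
```

A model can score well on the first kind of query and still fail the second, because the analogy only works when the relation (here, past tense) corresponds to a roughly constant offset between vectors.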

Where to get a pretrained model

If you do not have domain-specific data to train on, it can be convenient to use a pretrained model. Please feel free to submit additions to this list through a pull request. The table was mainly compiled by 3TOP.

| Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture | Training algorithm | Context window (size) | Web page |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google News | 300 | Google News (100B) | 3M | Google | word2vec | negative sampling | BoW, ~5 | link |
| Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | link |
| Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | link |
| Wikipedia+Gigaword 5 | 50 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 100 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 200 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Common Crawl 42B | 300 | Common Crawl (42B) | 1.9M | GloVe | GloVe | AdaGrad | ? | link |
| Common Crawl 840B | 300 | Common Crawl (840B) | 2.2M | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 25 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 50 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 100 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 200 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified | word2vec | syntactic dependencies | link |
| DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec | word2vec, skip-gram | BoW, 10 | link |
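Pretrained vectors come in different file formats, so it helps to know what a loader has to handle. The sketch below parses the plain-text layout (one word per line followed by its vector components), which is, to my knowledge, how the GloVe files are distributed. The word2vec text format looks the same but starts with a "vocab_size dimensions" header line, which the sketch skips. The function name `parse_text_vectors` and the sample lines are hypothetical and not part of this repository.

```python
def parse_text_vectors(lines):
    """Parse 'word v1 v2 ... vN' lines into a dict of word -> list of floats.

    A leading 'vocab_size dimensions' header (word2vec text format) is skipped
    when present. The check is a heuristic: it treats a first line made of
    exactly two integers as a header.
    """
    vectors = {}
    for index, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if index == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # word2vec-style header, not a vector
        if len(parts) < 2:
            continue  # blank or malformed line
        word, values = parts[0], parts[1:]
        vectors[word] = [float(x) for x in values]
    return vectors


# Made-up sample in word2vec text format: header, then one word per line.
SAMPLE_LINES = [
    "2 3",
    "king 0.1 0.2 0.3",
    "queen 0.2 0.1 0.4",
]

vectors = parse_text_vectors(SAMPLE_LINES)
```

For a real file, an open file handle can be passed in place of the list, since the function only iterates over lines. Binary word2vec files need a different reader; with gensim, `KeyedVectors.load_word2vec_format(path, binary=True)` handles that case.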

Applications of Word2Vec