This repository contains my informal, intuition-driven tests of several pretrained word vector models. A more detailed test can be found here. All test commentary lives in the code, which is largely self-explanatory. I use the gensim library. You can also find a torch7 implementation for reading binary word vector files here.
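For readers curious about what "reading binary word vector files" involves, here is a minimal pure-Python sketch of a reader for the word2vec binary layout: an ASCII header with the vocabulary size and dimension count, then each word followed by a space and its raw 32-bit floats. The function name `read_word2vec_binary` and the sample data are made up for illustration, and the sketch assumes little-endian floats. It is not the code used in this repository.

```python
import io
import struct
from typing import BinaryIO, Dict, List


def read_word2vec_binary(stream: BinaryIO) -> Dict[str, List[float]]:
    """Parse word vectors stored in the word2vec binary layout (hypothetical helper)."""
    # Header line: "<vocab_size> <dimensions>\n" in ASCII.
    header = stream.readline().split()
    vocab_size, dims = int(header[0]), int(header[1])

    vectors: Dict[str, List[float]] = {}
    for _ in range(vocab_size):
        # Each word is a run of bytes terminated by a single space.
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("file ended in the middle of a word")
            if ch == b" ":
                break
            if ch != b"\n":  # ignore the newline written after the previous vector
                word.extend(ch)

        # The vector is `dims` float32 values (assumed little-endian here).
        raw = stream.read(4 * dims)
        if len(raw) != 4 * dims:
            raise EOFError("file ended in the middle of a vector")
        vectors[word.decode("utf-8", errors="replace")] = list(
            struct.unpack(f"<{dims}f", raw)
        )
    return vectors


# Tiny in-memory file with two made-up 3-dimensional vectors.
sample = io.BytesIO(
    b"2 3\n"
    + b"king " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n"
    + b"queen " + struct.pack("<3f", 0.5, -1.0, 0.25) + b"\n"
)
vectors = read_word2vec_binary(sample)
print(vectors["queen"])  # prints [0.5, -1.0, 0.25]
```

In practice gensim handles this format directly through `KeyedVectors.load_word2vec_format(path, binary=True)`, which is the more convenient route for the large files listed below.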
The Google News model does well on the similarity test, but it does poorly on several intrinsic analogy tests, such as past tense and superlative adjectives.
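To make the two kinds of test concrete, here is a small sketch of how a similarity score and an analogy query are commonly computed: cosine similarity between two vectors, and a nearest-neighbour search around `b - a + c` for "a is to b as c is to ?". The toy vectors are invented so the example runs without any model file, and the helper names are hypothetical. This is a simplified variant (no vector normalisation before the arithmetic), not necessarily the exact procedure in the test code.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors: Dict[str, List[float]], a: str, b: str, c: str) -> str:
    """Answer 'a is to b as c is to ?' with the word closest to b - a + c."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Made-up 3-dimensional vectors; the second axis loosely plays the role of "past tense".
toy = {
    "walk": [1.0, 0.0, 0.0],
    "walked": [1.0, 1.0, 0.0],
    "jump": [0.0, 0.0, 1.0],
    "jumped": [0.0, 1.0, 1.0],
    "run": [0.5, 0.0, 0.5],
}

print(round(cosine(toy["walk"], toy["walked"]), 4))  # prints 0.7071
print(analogy(toy, "walk", "walked", "jump"))        # prints jumped
```

With a loaded gensim model, the corresponding calls are `similarity(w1, w2)` and `most_similar(positive=[b, c], negative=[a])`.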
If you do not have domain-specific data to train on, a pretrained model can be a convenient alternative. Please feel free to submit additions to this list through a pull request. The table below was mainly compiled by 3TOP.
Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture | Training Algorithm | Context window - size | Web page |
---|---|---|---|---|---|---|---|---|
Google News | 300 | Google News (100B) | 3M | Google | word2vec | negative sampling | BoW - ~5 | link |
Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW - ~10 | link |
Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW - ~10 | link |
Wikipedia+Gigaword 5 | 50 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
Wikipedia+Gigaword 5 | 100 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
Wikipedia+Gigaword 5 | 200 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
Wikipedia+Gigaword 5 | 300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
Common Crawl 42B | 300 | Common Crawl (42B) | 1.9M | GloVe | GloVe | AdaGrad | ? | link |
Common Crawl 840B | 300 | Common Crawl (840B) | 2.2M | GloVe | GloVe | AdaGrad | ? | link |
Twitter (2B Tweets) | 25 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
Twitter (2B Tweets) | 50 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
Twitter (2B Tweets) | 100 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
Twitter (2B Tweets) | 200 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified | word2vec | syntactic dependencies | link |
DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec | word2vec, skip-gram | BoW, 10 | link |
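
The GloVe models in the table are distributed as plain text rather than in the word2vec binary format: one word per line, followed by its vector components separated by spaces, with no header line. As a rough sketch, such a file could be parsed as below. The function name `read_glove_text` and the sample lines are invented for illustration, and the caller is assumed to know the dimension count from the table.

```python
import io
from typing import Dict, Iterable, List


def read_glove_text(lines: Iterable[str], dims: int) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines of the form '<word> <v1> ... <v_dims>' (hypothetical helper)."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) <= dims:
            continue  # skip blank or truncated lines
        # Take the last `dims` fields as the vector, so a token that
        # itself contains spaces is still kept in one piece.
        word = " ".join(parts[:-dims])
        vectors[word] = [float(x) for x in parts[-dims:]]
    return vectors


# Two made-up 3-dimensional entries standing in for a real file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 1.5\n")
glove = read_glove_text(sample, dims=3)
print(glove["cat"])  # prints [-0.5, 0.0, 1.5]
```

For a real file, pass an open text handle instead of the `StringIO` object, for example `open(path, encoding="utf-8")`, where `path` is wherever you saved the download.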