word2vec-api

Simple web service providing a word embedding API. The methods are based on the Gensim Word2Vec implementation. The model is passed as a parameter at launch and must be in the Word2Vec text or binary format.
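For orientation, the Word2Vec text format starts with a header line giving the vocabulary size and the vector dimensionality, followed by one line per word with its space-separated components. The sketch below parses that layout using only the standard library. `parse_word2vec_text` is a hypothetical helper written for illustration; it is not part of this service or of Gensim, which provides its own loader.

```python
import io


def parse_word2vec_text(stream):
    """Parse '<vocab_size> <dim>' followed by '<word> <v1> ... <vdim>' lines."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in stream:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        word = parts[0]
        values = [float(x) for x in parts[1:] if x]
        if len(values) != dim:
            raise ValueError(f"{word!r}: expected {dim} values, got {len(values)}")
        vectors[word] = values

    if len(vectors) != vocab_size:
        raise ValueError(f"header declares {vocab_size} words, found {len(vectors)}")
    return vectors


# Tiny in-memory sample standing in for a real model file.
sample = io.StringIO("2 3\nsushi 0.1 0.2 0.3\nramen 0.4 0.5 0.6\n")
vectors = parse_word2vec_text(sample)
print(vectors["sushi"])
```

A quick check like this can help confirm that a file really has the header line before passing it to the service.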

  • Launching the service
python word2vec-api --model path/to/the/model [--host host --port 1234]
  • Example calls
curl "http://127.0.0.1:5000/word2vec/n_similarity?ws1=Sushi&ws1=Shop&ws2=Japanese&ws2=Restaurant"
curl "http://127.0.0.1:5000/word2vec/similarity?w1=Sushi&w2=Japanese"
curl "http://127.0.0.1:5000/word2vec/most_similar?positive=indian&positive=food"
curl "http://127.0.0.1:5000/word2vec/model?word=restaurant"

Note: The URLs are quoted so that the shell does not interpret "&". The "most_similar" method also accepts optional "negative" and "topn" parameters. The "model" method returns a base64 encoding of the Word2Vec vector.
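The example calls can also be built from Python. The following is a minimal sketch, not the service's own client: `build_url` and `decode_vector` are hypothetical helper names, and the decoding step assumes the base64 payload holds raw little-endian float32 values, which the text above does not confirm.

```python
import base64
import struct
from urllib.parse import urlencode

# Default host and port taken from the example calls.
BASE_URL = "http://127.0.0.1:5000/word2vec"


def build_url(method, params):
    """Build a request URL; list values become repeated query parameters."""
    return f"{BASE_URL}/{method}?{urlencode(params, doseq=True)}"


def decode_vector(b64_text):
    """Decode base64 text into floats, ASSUMING little-endian float32 values."""
    raw = base64.b64decode(b64_text)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))


url = build_url(
    "n_similarity",
    {"ws1": ["Sushi", "Shop"], "ws2": ["Japanese", "Restaurant"]},
)
print(url)

# Round trip on locally packed data, standing in for a real "model" response.
encoded = base64.b64encode(struct.pack("<3f", 0.5, -1.0, 2.0)).decode("ascii")
print(decode_vector(encoded))
```

To send a request, the resulting URL can be passed to `urllib.request.urlopen` while the service is running. How the base64 string is wrapped in the response body is not specified here, so inspect an actual response before decoding it.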

Where to get a pretrained model

If you do not have domain-specific data to train on, a pretrained model can be a convenient alternative. Additions to this list are welcome through a pull request.

| Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture | Training algorithm | Context window (size) | Web page |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google News | 300 | Google News (100B) | 3M | Google | word2vec | negative sampling | BoW, ~5 | link |
| Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | link |
| Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | link |
| Wikipedia+Gigaword 5 | 50 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 100 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 200 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Common Crawl 42B | 300 | Common Crawl (42B) | ~2M | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 25 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 50 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 100 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 200 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified | word2vec | syntactic dependencies | link |
| DBPedia vectors | 1000 | Wikipedia (?) | ? | wiki2vec | word2vec | word2vec, skip-gram | BoW, 10 | link |
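The GloVe vectors are distributed as plain text without the `<vocab_size> <dim>` header line that the Word2Vec text format begins with, so they generally need converting before they match the format this service expects. The sketch below shows the idea with the standard library only. `add_word2vec_header` is a hypothetical helper, and it assumes every row has the same number of components as the first one.

```python
def add_word2vec_header(glove_lines):
    """Return Word2Vec text-format lines: a '<vocab_size> <dim>' header plus the rows."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # The first field of each row is the word; the rest are vector components.
    dim = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dim}"] + rows


# Two-word sample standing in for a real GloVe file.
glove_sample = ["the 0.1 0.2\n", "of 0.3 0.4\n"]
converted = add_word2vec_header(glove_sample)
print("\n".join(converted))
```

This version holds all rows in memory, which is fine for a demonstration but may be impractical for the larger files; a two-pass streaming variant, or a conversion utility from Gensim, would be a better fit there.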