Train a gensim word2vec model on Wikipedia.

Primary language: Python. License: MIT.

Wiki Word2vec

Most of the code is taken from this blogpost and this discussion. This repository was created mostly to try out make; see The gist for the important stuff. Note that performance depends heavily on corpus size and on the chosen parameters, especially for smaller corpora. The examples and parameters below are cherry-picked.

Usage

Look up the Wikipedia code for your language (see here).

Run make with the code as the value for LANGUAGE (or change the Makefile). For instance, try Swahili (sw):

make LANGUAGE=sw

The gist

Ignore make and execute the following bash commands for Swahili:

mkdir -p data/sw/
wget -P data/sw/ https://dumps.wikimedia.org/swwiki/latest/swwiki-latest-pages-articles.xml.bz2
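
The same download step could also be done from Python. The following is a minimal sketch, assuming the URL pattern of the Swahili example above carries over to other language codes; the helper names dump_url, dump_path and download_dump are made up for illustration and are not part of this repository.

```python
import os
import urllib.request

# URL pattern copied from the wget command above; assumed to hold for other codes.
DUMP_URL = ("https://dumps.wikimedia.org/{lang}wiki/latest/"
            "{lang}wiki-latest-pages-articles.xml.bz2")


def dump_url(language):
    """Return the URL of the latest articles dump for a language code."""
    return DUMP_URL.format(lang=language)


def dump_path(language, data_dir="data"):
    """Return the local path used by the commands above, e.g. data/sw/<file>."""
    filename = dump_url(language).rsplit("/", 1)[-1]
    return os.path.join(data_dir, language, filename)


def download_dump(language, data_dir="data"):
    """Download the dump unless it is already on disk; return its path."""
    target = dump_path(language, data_dir)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if not os.path.exists(target):
        urllib.request.urlretrieve(dump_url(language), target)
    return target

# Usage (performs a real download, so it is left commented out):
# download_dump("sw")
```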

Train a model in Python:

import multiprocessing
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec

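# Written for gensim >= 4.0, where 'size' became 'vector_size' and
# WikiCorpus no longer accepts the 'lemmatize' argument.
# dictionary={} skips the vocabulary scan that WikiCorpus otherwise runs.
wiki = WikiCorpus('data/sw/swwiki-latest-pages-articles.xml.bz2', dictionary={})
# Read every article into memory as a list of tokens (fine for small wikis).
sentences = list(wiki.get_texts())
params = {'vector_size': 200, 'window': 10, 'min_count': 10,
          'workers': max(1, multiprocessing.cpu_count() - 1), 'sample': 1E-3}
word2vec = Word2Vec(sentences, **params)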
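
Calling list() on the texts keeps the whole corpus in memory, which may not work for larger wikis. Word2Vec iterates over its input several times, so a one-shot generator cannot be passed directly. One possible workaround is a small wrapper that restarts the generator on every pass. The sketch below uses a hypothetical class name, RestartableTexts, and a made-up stand-in corpus so that it runs without gensim.

```python
class RestartableTexts:
    """Iterable that calls `make_iterator` afresh on every pass over the data."""

    def __init__(self, make_iterator):
        # make_iterator is a zero-argument callable returning a fresh iterator.
        self.make_iterator = make_iterator

    def __iter__(self):
        return iter(self.make_iterator())


def fake_texts():
    """Stand-in for wiki.get_texts: yields one token list per article."""
    yield ["mfalme", "malkia"]
    yield ["gari", "treni"]


texts = RestartableTexts(fake_texts)
first_pass = list(texts)
second_pass = list(texts)  # works again, unlike a plain generator

# With gensim (assumption, not taken from this repository):
# word2vec = Word2Vec(RestartableTexts(wiki.get_texts), **params)
```

The trade-off is speed: the dump is parsed again on every pass instead of once.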

Example 1

Try the classic man : king :: woman : ? analogy:

# Similarity queries go through the model's .wv (KeyedVectors) attribute.
female_king = word2vec.wv.most_similar_cosmul(positive='mfalme mwanamke'.split(),
                                              negative='mtu'.split(), topn=5)
for ii, (word, score) in enumerate(female_king, start=1):
    print("{}. {} ({:1.2f})".format(ii, word, score))

1. malkia (0.97)
2. kambisi (0.93)
3. suleimani (0.93)
4. karolo (0.92)
5. koreshi (0.92)

These are, respectively, queen (jackpot!), Cambyses II (a Persian king), Solomon (king of Israel), Karolo Mkuu? (Charlemagne?) and Cyrus (a Persian king).
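
To get a feel for what most_similar_cosmul computes, here is a rough pure-Python sketch of the multiplicative combination objective (3CosMul): cosine similarities are shifted into [0, 1], the positive ones are multiplied together and divided by the product of the negative ones. The toy vectors and the function names are invented for illustration; this is not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosmul(vectors, positive, negative, topn=5, eps=1e-6):
    """Rank words by prod(shifted positive sims) / (prod(shifted negative sims) + eps)."""
    queried = set(positive) | set(negative)
    scores = []
    for word, vec in vectors.items():
        if word in queried:
            continue  # never return the query words themselves
        numerator = 1.0
        for p in positive:
            numerator *= (1 + cosine(vec, vectors[p])) / 2
        denominator = 1.0
        for n in negative:
            denominator *= (1 + cosine(vec, vectors[n])) / 2
        scores.append((word, numerator / (denominator + eps)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors with axes (royal, male, female, fruit).
toy = {
    "king": [1, 1, 0, 0],
    "queen": [1, 0, 1, 0],
    "man": [0, 1, 0, 0],
    "woman": [0, 0, 1, 0],
    "apple": [0, 0, 0, 1],
}
ranking = cosmul(toy, positive=["king", "woman"], negative=["man"])
print(ranking[0][0])  # queen
```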

Example 2

What doesn't match: car, train or breakfast?

print(word2vec.wv.doesnt_match('gari treni mlo'.split()))

mlo
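
The idea behind this kind of query can be sketched in a few lines: normalize each word vector, average them, and report the word whose vector is least similar to that average. The version below is a simplified pure-Python illustration with made-up two-dimensional vectors and a hypothetical function name, odd_one_out; it is not gensim's code.

```python
import math


def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(units[words[0]])
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dim)])

    def closeness(word):
        # Dot product of unit vectors equals their cosine similarity.
        return sum(a * b for a, b in zip(units[word], mean))

    return min(words, key=closeness)


# Made-up vectors: first axis "vehicle-like", second axis "food-like".
toy = {"car": [1.0, 0.1], "train": [0.9, 0.2], "breakfast": [0.1, 1.0]}
print(odd_one_out(toy, ["car", "train", "breakfast"]))  # breakfast
```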

Dependencies

  • Python 3
  • pip install gensim