Python K-Means Cluster of Word2Vec

Basic Usage

Setup

You download Word2Vec model file such as Google Code word2vec.
In this document, We use GoogleNews-vectors-negative300.bin.gz.

Train

$ python3 w2vcluster/w2vcluster.py GoogleNews-vectors-negative300.bin -k 500 -o model1000.pkl

Predict

You can use command line interface.

$ python3 w2vcluster/w2vcluster.py GoogleNews-vectors-negative300.bin -p model500.pkl -w apple Apple banana Google
176 118 176 118

These integer values indicte cluster id of each words.
Also you can use python interface.

$ python3
>>> from gensim.models.word2vec import Word2Vec
>>> from sklearn.externals import joblib
>>> model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
>>> classifier = joblib.load('model500.pkl')
>>> classifier.predict([model[w] for w in ['apple', 'Apple', 'banana', 'Google']])
array([176, 118, 176, 118], dtype=int32)

How to Use