word-sense-modeling

This repository contains code related to my master's thesis. Supplementary documents and code will be added in the future.

Annotate words with a parallel corpus (Python code)

More information will be added here.

Convert to word2vec (C code)

This tool converts text-based multi-sense word embeddings to the binary word2vec file format. The output file can be used with gensim or other word2vec toolkits. The program has been tested on multi-sense vectors produced by this code and this paper by Neelakantan et al. (2014).
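For orientation, the binary word2vec layout that the converter targets can be sketched in a few lines of Python. This only illustrates the file format and is not the C tool itself. The function names and the `bank_0`-style keys for individual senses are assumptions made for this example; the converter may name its output entries differently.

```python
import struct


def write_word2vec_binary(path, vectors):
    """Write {key: [floats]} in the binary word2vec layout.

    Layout: a text header "<vocab_size> <dim>\n", then for each entry the
    key, one space, <dim> float32 values, and a trailing newline.
    Little-endian floats are assumed here (hypothetical helper).
    """
    keys = list(vectors)
    dim = len(vectors[keys[0]])
    with open(path, "wb") as f:
        f.write(f"{len(keys)} {dim}\n".encode("utf-8"))
        for key in keys:
            vec = vectors[key]
            if len(vec) != dim:
                raise ValueError(f"{key!r}: expected {dim} values, got {len(vec)}")
            f.write(key.encode("utf-8") + b" ")
            f.write(struct.pack(f"<{dim}f", *vec))
            f.write(b"\n")


def read_word2vec_binary(path):
    """Read a file written by write_word2vec_binary back into a dict."""
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            key = bytearray()
            while True:
                ch = f.read(1)
                if ch == b"":
                    raise EOFError("file ended inside a key")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline left over from the previous entry
                    key.extend(ch)
            raw = f.read(4 * dim)  # vectors are read by length, not by delimiter
            vectors[key.decode("utf-8")] = list(struct.unpack(f"<{dim}f", raw))
    return vectors
```

A file in this layout should also load in gensim with `KeyedVectors.load_word2vec_format(path, binary=True)`.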

The input text file can be in one of two formats:

Default format:

<Total Number of words> <Dimensionality> <Number of senses per word> <Does it have the max number of senses per word? (1 or 0)>
<word> <no. of senses>
<global context vector>
<first sense vector>
<second sense vector>
...

For example, the text file can be:

2 3 2 1
bank 3
0.2 0.1 0.3
0.1 0.2 0.2
0.3 0.3 0.3
0.1 0.1 0.1
lemon 1
0.4 0.5 0.1
0.1 0.6 0.2
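To make the default layout concrete, here is a minimal Python sketch that reads it. It is not the C converter, only an illustration of how the fields line up. The names `parse_default_format` and `_to_vector` and the returned dictionary shape are assumptions made for this example. The sketch uses the per-word sense count to decide how many vectors follow and simply stores the two extra header fields.

```python
EXAMPLE = """\
2 3 2 1
bank 3
0.2 0.1 0.3
0.1 0.2 0.2
0.3 0.3 0.3
0.1 0.1 0.1
lemon 1
0.4 0.5 0.1
0.1 0.6 0.2
"""


def _to_vector(tokens, dim):
    """Convert one whitespace-split line into a list of floats of length dim."""
    if len(tokens) != dim:
        raise ValueError(f"expected {dim} values, got {len(tokens)}")
    return [float(t) for t in tokens]


def parse_default_format(lines):
    """Parse the default text format (hypothetical helper).

    Returns (meta, entries), where entries maps each word to
    {"global": [...], "senses": [[...], ...]}.
    """
    rows = iter(ln.split() for ln in lines if ln.strip())
    vocab_size, dim, senses_per_word, has_max = (int(x) for x in next(rows))
    meta = {
        "vocab_size": vocab_size,
        "dim": dim,
        "senses_per_word": senses_per_word,
        "has_max_senses": bool(has_max),
    }
    entries = {}
    for _ in range(vocab_size):
        word, count = next(rows)
        global_vec = _to_vector(next(rows), dim)
        # One sense vector per line, as many as the word line announces.
        senses = [_to_vector(next(rows), dim) for _ in range(int(count))]
        entries[word] = {"global": global_vec, "senses": senses}
    return meta, entries


meta, entries = parse_default_format(EXAMPLE.splitlines())
print(meta["vocab_size"], len(entries["bank"]["senses"]), entries["lemon"]["global"])
```

For the example file above, `bank` ends up with one global vector and three sense vectors, and `lemon` with one global vector and one sense vector.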

Second format:

<Total Number of words> <Dimensionality>
<word> <no. of senses>
<global context vector>
<first sense vector>
<first sense cluster center>
<second sense vector>
<second sense cluster center>
...
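The second layout can be read the same way, except that every sense contributes two lines (its vector and its cluster center). The sketch below is a rough Python illustration under that reading, not the C converter. The function name, the returned structure, and the small `SAMPLE` input are all invented for this example.

```python
# Made-up sample input: 1 word, 2 dimensions, 2 senses.
SAMPLE = """\
1 2
bank 2
0.1 0.2
0.3 0.4
0.5 0.6
0.7 0.8
0.9 1.0
"""


def parse_second_format(lines):
    """Parse the second text format (hypothetical helper).

    Returns a dict mapping each word to
    {"global": [...], "senses": [{"vector": [...], "center": [...]}, ...]}.
    """
    rows = iter(ln.split() for ln in lines if ln.strip())
    vocab_size, dim = (int(x) for x in next(rows))

    def read_vector():
        tokens = next(rows)
        if len(tokens) != dim:
            raise ValueError(f"expected {dim} values, got {len(tokens)}")
        return [float(t) for t in tokens]

    entries = {}
    for _ in range(vocab_size):
        word, count = next(rows)
        global_vec = read_vector()
        senses = []
        for _ in range(int(count)):
            vector = read_vector()  # sense vector comes first
            center = read_vector()  # then its cluster center
            senses.append({"vector": vector, "center": center})
        entries[word] = {"global": global_vec, "senses": senses}
    return entries


entries = parse_second_format(SAMPLE.splitlines())
print(entries["bank"]["senses"][1]["center"])
```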