Word2vec

Code for the first assignment of the Natural Language Processing class at UFMG. The assignment was to train and evaluate several word2vec models while varying parameters such as the corpus size, the training algorithm (skip-gram or CBOW), and the context window.

Getting Started

Files Needed

In this assignment, we used Matt Mahoney's text8 corpus to train our models. To evaluate them, we used Google's questions-words.txt analogy test set.
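To get a feel for the evaluation file, here is a minimal sketch of how questions-words.txt can be read. In that file, a line starting with ":" opens a new section and every other line holds four words (a, b, c, and the expected answer). The function name `parse_analogy_questions` is hypothetical and is not part of main.py; the words are lowercased here because text8 contains only lowercase text.

```python
from collections import defaultdict


def parse_analogy_questions(lines):
    """Group analogy questions by section.

    Lines starting with ':' open a new section; every other non-empty
    line is expected to hold four whitespace-separated words.
    """
    sections = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            continue
        words = line.lower().split()
        if current is None or len(words) != 4:
            continue  # skip lines that do not fit the expected layout
        sections[current].append(tuple(words))
    return dict(sections)


# A short excerpt in the same layout as questions-words.txt.
sample = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: family
boy girl brother sister
"""
questions = parse_analogy_questions(sample.splitlines())
```

To read the real file, pass an open file handle instead of `sample.splitlines()`.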

Installing

First, clone this repository. Then you'll need Gensim, NLTK, and Matplotlib. You can install them with pip3 from a terminal:

pip3 install nltk
pip3 install gensim
pip3 install matplotlib

You may also need to run this code snippet if this is the first time you use the NLTK library:

import nltk
nltk.download('punkt')

Running the Code

Open a terminal and run the command:

python3 main.py

Results

For every result, we generate 3 graphs, like the examples below:

Similarity Boxplot:

Similarity Error Boxplot:

Similarity Scatterplot:

Note that every graph's name is structured as follows:

corpus file + "-w2v-" + size + window + min_count + workers + iter + sg

size: dimensionality of the word vectors
window: maximum distance between the current and predicted word within a sentence
min_count: ignores all words with total frequency lower than this
workers: number of threads
iter: number of epochs
sg: 1 = skip-gram; 0 = CBOW
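As an illustration of the naming scheme, the sketch below builds graph names for a small sweep over window size and algorithm. It is not taken from main.py: the helper `graph_name`, the `sep` separator between fields, and all parameter values are assumptions made for the example.

```python
from itertools import product


def graph_name(corpus, size, window, min_count, workers, iter_, sg, sep="-"):
    """Build a graph name: corpus + "-w2v-" + the six parameters in order.

    `sep` is an assumption; the separator between fields is not documented.
    """
    params = (size, window, min_count, workers, iter_, sg)
    return corpus + "-w2v-" + sep.join(str(p) for p in params)


# Hypothetical values, chosen only to illustrate the sweep.
windows = [2, 5]
algorithms = [0, 1]  # sg: 0 = CBOW, 1 = skip-gram

names = [
    graph_name("text8", size=100, window=w, min_count=5, workers=4, iter_=5, sg=sg)
    for w, sg in product(windows, algorithms)
]
```

With these values, the first name is `text8-w2v-100-2-5-4-5-0`, that is, a CBOW model with a window of 2.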

We also generate one log file that keeps statistics of hits and misses. Unfortunately, I have not come up with a parsing solution to process this data, but it's not hard to do by hand.
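Once the hit and miss counts have been copied out of the log by hand, turning them into accuracies is straightforward. The sketch below does not parse the log, since its layout is not described here; the function `summarize` and the counts are made up for illustration.

```python
def summarize(counts):
    """Return per-section accuracy and the overall accuracy.

    `counts` maps a section name to a (hits, misses) pair.
    """
    per_section = {}
    total_hits = total_misses = 0
    for section, (hits, misses) in counts.items():
        answered = hits + misses
        # Guard against empty sections to avoid dividing by zero.
        per_section[section] = hits / answered if answered else 0.0
        total_hits += hits
        total_misses += misses
    total = total_hits + total_misses
    overall = total_hits / total if total else 0.0
    return per_section, overall


# Made-up counts, for illustration only.
per_section, overall = summarize({"family": (30, 10), "capitals": (10, 30)})
```

Note that the overall accuracy is computed from the summed counts, so larger sections weigh more than smaller ones.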

brenomatos/word2vec