Code for the first assignment of the Natural Language Processing class at UFMG. The assignment was to create and test various word2vec models, varying parameters such as corpus size, training algorithm (skip-gram or CBOW), and context window size.
In this assignment, we used Matt Mahoney's text8 corpus to train our models. To evaluate them, we used Google's questions-words.txt.
First, clone this repository. You'll then need Gensim, NLTK, and Matplotlib, which you can install with pip3 from a terminal:
pip3 install nltk
pip3 install gensim
pip3 install matplotlib
You may also need to run this code snippet if this is the first time you use the NLTK library:
import nltk
nltk.download('punkt')
Open a terminal and run the command:
python3 main.py
For every result, we generate 3 graphs, like the examples below:
Similarity Boxplot:
Similarity Error Boxplot:
Similarity Scatterplot:
Note that every graph's name is built by joining the following parts, in this order:
corpus file + "-w2v-" + size (dimensionality of the word vectors) + window (maximum distance between the current and predicted word within a sentence) + min_count (words with a total frequency lower than this are ignored) + workers (number of threads) + iter (number of epochs) + sg (1 = skip-gram; 0 = CBOW)
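As a rough illustration of the naming scheme above, here is a minimal sketch of how such a name could be assembled. The function name `graph_name` and the `sep` argument are hypothetical and are not taken from main.py. The description above does not say whether anything separates the numeric fields, so the "-" separator is an assumption made for readability; pass `sep=""` for plain concatenation.

```python
def graph_name(corpus_file, size, window, min_count, workers, iter_, sg, sep="-"):
    """Build a graph name in the order: size, window, min_count, workers, iter, sg.

    `sep` is an assumption: the separator between the parameter fields is not
    documented, so it is exposed as an argument instead of being hard-coded.
    """
    if sg not in (0, 1):
        raise ValueError("sg must be 1 (skip-gram) or 0 (CBOW)")
    params = [size, window, min_count, workers, iter_, sg]
    return corpus_file + "-w2v-" + sep.join(str(p) for p in params)


# Hypothetical parameter values, chosen only for illustration.
name = graph_name("text8", size=100, window=5, min_count=5, workers=4, iter_=5, sg=1)
print(name)
```

Keeping the field order fixed makes it possible to read a model's configuration back from the file name when comparing graphs side by side.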
Each run also produces one log file that keeps statistics of hits and misses. Unfortunately, I have not come up with a parsing solution to process this data yet, but it is not hard to do by hand.
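The log format is not described here, so the sketch below does not parse the log. Instead, it shows one way the same hit/miss statistics could be tallied per section directly from questions-words.txt, whose section headers start with ":" and whose question lines hold four words each. The names `read_analogies`, `tally`, and `predict` are hypothetical, and `predict` stands in for whatever call asks a trained model for the fourth word.

```python
from collections import defaultdict


def read_analogies(lines):
    """Yield (section, a, b, c, expected) tuples from questions-words.txt style lines."""
    section = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            # A header such as ": family" opens a new section.
            section = line[1:].strip()
            continue
        words = line.split()
        if len(words) == 4:
            yield (section, *words)


def tally(lines, predict):
    """Count hits and misses per section.

    `predict(a, b, c)` is a hypothetical callable that returns the model's
    guess for the fourth word; the comparison is an exact string match.
    """
    stats = defaultdict(lambda: {"hits": 0, "misses": 0})
    for section, a, b, c, expected in read_analogies(lines):
        key = "hits" if predict(a, b, c) == expected else "misses"
        stats[section][key] += 1
    return dict(stats)


sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    "Athens Greece Bangkok Thailand",
    ": family",
    "boy girl brother sister",
]

# Toy stand-in for a trained model: a lookup table that knows only two answers.
toy_answers = {"Baghdad": "Iraq", "brother": "sister"}
result = tally(sample, lambda a, b, c: toy_answers.get(c))
print(result)
```

With a real model, `predict` would wrap the model's own analogy query, and `lines` would be the opened questions-words.txt file.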