In this repository, we create and evaluate multiple Word2vec models using Python's Gensim library.

Word2vec

Code for the first assignment of the Natural Language Processing class at UFMG. The assignment was to create and test various word2vec models, varying parameters such as corpus size, the training algorithm (skip-gram or CBOW), and the context window size.
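The parameter sweep described above could be organized roughly as in the sketch below. This is not the code in main.py: the function names, the corpus fractions, and the window values are hypothetical. The training call assumes the Gensim 3.x keyword names `size` and `iter`, which match the parameter names used in this README; Gensim 4 renamed them to `vector_size` and `epochs`.

```python
from itertools import product


def build_grid(corpus_fractions, windows, algorithms=(0, 1)):
    """Return one config dict per (corpus fraction, window, sg) combination."""
    return [
        {"fraction": fraction, "window": window, "sg": sg}
        for fraction, window, sg in product(corpus_fractions, windows, algorithms)
    ]


def train(sentences, config, size=100, min_count=5, workers=4, iterations=5):
    """Train one model for a config (assumes Gensim 3.x keyword names).

    config["fraction"] is not used here; the caller is expected to apply it
    when loading the corpus, before passing `sentences` in.
    """
    from gensim.models import Word2Vec  # imported lazily; only needed for training

    return Word2Vec(
        sentences,
        size=size,
        window=config["window"],
        min_count=min_count,
        workers=workers,
        iter=iterations,
        sg=config["sg"],  # 1 = skip-gram, 0 = CBOW
    )


# Hypothetical sweep: 2 corpus fractions x 3 windows x 2 algorithms.
grid = build_grid(corpus_fractions=(0.25, 1.0), windows=(2, 5, 10))
print(len(grid))  # 12 configurations
```

Keeping the grid separate from the training call makes it easy to inspect or shrink the sweep before spending time on training.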

Getting Started

Files Needed

In this assignment, we used Matt Mahoney's text8 corpus to train our models. To evaluate them, we used Google's questions-words.txt.
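For reference, questions-words.txt groups analogy questions into sections: a line starting with ":" names a section, and each following line holds four words (a, b, c, and the expected answer). The sketch below reads that layout; `parse_analogies` is a hypothetical helper for illustration and is not part of this repository.

```python
def parse_analogies(lines):
    """Group analogy questions by section.

    Section headers start with ':'; every other non-empty line
    holds four whitespace-separated words: a b c expected.
    """
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            words = line.split()
            if current is None or len(words) != 4:
                continue  # skip malformed lines
            sections[current].append(tuple(words))
    return sections


# A short sample in the file's layout.
sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    "Athens Greece Bangkok Thailand",
    ": family",
    "boy girl brother sister",
]
questions = parse_analogies(sample)
print({name: len(items) for name, items in questions.items()})
```

To read the real file, pass an open file object instead of the sample list, since the function only needs an iterable of lines.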

Installing

First, clone this repository. Then you'll need Gensim, NLTK, and Matplotlib. You can install them with pip3 from a terminal:

pip3 install nltk
pip3 install gensim
pip3 install matplotlib

You may also need to run this code snippet if this is the first time you are using the NLTK library:

import nltk
nltk.download('punkt')

Running the Code

Open a terminal and run the command:

python3 main.py

Results

For every result, we generate 3 graphs, like the examples below:

Similarity Boxplot:

Example Graph: Similarity Boxplot

Similarity Error Boxplot:

Example Graph: Error Boxplot

Similarity Scatterplot:

Example Graph: Similarity Scatterplot

Note that every graph's name is structured as follows:

corpus file + "-w2v-" + size + window + min_count + workers + iter + sg

Here, size is the dimensionality of the word vectors, window is the maximum distance between the current and predicted word within a sentence, min_count is the frequency threshold (words with a lower total frequency are ignored), workers is the number of threads, iter is the number of epochs, and sg selects the algorithm (1 = skip-gram; 0 = CBOW).
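As an illustration of this naming scheme, a helper like the one below could assemble a graph name from the training parameters. The function name and the hyphen separator between parameter values are assumptions; the description above only fixes the order of the fields and the "-w2v-" marker.

```python
def graph_name(corpus_file, size, window, min_count, workers, iterations, sg, sep="-"):
    """Build a name: corpus file, "-w2v-", then the parameters in README order."""
    params = (size, window, min_count, workers, iterations, sg)
    return corpus_file + "-w2v-" + sep.join(str(p) for p in params)


# Example values are hypothetical.
name = graph_name("text8", size=100, window=5, min_count=5, workers=4, iterations=5, sg=1)
print(name)  # text8-w2v-100-5-5-4-5-1
```

Passing `sep=""` gives plain concatenation, if that is closer to the names the scripts actually produce.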

We also generate one log file that records hit/miss statistics. Unfortunately, I have not yet written a parser for this data, but it is not hard to process by hand.
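A parser for the log could look like the sketch below. The log's layout is not shown in this README, so this assumes it contains per-section summary lines similar to those Gensim logs during analogy evaluation, in the form `<section>: <pct>% (<correct>/<total>)`. If the actual log differs, the regular expression needs adjusting. `parse_hits` is a hypothetical name, and the sample numbers are made up.

```python
import re

# Assumed line layout: "<section>: <pct>% (<correct>/<total>)"
SUMMARY = re.compile(
    r"(?P<section>[\w-]+): (?P<pct>\d+(?:\.\d+)?)% \((?P<hits>\d+)/(?P<total>\d+)\)"
)


def parse_hits(lines):
    """Return {section: (hits, misses)} for every line matching SUMMARY."""
    stats = {}
    for line in lines:
        match = SUMMARY.search(line)
        if match:
            hits = int(match.group("hits"))
            total = int(match.group("total"))
            stats[match.group("section")] = (hits, total - hits)
    return stats


# Made-up numbers, for illustration only.
sample_log = [
    "INFO : family: 80.0% (40/50)",
    "INFO : training model with 4 workers",
    "INFO : total: 60.0% (120/200)",
]
stats = parse_hits(sample_log)
print(stats)
```

Lines that do not match the assumed layout are skipped, so the function can be run over the whole log file without filtering it first.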

Built With