
Get word embeddings using two methods: SVD (singular value decomposition) and Skip Gram with Negative Sampling.


Word Embedding Comparison: SVD vs. Skip Gram with Negative Sampling

This project involves implementing and comparing word embedding models using Singular Value Decomposition (SVD) and Skip Gram with Negative Sampling. The analysis focuses on discerning differences in the quality of embeddings produced and their effectiveness in downstream tasks.

1. Introduction

Many NLP systems employ modern distributional semantic algorithms, known as word embedding algorithms, to generate meaningful numerical representations for words. These algorithms aim to create embeddings where words with similar meanings are represented closely in a mathematical space. Word embeddings fall into two main categories: frequency-based and prediction-based.

  • Frequency-based embeddings: Utilize vectorization methods such as Count Vector, TF-IDF Vector, and Co-occurrence Matrix.
  • Prediction-based embeddings: Exemplified by Word2Vec, these utilize models such as Continuous Bag of Words (CBOW) and Skip-Gram (SG).

2. Training Word Vectors

2.1 Singular Value Decomposition (SVD)

Implemented a word embedding model and trained word vectors by first building a Co-occurrence Matrix and then applying SVD.
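
As a rough illustration, the sketch below (not the project's exact code) builds a word-word co-occurrence matrix from a tokenized corpus and reduces it with truncated SVD; the window size and embedding dimension are placeholder values.

    import numpy as np
    from scipy.sparse import lil_matrix
    from scipy.sparse.linalg import svds

    def svd_embeddings(sentences, window=2, dim=100):
        # sentences: list of token lists
        vocab = sorted({w for s in sentences for w in s})
        words_to_ind = {w: i for i, w in enumerate(vocab)}
        cooc = lil_matrix((len(vocab), len(vocab)))
        for s in sentences:
            for i, w in enumerate(s):
                # count every word within `window` positions of w
                for j in range(max(0, i - window), min(len(s), i + window + 1)):
                    if j != i:
                        cooc[words_to_ind[w], words_to_ind[s[j]]] += 1.0
        # keep the top `dim` left singular vectors, scaled by the singular values
        u, sigma, _ = svds(cooc.tocsr(), k=min(dim, len(vocab) - 1))
        return words_to_ind, u * sigma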

2.2 Skip Gram with Negative Sampling

Implemented the Word2Vec model and trained word vectors using the Skip Gram model with Negative Sampling.
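
The core of the objective can be sketched in PyTorch as follows; this is a minimal, simplified version in which the embedding dimension is a placeholder and the sampling of negatives is assumed to happen outside the module.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SGNS(nn.Module):
        def __init__(self, vocab_size, dim=100):
            super().__init__()
            self.center = nn.Embedding(vocab_size, dim)   # vectors for center words
            self.context = nn.Embedding(vocab_size, dim)  # vectors for context words

        def forward(self, center_ids, pos_ids, neg_ids):
            # center_ids, pos_ids: (batch,)  neg_ids: (batch, k) sampled negatives
            c = self.center(center_ids)                        # (batch, dim)
            pos = self.context(pos_ids)                        # (batch, dim)
            neg = self.context(neg_ids)                        # (batch, k, dim)
            pos_score = (c * pos).sum(dim=1)                   # (batch,)
            neg_score = torch.bmm(neg, c.unsqueeze(2)).squeeze(2)  # (batch, k)
            # maximize log-sigmoid of true pairs, minimize it for the negatives
            return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())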

3. Corpus

Trained the models on the given CSV files, linked here: News Classification Dataset.

Note: Used the Description column of the train.csv for training word vectors. The label/index column is used for the downstream classification task.
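
For reference, reading the corpus could look like the sketch below; the column names and positions are assumptions based on the description above and may need adjusting to the actual CSV layout.

    import pandas as pd

    df = pd.read_csv("train.csv")
    # Description column -> tokenized sentences for training word vectors
    corpus = df["Description"].str.lower().str.split().tolist()
    # label/index column (assumed here to be the first column) -> downstream labels
    labels = df.iloc[:, 0].tolist()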

4. Downstream Task

After successfully creating word vectors using the above two methods, evaluated them on a downstream classification task. Used the same RNN architecture and hyperparameters across both vectorization methods for the downstream task.
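
A minimal sketch of such a classifier follows; the architecture details and sizes are illustrative, not the exact hyperparameters used in the project.

    import torch
    import torch.nn as nn

    class RNNClassifier(nn.Module):
        def __init__(self, embeddings, hidden_dim=128, num_classes=4):
            super().__init__()
            # pretrained vectors are frozen so both methods are compared on equal footing
            self.embed = nn.Embedding.from_pretrained(embeddings, freeze=True)
            self.rnn = nn.RNN(embeddings.size(1), hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, token_ids):            # token_ids: (batch, seq_len)
            x = self.embed(token_ids)            # (batch, seq_len, emb_dim)
            _, h = self.rnn(x)                   # h: (1, batch, hidden_dim)
            return self.fc(h.squeeze(0))         # logits: (batch, num_classes)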

5. Analysis

Compared and analyzed which of the two word vectorizing methods performs better using performance metrics such as accuracy, F1 score, precision, recall, and the confusion matrix on both the train and test sets. Wrote a detailed report on why one technique might perform better than the other, including the possible shortcomings of both techniques (SVD and Word2Vec).
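
The metrics themselves can be computed with scikit-learn; the labels below are placeholders, not results from the project.

    from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                                 recall_score, confusion_matrix)

    y_true = [0, 1, 2, 3, 1, 0]   # placeholder true labels
    y_pred = [0, 1, 2, 3, 0, 0]   # placeholder predicted labels

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred, average="macro"))
    print("recall   :", recall_score(y_true, y_pred, average="macro"))
    print("f1       :", f1_score(y_true, y_pred, average="macro"))
    print(confusion_matrix(y_true, y_pred))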

6. Hyperparameter Tuning

Experimented with three different context window sizes. Reported performance metrics for all three context window configurations. Mentioned which configuration performs the best and discussed possible reasons for it.
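
The experiment can be driven by a small loop, for example reusing the svd_embeddings sketch from section 2.1 and the corpus loaded in section 3; the window sizes here are illustrative.

    # Each window size feeds the same downstream setup for a fair comparison.
    for window in (2, 3, 5):
        words_to_ind, vectors = svd_embeddings(corpus, window=window, dim=100)
        # ... train the same RNN classifier on `vectors` and record the metrics
        print(f"window={window}: trained {len(words_to_ind)} word vectors")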

Execution

To execute any file, use:

python3 <filename>

To load the pretrained models:

torch.load("<filename>.pt")

Loading Pretrained Models

Word Embeddings

Loading svd-word-vectors.pt and skip-gram-word-vectors.pt gives us a dictionary. From this dictionary, we can access:

  • words_to_ind using dic["words_to_ind"]
  • word_embeddings using dic["word_embeddings"]

To get the word embedding for a token (see the sketch after these steps):

  1. Get the index (idx) using words_to_ind[token].
  2. Get the word embedding using word_embeddings[idx].
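
Putting the two steps together (a minimal sketch; the token "economy" is just an example):

    import torch

    dic = torch.load("svd-word-vectors.pt")
    words_to_ind = dic["words_to_ind"]
    word_embeddings = dic["word_embeddings"]

    idx = words_to_ind["economy"]      # 1. index of the token
    vector = word_embeddings[idx]      # 2. its word embedding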

Classification Models

Loading svd-classification-model.pt and skip-gram-classification-model.pt gives a model that returns the class index for a given sentence (an illustrative usage sketch follows the notes below).

  • svd-classification means the model is trained using word embeddings obtained by the SVD method.
  • Similarly, skip-gram-classification refers to the model trained using word embeddings obtained by the Skip Gram with Negative Sampling method.
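
One way such a model could be queried is sketched below. This assumes the model takes a batch of token indices built with the same words_to_ind mapping and returns class logits, which may not match the exact interface of the saved models.

    import torch

    model = torch.load("svd-classification-model.pt")
    model.eval()

    sentence = "stocks rallied after the earnings report"   # example sentence
    ids = torch.tensor([[words_to_ind[w] for w in sentence.lower().split()
                         if w in words_to_ind]])
    with torch.no_grad():
        class_index = model(ids).argmax(dim=1).item()       # predicted class index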

Links to .pt Files

  1. svd-word-vectors.pt
  2. skip-gram-word-vectors.pt
  3. svd-classification-model.pt
  4. skip-gram-classification-model.pt