This repository provides a data preprocessing script for reading large text datasets, along with two Word2Vec models: one from nltk and one implemented in PyTorch using a Skip-Gram network.
Use the package manager pip or Anaconda3 to install the required packages from the requirements.txt file.
pip install -r requirements.txt
To use the preprocessing methods, import them from preprocessing.py:
from preprocessing import get_dataset, stem_words
df = get_dataset(name="amazon_reviews_multi", lang="en", split="train")
stem_words(df, "review_body", "english")
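Once the text is preprocessed, a Skip-Gram model trains on (center, context) word pairs. The following is a minimal sketch of how such pairs could be generated from a tokenized review, assuming a symmetric context window. The names `skipgram_pairs` and `window` are hypothetical and not part of this repository; the actual PyTorch implementation may build its training data differently.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions.

    Hypothetical helper for illustration only; not part of this repository.
    """
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the context window to the bounds of the token list.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Example with an already tokenized sentence and a window of one word per side.
pairs = skipgram_pairs(["great", "product", "fast", "shipping"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

A larger `window` yields more pairs per word and tends to capture broader topical similarity, while a smaller one focuses on immediate neighbors.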
This repository is currently closed to contributions from anyone other than current members, but feel free to use and redistribute the code for any purpose.