ReviewClassification

The pandas library is used to read the datasets and resample them into balanced datasets.
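The balancing step could be sketched as follows. This is only a minimal illustration, not the project's actual code: the column name `label`, the helper `balance_by_downsampling`, and the choice of downsampling every class to the size of the smallest class are all assumptions.

```python
import pandas as pd


def balance_by_downsampling(df, label_col="label", seed=0):
    """Downsample every class to the size of the smallest class.

    `label_col` is a hypothetical column name; adjust it to the dataset.
    """
    n_min = df[label_col].value_counts().min()
    return (
        df.groupby(label_col)
        .sample(n=n_min, random_state=seed)  # requires pandas >= 1.1
        .reset_index(drop=True)
    )


# Toy data standing in for a review dataset (4 positive, 2 negative reviews).
reviews = pd.DataFrame({
    "text": ["great", "good", "fine", "bad", "awful", "nice"],
    "label": [1, 1, 1, 0, 0, 1],
})
balanced = balance_by_downsampling(reviews)
print(balanced["label"].value_counts())
```

Downsampling discards data from the larger classes; if the smallest class is very small, oversampling may be a better fit.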

  • To generate the embeddings, the following libraries are used:

  • The Yelp! and Zappos datasets are used for the analysis in this project (the full datasets are included in the link).

  • The following datasets are CSV files containing reviews with their corresponding embedding vectors and labels, so they can be used for training and prediction without further processing. (Except for Yelp.csv and Zappos.csv, the naming convention is as follows: [embedding model]-[training corpus]-[training or test and validate])

    All datasets used will be fully uploaded by 19.05.2020.

  • The following Jupyter Notebooks are included:

    • s2v-bert-tfif.ipynb
      • This notebook classifies s2v, BERT, and TF-IDF embeddings with either a neural network classifier or an SVM.
    • word2vec.ipynb
      • This notebook classifies w2v embeddings with either a neural network classifier or an SVM.
    • accuracy.ipynb
      • This notebook includes a method to help calculate class-wise accuracy from a confusion matrix.
    • Utility notebooks are included in the util folder; they help to generate and read the various embeddings.
    • Detailed instructions are included in the notebooks.
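As a rough illustration of computing class-wise accuracy from a confusion matrix, the sketch below divides each diagonal entry by its row total. It is not taken from accuracy.ipynb: the function name `class_wise_accuracy` is hypothetical, and it assumes that rows are true labels and columns are predicted labels (the scikit-learn convention), so the result is the fraction of each class's samples that were classified correctly.

```python
import numpy as np


def class_wise_accuracy(confusion):
    """Return, per class, correct predictions divided by true samples.

    Assumes rows are true labels and columns are predicted labels.
    """
    cm = np.asarray(confusion, dtype=float)
    row_totals = cm.sum(axis=1)
    # Classes with no true samples get 0.0 instead of a division error.
    return np.divide(
        np.diag(cm),
        row_totals,
        out=np.zeros_like(row_totals),
        where=row_totals > 0,
    )


# Toy 2-class confusion matrix: 8 of 10 and 9 of 10 samples correct.
cm = [[8, 2],
      [1, 9]]
per_class = class_wise_accuracy(cm)
print(per_class)
```

If the confusion matrix is stored with predicted labels in rows instead, the sum would need to run over `axis=0`.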

Two Python files are included to help with embedding processing and classification. The parameters in NeuralNetClassifier.py can be adjusted for further testing.