Primary language: TeX. License: GNU General Public License v3.0 (GPL-3.0).

Stylometry Using Machine Learning

A machine learning project for classifying authors based on their writings.

Dataset

The dataset contains the writings of three authors :

  • Edgar Allan Poe (EAP) : 7900 phrases.
  • H. P. Lovecraft (HPL) : 5635 phrases.
  • Mary Wollstonecraft Shelley (MWS) : 6044 phrases.
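The counts above imply a mildly imbalanced three-class problem. As a quick sanity check, the short sketch below computes each author's share of the data and the accuracy of a trivial baseline that always predicts the most frequent author. Only the three counts come from this README; the variable names are illustrative.

```python
# Phrase counts per author, copied from the list above.
counts = {"EAP": 7900, "HPL": 5635, "MWS": 6044}

total = sum(counts.values())
shares = {author: n / total for author, n in counts.items()}

# A classifier that always predicts the most frequent author
# reaches an accuracy equal to that author's share of the data.
majority_author = max(counts, key=counts.get)
majority_baseline = shares[majority_author]

for author, share in shares.items():
    print(f"{author}: {share:.1%}")
print(f"Majority baseline ({majority_author}): {majority_baseline:.1%}")
```

Any trained model should clearly beat this baseline before its results are worth comparing.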

Work Pipeline

  • Text Processing :

    • Tokenization (converting into words or tokens).
    • Stemming / Lemmatization (normalizing tokens).
  • Feature Extraction :

    • Classical : bag of words / n-grams / TF-IDF.
    • Deep Learning : word embeddings.
  • Classification (a better approach is to train one binary classifier per author, which improves generalization) :

    • Classical : Linear Regression / Naive Bayes (BEST) / SVM / XGBoost.
    • Deep Learning : RNNs.
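The classical branch of this pipeline can be illustrated with a minimal, dependency-free sketch: regex tokenization, bag-of-words counts, and one binary multinomial Naive Bayes classifier per author (one-vs-rest). This is not the project's implementation. The class and function names are hypothetical, the three training sentences are invented toy data, and stemming, n-grams and TF-IDF weighting are left out for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class BinaryNaiveBayes:
    """Multinomial Naive Bayes on bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        # labels are booleans: True means "written by the target author".
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def score(self, text):
        """Log-odds that the text belongs to the target author rather than the rest."""
        total_docs = sum(self.docs.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]
        result = 0.0
        for label, sign in ((True, 1.0), (False, -1.0)):
            log_prob = math.log(self.docs[label] / total_docs)
            denom = sum(self.counts[label].values()) + len(self.vocab)
            for token in tokens:
                log_prob += math.log((self.counts[label][token] + 1) / denom)
            result += sign * log_prob
        return result


def train_one_vs_rest(texts, authors):
    """Train one binary classifier per author (target author vs. everyone else)."""
    return {
        author: BinaryNaiveBayes().fit(texts, [a == author for a in authors])
        for author in set(authors)
    }


def predict(models, text):
    """Pick the author whose binary classifier gives the highest log-odds."""
    return max(models, key=lambda author: models[author].score(text))


# Invented toy sentences, used only to exercise the code.
toy_texts = [
    "the raven tapped at my chamber door",
    "the ancient cult worshipped a nameless horror",
    "the creature wept for his lost creator",
]
toy_authors = ["EAP", "HPL", "MWS"]

models = train_one_vs_rest(toy_texts, toy_authors)
print(predict(models, "a nameless horror rose from the ancient sea"))
```

Tokens that never appeared in training are ignored at prediction time, which keeps the smoothing simple. A real setup would more likely rely on a library vectorizer and classifier.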

Installation

  • Install the dependencies listed in requirements.txt using pip:
    pip3 install -r requirements.txt

Usage

  • To train the linear classifier model :

    • Create a models folder.

    • Edit configs/lc_config.json.

    • Run train.py :

      python train.py train-lc
  • To train the neural network model :

    • Create models and w2v_models folders.

    • Download the GloVe embeddings and extract them into the w2v_models folder.

    • Edit configs/nn_config.json.

    • Run train.py :

      python train.py train-nn
  • To run inference with the linear classifier model :

    python evaluate.py eval-lc --author1 /path/to/author1/text --author2 /path/to/author2/text --model /path/to/model/file
  • To run inference with the neural network model :

    python evaluate.py eval-nn --author1 /path/to/author1/text --author2 /path/to/author2/text --model /path/to/model/file --w2v_path /path/to/w2v/model
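For readers unfamiliar with the GloVe files mentioned in the neural network training steps, the sketch below shows one way to read GloVe's plain-text format, in which each line holds a token followed by its space-separated vector components. It is only an illustration: `load_glove` is a hypothetical helper, not a function from this repository, the three-dimensional vectors are made up, and the loader used by train.py and evaluate.py may work differently.

```python
import os
import tempfile


def load_glove(path):
    """Read a GloVe-style text file into a dict mapping token -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Write a tiny made-up file so the example runs without downloading anything.
sample = "raven 0.1 -0.2 0.3\nhorror 0.4 0.5 -0.6\n"
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "toy_glove.txt")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(sample)

vectors = load_glove(sample_path)
print(len(vectors), "tokens loaded, dimension", len(vectors["raven"]))
```

Real GloVe files are large, so in practice it usually pays to keep only the tokens that occur in the training vocabulary.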