A machine learning project for classifying authors based on their writings.
The dataset contains the writings of three authors :
- Edgar Allan Poe (EAP) : 7900 phrases.
- HP Lovecraft (HPL) : 5635 phrases.
- Mary Wollstonecraft Shelley (MWS) : 6044 phrases.
-
Text Processing :
- Tokenization (converting into words or tokens).
- Stemming / Lemmatization (normalizing tokens).
-
Feature Extraction :
- Classical : bag of words / n-grams / TF-IDF.
- Deep Learning : word embeddings.
-
Classification (better approach to train a binary classifier for each author to improve generalization) :
- Classical : Linear Regression / Naive Bayes (BEST) / SVM / XGBoost.
- Deep Learning : RNNs.
- Install
requirements.txt
using PyPi:pip3 install -r requirements.txt
-
For training linear classifier model :
-
Create
models
folder. -
Edit
configs/lc_config.json
. -
Run
train.py
:python train.py train-lc
-
-
For training neural network model :
-
Create
models
andw2v_models
folders. -
Download GloVe embeddings and extract it into
w2v_models
folder. -
Edit
configs/nn_config.json
. -
Run
train.py
:python train.py train-nn
-
-
For inference on linear classifier model :
python evaluate.py eval-lc --author1 /path/to/author1/text --author2 /path/to/author2/text --model /path/to/model/file
-
For inference on neural network model :
python evaluate.py eval-nn --author1 /path/to/author1/text --author2 /path/to/author2/text --model /path/to/model/file --w2v_path /path/to/w2v/model