
authorship identification, UH course 'Machine Learning'


authorship identification



Python 3 + LightGBM (LGBM) + NLTK + textstat + Jupyter Notebook + pandas + scikit-learn + PyMySQL

How to use

Dataset: download "Authorship Attribution on Reviews (CICLing 2016)" from http://ritual.uh.edu/resources/. Only the Amazon review data is needed.
AA.sql: run this script to generate the train, validation, and test datasets. The result covers 1000 authors with 200 comments each.
data_processing.ipynb: the main feature-engineering code.
model.ipynb: code for the LGBM model.
Final_AA_Group1.pdf: final presentation.
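
To give a rough idea of what feature engineering for authorship identification can look like, here is a minimal sketch that turns one review into a few stylometric features. It is not taken from data_processing.ipynb: the function name `stylometric_features`, the short `FUNCTION_WORDS` list, and the chosen features are all assumptions for illustration, and it uses only the standard library as a stand-in for NLTK and textstat.

```python
import re
import string
from collections import Counter

# Small illustrative function-word list (an assumption; a real pipeline
# would more likely use a full stop-word list such as NLTK's).
FUNCTION_WORDS = ("the", "a", "and", "of", "to", "in", "is", "it", "i", "that")


def stylometric_features(text: str) -> dict:
    """Compute a few simple, author-revealing statistics for one review."""
    # Crude tokenization: runs of letters/apostrophes are words,
    # and ., ! or ? end a sentence.
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    # Guard the denominators so empty input does not divide by zero.
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    counts = Counter(words)

    features = {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(counts) / n_words,
        "punct_ratio": sum(ch in string.punctuation for ch in text) / n_chars,
        "upper_ratio": sum(ch.isupper() for ch in text) / n_chars,
    }
    # Relative frequency of each function word; these tend to be
    # topic-independent and therefore useful for telling authors apart.
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = counts[fw] / n_words
    return features


sample = "I love this phone. The battery lasts all day, and the screen is great!"
feats = stylometric_features(sample)
for name, value in feats.items():
    print(f"{name}: {value:.3f}")
```

Applying such a function to every comment gives one numeric row per review, which is the kind of table a gradient-boosted model like LGBM can be trained on with the author as the label.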
feature importance