This repository contains a text classification project, which involves the following steps:
In this step, the data was downloaded and cleaned. A new data frame was created and saved as a new data file. The final data file has 3117 rows and 5 different classes.
Python notebook: Getting and Cleaning Data
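As a rough illustration of what a cleaning pass can look like, here is a minimal sketch using only the standard library. The `clean_text` helper, the sample rows, and the class labels are hypothetical and are not taken from the notebook; the actual cleaning rules may differ.

```python
import re


def clean_text(text):
    """Lowercase the text, strip HTML-like tags and non-letters, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML-like tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # keep ASCII letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()        # collapse runs of whitespace


# Hypothetical raw records: (text, label) pairs
raw_rows = [
    ("<p>Stocks RALLY 5% today!</p>", "business"),
    ("   ", "sport"),
    ("Team wins the final, 3-1", "sport"),
]

cleaned = [(clean_text(text), label) for text, label in raw_rows]
cleaned = [(text, label) for text, label in cleaned if text]  # drop rows left empty

for text, label in cleaned:
    print(label, "|", text)
```

Rows that end up empty after cleaning are dropped, which is one common reason a cleaned file has fewer rows than the raw download.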
In this step, some exploratory data analysis was done. A few examples are:
- Plot of class frequencies
- Net word frequencies plot
- Plot of frequencies of words in each class
- Plot of Word cloud
Class frequencies | Net word freq | Word freq per class | Word cloud |
---|---|---|---|
Python notebook: Exploratory Data Analysis
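The counts behind plots like these can be computed with `collections.Counter`. The sketch below assumes the cleaned data is available as (text, label) pairs; the sample rows and labels are made up for illustration, and the plotting itself is left out.

```python
from collections import Counter

# Hypothetical cleaned samples: (text, label) pairs
samples = [
    ("stocks rally as markets open", "business"),
    ("markets fall on rate fears", "business"),
    ("team wins the cup final", "sport"),
]

# Class frequencies: how many documents each class has
class_counts = Counter(label for _, label in samples)

# Net word frequencies: word counts over the whole corpus
word_counts = Counter(word for text, _ in samples for word in text.split())

# Word frequencies per class: one Counter per label
words_per_class = {}
for text, label in samples:
    words_per_class.setdefault(label, Counter()).update(text.split())

print(class_counts.most_common())
print(word_counts.most_common(3))
print(words_per_class["sport"].most_common(3))
```

Each `Counter` can be passed to a plotting library as a set of labels and heights for a bar chart, or used as word weights for a word cloud.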
This is the main body of the project. It includes three different model building files:
This Python notebook walks through model evaluation using a bag-of-words representation of the text data and picks the best model for further parameter tuning. Python notebook: Bag of Words models
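To make the representation concrete, the following is a hand-rolled sketch of a bag-of-words encoding. It is not the notebook's code, which may well rely on a library vectorizer; `build_vocabulary`, `bag_of_words`, and the sample documents are hypothetical names chosen for this example.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct word to a column index (alphabetical order)."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    return {word: i for i, word in enumerate(vocab)}


def bag_of_words(doc, vocab):
    """Return a count vector for doc; words outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for word, count in Counter(doc.split()).items():
        if word in vocab:
            vector[vocab[word]] = count
    return vector


docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Word order is discarded: each document becomes a fixed-length vector of counts, which is what a classifier is then trained on.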
This Python notebook walks through model evaluation using a TF-IDF representation of the text data and picks the best model for further parameter tuning. Python notebook: TF-IDF Models
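The idea behind TF-IDF weighting can be sketched in a few lines. This example uses the textbook form, term frequency times `log(N / df)`; library implementations usually apply smoothing and normalization, so their numbers will differ. The `tf_idf` helper and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)

for doc, doc_weights in zip(docs, weights):
    print(doc, "->", {w: round(v, 3) for w, v in doc_weights.items()})
```

A word that appears in every document, such as "the" here, gets a weight of zero, while words that are rare across the corpus get the highest weights.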
This Python notebook walks through model evaluation using a 'word to vec' representation of the text data and, in particular, trains an RNN model. Python notebook: Word to Vec Models
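Before an embedding layer and an RNN can consume text, each document is typically turned into a fixed-length sequence of integer word ids. The sketch below shows only that preprocessing step, not the embedding or the RNN; the helpers, the reserved ids (0 for padding, 1 for unknown words), and `max_len=4` are assumptions for illustration rather than the notebook's settings.

```python
PAD_ID = 0      # assumed id reserved for padding
UNKNOWN_ID = 1  # assumed id reserved for words not seen in training


def build_index(docs):
    """Assign each distinct word an integer id, in order of first appearance."""
    index = {}
    for doc in docs:
        for word in doc.split():
            if word not in index:
                index[word] = len(index) + 2  # ids 0 and 1 are reserved
    return index


def encode(doc, index, max_len):
    """Convert doc to word ids, truncated or right-padded to max_len."""
    ids = [index.get(word, UNKNOWN_ID) for word in doc.split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


train_docs = ["the cat sat", "the dog barked loudly"]
index = build_index(train_docs)
encoded = [encode(doc, index, max_len=4) for doc in train_docs]
unseen = encode("the bird sat", index, max_len=4)

print(encoded)
print(unseen)
```

Each id then selects a row of the embedding matrix, so the RNN receives one word vector per time step.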
So far, 85-87% accuracy has been obtained. Typically, `XGBoostClassifier`, `LogisticRegression`, and `RNN` with word to vec embedding perform better than the other classifiers.
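For reference, accuracy here means the fraction of test documents whose predicted class matches the true class. A minimal sketch follows; the labels and predictions are invented for the example and are not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels and predictions, for illustration only
y_true = ["sport", "business", "tech", "sport", "tech", "business", "sport"]
y_pred = ["sport", "business", "tech", "sport", "sport", "business", "sport"]

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
```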