Text Classification

This repository contains text classification work organized into the following steps:

Step I: Getting and Cleaning Data

In this step the data is downloaded and cleaned. A new data frame is created and saved as a new data file. The final data file has 3117 rows and contains 5 different classes.

Python notebook: Getting and Cleaning Data

Step II: Exploratory Data Analysis

In this step some exploratory data analysis is performed. A few examples are:

Plot of class frequencies
Plot of net word frequencies
Plot of word frequencies in each class
Word cloud plot

Figures: class frequencies, net word frequencies, word frequencies per class, word cloud.

Python notebook: Exploratory Data Analysis
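
The counts behind the plots listed above can be sketched with the standard library alone. This is only an illustration, not the notebook's code: the `samples` list, its labels, and the whitespace `tokenize` helper are hypothetical stand-ins for the cleaned data file.

```python
from collections import Counter

# Hypothetical toy data: (text, label) pairs standing in for the cleaned data file.
samples = [
    ("the match ended in a draw", "sport"),
    ("the team won the final match", "sport"),
    ("shares fell as the market closed", "business"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


# Class frequencies: how many documents carry each label.
class_freq = Counter(label for _, label in samples)

# Net word frequencies: word counts over the whole corpus.
net_word_freq = Counter(word for text, _ in samples for word in tokenize(text))

# Word frequencies per class: one Counter per label.
word_freq_per_class = {}
for text, label in samples:
    word_freq_per_class.setdefault(label, Counter()).update(tokenize(text))

print(class_freq.most_common())
print(net_word_freq.most_common(3))
```

Each `Counter` can be passed to a plotting library as labels and heights for a bar chart, or used as the frequency table for a word cloud.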

Step III: Model Selection and Tuning

This is the main body of the project. It includes three model-building notebooks:

1. Model built with Bag of Words.

This Python notebook evaluates models on a bag-of-words representation of the text data and picks the best model for further parameter tuning. Python notebook: Bag of Words models
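
For readers new to the representation, here is a minimal, library-free sketch of how a bag-of-words count vector can be built. The helper names `build_vocabulary` and `bag_of_words` and the two toy documents are assumptions made for illustration; the notebook itself may rely on a library vectorizer instead.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return a count vector for one document; unknown tokens are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy corpus, not taken from the project's data.
bow_docs = ["good movie good plot", "bad movie"]
bow_vocab = build_vocabulary(bow_docs)
bow_matrix = [bag_of_words(doc, bow_vocab) for doc in bow_docs]
```

Each row of `bow_matrix` is one document and each column is one vocabulary word; word order is discarded, which is what gives the representation its name.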

2. Model built with TF-IDF

This Python notebook evaluates models on a TF-IDF representation of the text data and picks the best model for further parameter tuning. Python notebook: TF-IDF Models
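
As a rough illustration of the weighting, the sketch below computes one common textbook variant of TF-IDF: term frequency as count divided by document length, and inverse document frequency as log(N / df). This variant is an assumption; library implementations often add smoothing and normalization, so their numbers will differ. The `tfidf` function and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for tokens in tokenized for tok in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            tok: (n / len(tokens)) * math.log(n_docs / df[tok])
            for tok, n in counts.items()
        })
    return weights


# Hypothetical toy corpus, not taken from the project's data.
tfidf_docs = ["good movie good plot", "bad movie"]
tfidf_weights = tfidf(tfidf_docs)
```

In this toy corpus "movie" appears in every document, so its weight is zero, while words specific to one document keep a positive weight.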

3. Model built with Word to vec

This Python notebook evaluates models on a 'word to vec' representation of the text data and, in particular, trains an RNN model. Python notebook: Word to Vec Models
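
An RNN consumes a document as a fixed-length sequence of word vectors. The sketch below shows only that input-preparation step, under loud assumptions: the 3-dimensional `toy_embeddings` table is invented for illustration (real vectors would be learned from a corpus), and `embed_sequence` is a hypothetical helper, not a function from the notebook.

```python
# Hypothetical 3-dimensional embeddings; real word vectors would be learned
# from a corpus and typically have a few hundred dimensions.
toy_embeddings = {
    "good": [0.9, 0.1, 0.0],
    "bad": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.7, 0.3],
}
EMBEDDING_DIM = 3


def embed_sequence(text, embeddings, max_len):
    """Look up each token's vector, then truncate or zero-pad to max_len steps."""
    vectors = []
    for tok in text.lower().split()[:max_len]:
        # Tokens missing from the embedding table fall back to a zero vector.
        vectors.append(list(embeddings.get(tok, [0.0] * EMBEDDING_DIM)))
    # Pad short documents so every sequence has the same number of time steps.
    while len(vectors) < max_len:
        vectors.append([0.0] * EMBEDDING_DIM)
    return vectors


sequence = embed_sequence("good movie", toy_embeddings, max_len=4)
```

The result has shape `max_len` by `EMBEDDING_DIM`, which is the per-document input shape a recurrent layer would expect.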

Discussion:

So far, 85-87% accuracy has been obtained. XGBoostClassifier, LogisticRegression, and an RNN with word to vec embeddings typically perform better than the other classifiers.

Vasuji/textClassification