Description

Compares several widely used models for the text classification problem.

Features
- Bag of Words (tf-idf)
- Word Embedding (pre-trained on Google News or trained on the training dataset)

Classification models
- SVM
- CNN
- RNN (LSTM or GRU)
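
To make the bag-of-words feature concrete, here is a minimal sketch of tf-idf weighting in plain Python. The helper name `tfidf_vectors`, the toy documents, and the smoothed idf formula are assumptions chosen for illustration; the notebooks may compute tf-idf differently (for example through a library).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper: tf is the raw term count and idf is
    log((1 + N) / (1 + df)) + 1, a common smoothed variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


# Toy tokenized documents (not taken from Reuters-21578).
docs = [["oil", "price", "rises"], ["oil", "exports", "fall"], ["wheat", "price"]]
vectors = tfidf_vectors(docs)
```

Terms that occur in fewer documents receive a larger weight, so `vectors[0]["rises"]` ends up larger than `vectors[0]["oil"]`. The resulting sparse vectors are the kind of input an SVM would be trained on.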

The models are trained and tested on the Reuters-21578 news dataset and reach a classification accuracy between 0.84 and 0.89.

To run the program:

Open the .ipynb files with JupyterLab or Jupyter Notebook.

- wordemb-GoogleNews + CNN
- wordemb-glove + CNN
- wordemb_sum + SVM: GloVe Twitter 100; the word vectors of a document are simply summed.
- tf-idf + SVM: 0.89 accuracy with the RBF kernel; single label.
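
The wordemb_sum feature (summing all word vectors of a document) can be sketched as follows. The function name `sum_word_vectors`, the toy 3-dimensional embedding table, and the choice to skip out-of-vocabulary tokens are assumptions for illustration, not details confirmed by the notebooks.

```python
def sum_word_vectors(tokens, embeddings, dim):
    """Sum the embedding of every known token; unknown tokens are skipped."""
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            # Assumption: tokens without an embedding contribute nothing.
            continue
        for i, value in enumerate(vector):
            total[i] += value
    return total


# Toy 3-dimensional table standing in for pre-trained GloVe vectors.
toy_embeddings = {
    "oil": [1.0, 0.0, 2.0],
    "price": [0.5, 1.0, 0.0],
}
doc_vector = sum_word_vectors(["oil", "price", "unknown"], toy_embeddings, dim=3)
```

The result is a fixed-length vector per document regardless of document length, which is what lets it be passed directly to an SVM.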

Program File Structure

data_structure.py in /data_structure defines the document objects and the static statistical data used for building models and predicting class labels.
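
As a rough picture of what such a document object might hold, here is a minimal sketch. The class name `Document` and all of its field names are hypothetical; the actual definitions in data_structure.py may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical document object; the real class may differ."""
    doc_id: str
    class_labels: List[str]
    tokens: List[str] = field(default_factory=list)
    # Sparse feature vector: term -> weight (for example a tf-idf weight).
    feature_vector: Dict[str, float] = field(default_factory=dict)


doc = Document(doc_id="1", class_labels=["earn"], tokens=["profit", "rises"])
```

Using `default_factory` gives every document its own token list and feature dictionary instead of one shared mutable default.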

preprocess.py in the /data_preprocess module reads data from the dataset, parses the text, and translates it into a list of document objects, each holding its class labels and feature vector. It then tokenizes the words and constructs a list of class labels and a bag of terms.
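
The tokenizing step and the construction of the class-label list and bag of terms might look like the sketch below. The function names, the regular expression used for tokenizing, and the sample texts are assumptions for illustration; preprocess.py may tokenize differently.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabularies(labeled_texts):
    """Return (sorted class labels, sorted bag of terms) for (labels, text) pairs."""
    class_labels, terms = set(), set()
    for labels, text in labeled_texts:
        class_labels.update(labels)
        terms.update(tokenize(text))
    return sorted(class_labels), sorted(terms)


# Toy samples: each entry is (class labels, raw text).
samples = [(["earn"], "Profit rises 5 pct"), (["crude", "trade"], "Oil exports fall")]
labels, terms = build_vocabularies(samples)
```

Here digits are dropped by the tokenizer, so `terms` contains only the six alphabetic words and `labels` contains the three distinct class labels.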

metric.py in /metric defines the importance metric used for feature selection.
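
The source does not say which importance metric is used. As one common choice for text feature selection, the chi-square statistic between a term and a class can be sketched as follows; the function name and argument names are hypothetical and this is not necessarily what metric.py implements.

```python
def chi_square(n_term_class, n_term_other, n_noterm_class, n_noterm_other):
    """Chi-square statistic of a 2x2 term/class contingency table.

    Arguments are document counts: with the term and in the class, with the
    term and outside the class, without the term and in the class, and
    without the term and outside the class.
    """
    a, b, c, d = n_term_class, n_term_other, n_noterm_class, n_noterm_other
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A degenerate table (an empty row or column) carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


perfect = chi_square(10, 0, 0, 10)      # term occurs only inside the class
independent = chi_square(5, 5, 5, 5)    # term is unrelated to the class
```

A higher score means the term is more strongly associated with the class, so terms can be ranked by score and only the top ones kept as features.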

Workflow

Data pre-processing
1. Construct document object
2. Compute feature vector

Classification
