text-classification-benchmark: A Python repository from davpinto

Setup

Install Anaconda or Miniconda, then create and activate the environment:

conda create -n text-class python=3.7 ipykernel unidecode nltk numpy pandas scikit-learn tensorflow-gpu keras pydot gensim sphinx -y
conda activate text-class

Running TensorBoard:

tensorboard --logdir logs --port 6006

Generating documentation:

See: Getting Started with Sphinx / Autodoc.

cd docs
make html

Dataset

AG's News Topic Classification Dataset.

Stratified splitting:

Training set: 102080 samples (80%)
Test set: 25520 samples (20%)
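
A stratified split keeps each class at the same proportion in both subsets. The following is a minimal standard-library sketch of the idea, not the repository's actual code: the function name `stratified_split`, the seed, and the synthetic labels (four equally sized classes, 127600 samples in total) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of each class for the test set
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels (assumption): four balanced classes, 127600 samples
labels = [i % 4 for i in range(127600)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 102080 25520
```

With scikit-learn available, `train_test_split(..., test_size=0.2, stratify=labels)` gives the same kind of split in one call.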

Text Preprocessing

conda activate text-class
python preporcessing.py
conda deactivate

Steps:

Lower case.
Remove accents.
Remove punctuation.
Remove numbers.
Remove single character words.
Remove English stop words.
Remove multiple spaces.
Remove leading and trailing spaces.
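
The steps above can be sketched as a single function. This is an illustrative version that uses only the standard library, not the repository's script: the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical, and `unicodedata` stands in for `unidecode`. With the environment from the Setup section, the NLTK English stop-word list would be the natural replacement for the toy set.

```python
import re
import string
import unicodedata

# Tiny illustrative stop-word set (hypothetical, far smaller than a real list)
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for"}


def preprocess(text, stop_words=STOP_WORDS):
    """Apply the cleaning steps listed above to one document."""
    # Lower case
    text = text.lower()
    # Remove accents: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation and numbers with spaces
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\d+", " ", text)
    # Drop single-character words and stop words
    tokens = [t for t in text.split() if len(t) > 1 and t not in stop_words]
    # Joining tokens with one space also removes repeated and surrounding spaces
    return " ".join(tokens)


print(preprocess("  The Café's 3 new Wi-Fi routers, in a  box! "))
# cafe new wi fi routers box
```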

Experiments

Tf-Idf + Linear SVM (Stacked with GLM)

conda activate text-class
python train_svm_model.py
conda deactivate

Parameters:

Terms: n-grams from 1 to 3.
Min. document frequency: 5.
Max. document frequency: 0.5 (50%).
Vocabulary size: 113149 (all available terms).
SVM cost: C = 0.5.
Sample weights: inverse of class proportions (weights='balanced').
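
For orientation, the parameters above map onto scikit-learn roughly as follows. This is a sketch under assumptions, not the contents of train_svm_model.py: the function name `build_svm_pipeline` is hypothetical, the balanced weighting is expressed through `LinearSVC`'s `class_weight="balanced"` option, and the GLM stacking stage is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_svm_pipeline():
    """Tf-Idf features followed by a linear SVM, using the parameters listed above."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 3),  # n-grams from 1 to 3
        min_df=5,            # keep terms found in at least 5 documents
        max_df=0.5,          # drop terms found in more than 50% of documents
    )
    # No max_features limit, so all remaining terms form the vocabulary
    classifier = LinearSVC(C=0.5, class_weight="balanced")
    return make_pipeline(vectorizer, classifier)


# Usage (train_texts, train_labels, test_texts are assumed to exist):
# model = build_svm_pipeline().fit(train_texts, train_labels)
# predictions = model.predict(test_texts)
```

A linear SVM does not output probabilities, which is one plausible reason for stacking a GLM on top of it before computing log-loss.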

Performance:

Training time: 00:02:06.
Overall Accuracy: 0.8936.
Balanced Accuracy: 0.8936.
Micro F1-score: 0.8936.
Macro F1-score: 0.8934.
Log-loss: 0.3312.
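
The metrics above are closely related: in single-label multi-class classification, micro F1 always equals overall accuracy, and balanced accuracy (the mean per-class recall) matches accuracy when the classes are equally sized. The sketch below computes them with the standard library on made-up labels. The function names and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy, balanced accuracy, micro F1 and macro F1."""
    n = len(y_true)
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    true_count = Counter(y_true)  # TP + FN per class
    pred_count = Counter(y_pred)  # TP + FP per class
    labels = sorted(set(y_true) | set(y_pred))

    accuracy = sum(tp.values()) / n
    # Balanced accuracy: mean recall over the classes present in y_true
    recalls = [tp[c] / true_count[c] for c in sorted(true_count)]
    balanced_accuracy = sum(recalls) / len(recalls)
    # Per-class F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (true_count + pred_count)
    f1_scores = [2 * tp[c] / (true_count[c] + pred_count[c]) for c in labels]
    macro_f1 = sum(f1_scores) / len(f1_scores)
    # Micro F1 pools the counts of all classes before applying the same formula
    micro_f1 = 2 * sum(tp.values()) / (sum(true_count.values()) + sum(pred_count.values()))
    return accuracy, balanced_accuracy, micro_f1, macro_f1


def log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    `probabilities` is a list of dicts mapping class label to probability.
    """
    total = sum(math.log(max(p[t], eps)) for t, p in zip(y_true, probabilities))
    return -total / len(y_true)


# Toy data (made up): four balanced classes, two mistakes
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 0]
accuracy, balanced_accuracy, micro_f1, macro_f1 = classification_metrics(y_true, y_pred)
print(accuracy, balanced_accuracy, micro_f1, round(macro_f1, 4))
```

On this toy data the first three values coincide while macro F1 comes out slightly lower, the same pattern as in the figures reported above.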

References:

Classification of text documents using sparse features.

Convolutional Neural Network

Recurrent Neural Network

Multi-channel Convolutional Neural Network

Character-level Convolutional Neural Network

References:

Deep Models for NLP beginners.

Very Deep Convolutional Neural Network

TO DO

Create skip-gram model to generate pre-trained embeddings
Use pre-trained embeddings
Create files for the Embedding Projector
Extract document embeddings from the output of the dense layer
Apply nearest neighbors on document embeddings
Apply k-means on document embeddings to find topics
Configure TensorBoard
Normalize code according to Clean ML Code
Create tests
Create linter
Create CI

References

Regularization: Normalization & Dropout
