Install Anaconda or Miniconda and run:
conda create -n text-class python=3.7 ipykernel unidecode nltk numpy pandas scikit-learn tensorflow-gpu keras pydot gensim sphinx -y
conda activate text-class
Running TensorBoard:
tensorboard --logdir logs --port 6006
Generating documentation:
Getting Started with Sphinx / Autodoc.
cd docs
make html
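For autodoc to find the project's docstrings, the Sphinx configuration has to enable the extension and make the source modules importable. The fragment below is a minimal sketch of what `docs/conf.py` might contain. The project name and the assumption that the Python modules live one directory above `docs/` are illustrative, not taken from this repository.

```python
# docs/conf.py (fragment, illustrative only)
import os
import sys

# Assumption: the modules to document sit one level above the docs/ folder.
sys.path.insert(0, os.path.abspath(".."))

# Hypothetical project name; replace with the real one.
project = "text-class"

# sphinx.ext.autodoc pulls documentation from docstrings at build time.
extensions = [
    "sphinx.ext.autodoc",
]
```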
AG's News Topic Classification Dataset.
Stratified splitting:
- Training set: 102080 samples (80%)
- Test set: 25520 samples (20%)
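The two set sizes add up to 127600 samples, and 0.8 × 127600 = 102080, which matches the 80/20 split above. As a rough illustration of what stratified splitting means, the sketch below splits each class separately so that every class keeps the same train/test ratio. It is a pure-Python toy with hypothetical class names and a made-up sample count, not the project's own code; scikit-learn's `train_test_split` with its `stratify` argument offers the same behavior.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data (assumption): 4 hypothetical classes with 50 samples each.
class_names = ("world", "sports", "business", "scitech")
labels = [name for name in class_names for _ in range(50)]

train_idx, test_idx = stratified_split(labels, test_size=0.2)
print(len(train_idx), len(test_idx))  # 160 40
```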
conda activate text-class
python preporcessing.py
conda deactivate
Steps:
- Lower case.
- Remove accents.
- Remove punctuation.
- Remove numbers.
- Remove single character words.
- Remove English stop-words.
- Remove multiple spaces.
- Remove leading and trailing spaces.
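The steps above can be sketched as a single function. This is only an approximation using the standard library: the stop-word set is a tiny illustrative sample (the actual script may rely on NLTK's English list and on `unidecode` for accents), and the function name is hypothetical.

```python
import re
import string
import unicodedata

# Tiny illustrative stop-word set (assumption, not the list used by the project).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Apply the cleaning steps listed above to one document."""
    text = text.lower()
    # Remove accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation and numbers with spaces.
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)
    text = re.sub(r"\d+", " ", text)
    # Drop single-character words and stop-words.
    tokens = [t for t in text.split() if len(t) > 1 and t not in stop_words]
    # Joining on one space also removes repeated, leading and trailing spaces.
    return " ".join(tokens)


print(preprocess("  The café's 3 Olympic teams arrived in São Paulo!  "))
# cafe olympic teams arrived sao paulo
```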
conda activate text-class
python train_svm_model.py
conda deactivate
Parameters:
- Terms: n-grams from 1 to 3.
- Min. document frequency: 5.
- Max. document frequency: 0.5 (50%).
- Vocabulary size: 113149 (all available terms).
- SVM cost: C = 0.5.
- Sample weights: inverse of class proportions (weights='balanced').
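One way these parameters could map onto scikit-learn is sketched below. Several things here are assumptions: the source does not say whether terms are weighted with TF-IDF or raw counts, which SVM implementation is used, or how `weights='balanced'` is passed, so the sketch uses `TfidfVectorizer`, `LinearSVC`, and scikit-learn's `class_weight="balanced"`. The function name and the toy corpus are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_svm_model(min_df=5, max_df=0.5, C=0.5):
    """Build an n-gram + linear SVM pipeline with the parameters listed above."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 3),  # unigrams, bigrams and trigrams
        min_df=min_df,       # keep terms found in at least 5 documents
        max_df=max_df,       # drop terms found in more than 50% of documents
    )
    # class_weight="balanced" weights classes by inverse frequency (assumed
    # equivalent of the weights='balanced' setting mentioned above).
    classifier = LinearSVC(C=C, class_weight="balanced")
    return make_pipeline(vectorizer, classifier)


# Toy corpus (assumption): 4 hypothetical topics, 6 identical documents each,
# so every term passes the min_df and max_df filters.
topics = {
    "business": "stocks market rally",
    "sports": "football match goal",
    "world": "election vote result",
    "scitech": "rocket launch orbit",
}
texts = [text for text in topics.values() for _ in range(6)]
labels = [name for name in topics for _ in range(6)]

model = build_svm_model()
model.fit(texts, labels)
print(model.predict(["football match goal"])[0])  # sports
```

`LinearSVC` does not produce class probabilities, so a log-loss figure would require an extra step such as probability calibration; how the original script obtains probabilities is not stated.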
Performance:
- Training time: 00:02:06.
- Overall Accuracy: 0.8936.
- Balanced Accuracy: 0.8936.
- Micro F1-score: 0.8936.
- Macro F1-score: 0.8934.
- Log-loss: 0.3312.
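For readers unfamiliar with these metrics, the sketch below computes them from scratch on a made-up four-sample example. It also shows why accuracy and micro F1 coincide for single-label classification: every error counts once as a false positive and once as a false negative. Balanced accuracy matches accuracy whenever the test classes are equally sized. The function name and data are hypothetical, and the code assumes every class appears at least once in the true labels; `sklearn.metrics` provides equivalent functions.

```python
import math


def evaluate(y_true, y_pred, y_proba, classes):
    """Compute the metrics listed above for single-label classification."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    recalls, f1_scores = [], []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        recalls.append(tp / (tp + fn))
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    # Log-loss: mean negative log-probability assigned to the true class.
    log_loss = -sum(
        math.log(probs[classes.index(t)]) for t, probs in zip(y_true, y_proba)
    ) / n

    return {
        "accuracy": accuracy,
        "balanced_accuracy": sum(recalls) / len(classes),
        "micro_f1": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),
        "macro_f1": sum(f1_scores) / len(classes),
        "log_loss": log_loss,
    }


# Made-up example: two classes, one of four predictions is wrong.
classes = ["a", "b"]
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
y_proba = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]]

metrics = evaluate(y_true, y_pred, y_proba, classes)
print(metrics["accuracy"], metrics["micro_f1"])  # 0.75 0.75
```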
To do:
- Create skip-gram model to generate pre-trained embeddings
- Use pre-trained embeddings
- Create files for the Embedding Projector
- Extract document embeddings from the output of the dense layer
- Apply nearest neighbors on document embeddings
- Apply k-means on document embeddings to find topics
- Configure TensorBoard
- Normalize code according to Clean ML Code
- Create tests
- Create linter
- Create CI