lda_classification

A Python package that aims to make LDA topic modelling even easier for you!

Primary language: Python. License: MIT.

lda_classification

Instantly train an LDA model with a scikit-learn compatible wrapper around gensim's LDA model.

  • Preprocess Your Documents
  • Train an LDA
  • Evaluate Your LDA Model
  • Extract Document Vectors
  • Select the Most Informative Features
  • Classify Your Documents

All in a few lines of code, completely compatible with sklearn's Transformer API.
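For readers new to the convention, here is a minimal sketch of what a transformer-style contract looks like: `fit` learns state and returns `self`, `transform` maps inputs to outputs, and `fit_transform` does both. The classes `WhitespaceTokenizer` and `CountVectors` and the helper `run_steps` are hypothetical stand-ins written in plain Python for illustration; they are not part of lda_classification or scikit-learn.

```python
from collections import Counter


class WhitespaceTokenizer:
    """Hypothetical stateless step: splits each document into lowercase tokens."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        return [doc.lower().split() for doc in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class CountVectors:
    """Hypothetical stateful step: learns a vocabulary, then emits count vectors."""

    def fit(self, X, y=None):
        # Learn a sorted vocabulary from the tokenized training documents.
        self.vocabulary_ = sorted({tok for doc in X for tok in doc})
        return self

    def transform(self, X):
        rows = []
        for doc in X:
            counts = Counter(doc)
            rows.append([counts[tok] for tok in self.vocabulary_])
        return rows

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


def run_steps(steps, docs):
    """Chain steps the way a pipeline would: each output feeds the next step."""
    for step in steps:
        docs = step.fit_transform(docs)
    return docs


docs = ["Topic models find topics", "Models need documents"]
vectors = run_steps([WhitespaceTokenizer(), CountVectors()], docs)
```

Because every step exposes the same three methods, steps can be swapped or reordered without changing the code that chains them, which is the property that lets transformer-style classes slot into a pipeline.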


Installation:

If you want to install via PyPI, use the following command:

pip install lda_classification

If you want to install from source:

git clone https://github.com/FeryET/lda_classification.git
cd lda_classification/
python setup.py install

Requirements:

gensim == 3.8.0
matplotlib == 3.1.2
numpy == 1.19.1
setuptools~=49.6.0
spacy == 2.3.1
tqdm == 4.48.2
scikit-learn~=0.23.1
tomotopy~=0.9.1

Optional:

If you want to automate feature selection with this package, you can also install xgboost, which the feature selection utility class depends on.

xgboost == 1.1.1 (Optional)

Example:

from lda_classification.model import GensimLDAVectorizer
from lda_classification.preprocess import SpacyCleaner
from lda_classification.utils import XGBoostFeatureSelector

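# docs, labels = FETCH YOUR DATASET
# y = ENCODED_LABELS

# Preprocess the raw documents.
docs = SpacyCleaner().transform(docs)
# Train the LDA model and extract the document vectors.
X = GensimLDAVectorizer(200, return_dense=False).fit_transform(docs)
# Select the most informative features (requires xgboost).
X_transform = XGBoostFeatureSelector().fit_transform(X, y)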

There is also a DataReader class and a BaseData class to automate reading your data files from disk. Extend BaseData, implement its abstract methods in the subclass, and feed it to DataReader to simplify fetching your dataset.
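The general pattern can be sketched as follows. This is only an illustration under stated assumptions: `ExampleBaseData`, `FolderPerLabelData` and `ExampleDataReader`, along with the method names `list_files`, `label_for` and `read`, are hypothetical stand-ins invented for this sketch. The package's own BaseData and DataReader may define different abstract methods and signatures, so check the source before subclassing.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class ExampleBaseData(ABC):
    """Stand-in for an abstract dataset description (NOT the package's BaseData)."""

    @abstractmethod
    def list_files(self):
        """Return the paths of all documents on disk."""

    @abstractmethod
    def label_for(self, path):
        """Return the class label of the document at `path`."""


class FolderPerLabelData(ExampleBaseData):
    """Hypothetical subclass: one sub-folder per label, one .txt file per document."""

    def __init__(self, root):
        self.root = Path(root)

    def list_files(self):
        return sorted(self.root.glob("*/*.txt"))

    def label_for(self, path):
        return path.parent.name


class ExampleDataReader:
    """Stand-in reader (NOT the package's DataReader): loads docs and labels."""

    def __init__(self, data):
        self.data = data

    def read(self):
        docs, labels = [], []
        for path in self.data.list_files():
            docs.append(path.read_text(encoding="utf-8"))
            labels.append(self.data.label_for(path))
        return docs, labels


# Build a tiny throwaway dataset so the sketch runs end to end.
with tempfile.TemporaryDirectory() as root:
    for label, text in [("sports", "the match ended in a draw"),
                        ("tech", "the compiler emits bytecode")]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "doc0.txt").write_text(text, encoding="utf-8")
    docs, labels = ExampleDataReader(FolderPerLabelData(root)).read()
```

Keeping the disk layout in the subclass and the loading loop in the reader means a new dataset only needs a new subclass; the reading code stays the same.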