Simple text classifier created in Python 3.10, that has been trained on CreepyPasta stories.
Assumes you've already run Creepy Crawler at least once, and thus have created the creepystore
database and filled its stories
table.
It does the following things:
- Mines
tokens
,spans
andsentiments
from each story and stores the results to the corresponding DB tables. - Calculates and plots coherence scores, to determine the correct model configuration.
- Trains two models, one using the existing category data and one using each story's actual text.
- Makes predictions based on a list of input texts (ideally, this should not be included in the training set).
The main goal of this project is to group and categorize CreepyPasta stories in an automatic way.
- spacy 3.6.1
- mysql 0.0.3
- mysql-connector-python 8.1.0
- gensim 4.3.1
- spacytextblob 4.0.0
- pyenchant 3.2.2
- matplotlib 3.7.2
- a locally installed MySQL server instance, with a database called
creepystore
and thestories
table already filled.
Any one of the already trained spacy models should also be installed, using the following command with the correct model name:
python -m spacy download en_core_web_md
The TextProcessor
class acts as an entry point for this process, but the actual tokenization (together with sentiment extraction, for optimization purposes) is done in Tokenizer
.
It is called from main.py
, using:
TextProcessor.mine_text(story_data)
Before it gets called, StoryReader.get_stories()
must also be invoked, in order to return base story data to be processed.
You can keep MINE_TEXT = False
in definitions.py
, if you've already run this and you have its results stored in the database.
This process is managed by the CoherenceCalculator
class, and it is performed separately for categories and separately for the actual texts of the stories.
It is called from main.py
, using either the following for categories:
CoherenceCalculator.calculate_cat(story_items)
or the following for the actual story texts:
CoherenceCalculator.calculate_txt(story_items)
You can enable either process by setting CATEGORY_COHERENCE = True
and STORY_COHERENCE = True
in definitions.py
.
Both processes are managed by the Classifier
class.
Both of them are initiated in main.py
, training by calling:
classifier.train_models()
and predictions by calling:
classifier.make_predictions()
After training, models are stored in the resources
folder and are loaded from there when predictions are requested.
You can enable model training by setting TRAIN_MODELS = True
, and you can test that the predictions code is working (albeit using the training set) by setting TEST_PREDICTIONS = True
in definitions.py
.