Welcome!

This is the documentation for a project that analyzes a dataset from Kaggle to identify toxic content in Wikipedia comments. Detecting such content helps make online spaces safer and more inclusive for their users.

Content

asset_count.py -> Functions for counting rows per label and per label value.
data_exploration.py -> A wrapper for calling functions that explain the training dataset.
feature_extraction.py -> Splits the data and tokenizes it for training and testing the models. Can be customized to use specific vectorizers.
modelling.py -> Functions for training and evaluating the models.
starter.py -> The main entry point. Can be customized to run only data exploration or model training with specific models.
text_preprocessing.py -> Functions for tokenization and storing a clean dataset used in later steps.
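
As an illustration of the kind of counting asset_count.py is described as doing, here is a minimal sketch that tallies rows per label and per label value. It is not the project's code: the function name count_rows_per_label, the label column names, and the sample rows are all assumptions made for this example, and it uses only the standard library.

```python
from collections import Counter

# Hypothetical label columns; the real dataset's column names may differ.
LABELS = ["toxic", "obscene", "insult"]


def count_rows_per_label(rows, labels=LABELS):
    """Return {label: Counter({value: row_count})} for each label column."""
    counts = {label: Counter() for label in labels}
    for row in rows:
        for label in labels:
            counts[label][row[label]] += 1
    return counts


# Tiny made-up sample standing in for the training data.
sample_rows = [
    {"comment_text": "thanks for the edit", "toxic": 0, "obscene": 0, "insult": 0},
    {"comment_text": "you are an idiot", "toxic": 1, "obscene": 0, "insult": 1},
    {"comment_text": "nice article", "toxic": 0, "obscene": 0, "insult": 0},
]

label_counts = count_rows_per_label(sample_rows)
for label, values in label_counts.items():
    print(label, dict(values))
```

Counts like these show how imbalanced each label is, which is worth knowing before choosing models and evaluation metrics.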

Setup environment

Install Miniconda and the Azure CLI.
Create the environment: conda env create -f environment.yml
In VS Code, select the newly created environment as the interpreter.
In a terminal, run: conda activate nlp-aimsc
In starter.py, set the name of the model you want to run.
In feature_extraction.py, set the vectorizer you want to use in the splitting() function.
In a terminal, run: python starter.py
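
The two configuration steps above (choosing a model name and choosing a vectorizer) could look roughly like the sketch below. Everything here is an assumption for illustration: the real splitting() signature, the model names accepted by starter.py, and the vectorizers used by the project are not shown in this README, so bag_of_words, majority_baseline, MODELS, MODEL_NAME, and main are hypothetical stand-ins built only on the standard library.

```python
import random
from collections import Counter


def bag_of_words(text):
    """Toy vectorizer (assumption): map a comment to lowercase token counts."""
    return Counter(text.lower().split())


def splitting(texts, labels, vectorizer=bag_of_words, test_ratio=0.25, seed=0):
    """Shuffle the data, split it into train/test, and vectorize each comment."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [vectorizer(texts[i]) for i in train_idx]
    x_test = [vectorizer(texts[i]) for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def majority_baseline(x_train, y_train, x_test):
    """Placeholder model: always predict the most common training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common for _ in x_test]


# Hypothetical registry of model names to training/prediction functions.
MODELS = {"majority_baseline": majority_baseline}

# Hypothetical setting: the model name you would edit before running.
MODEL_NAME = "majority_baseline"


def main(texts, labels, model_name=MODEL_NAME, vectorizer=bag_of_words):
    """Split and vectorize the data, run the chosen model, return accuracy."""
    x_train, x_test, y_train, y_test = splitting(texts, labels, vectorizer)
    predictions = MODELS[model_name](x_train, y_train, x_test)
    correct = sum(p == y for p, y in zip(predictions, y_test))
    return correct / len(y_test)


# Made-up comments and labels, only to exercise the sketch.
sample_texts = ["comment number %d" % i for i in range(8)]
sample_labels = [0, 0, 1, 0, 0, 1, 0, 0]
print("accuracy:", main(sample_texts, sample_labels))
```

Passing the vectorizer as a parameter keeps the choice in one place, so swapping it does not require touching the model code.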
