Toxic Comment Classification system by "Team 16.6" 🛑️

Team 16.6 logo

In the modern era of social media, toxicity in online comments poses a significant challenge, creating a negative atmosphere for communication. From abuse to insults, toxic behavior discourages the free exchange of thoughts and ideas among users.

This project seeks to address this issue by developing a machine learning model to identify and classify varying levels of toxicity in comments. Leveraging the power of BERT (Bidirectional Encoder Representations from Transformers), this system aims to:

  • Analyze text for signs of toxicity
  • Classify toxicity levels effectively
  • Support moderators and users in fostering healthier and safer online communities

By implementing this technology, the project strives to make social media a more inclusive and positive space for interaction.

🀝 Team

Team 16.6 means that each of the six team members made an equal contribution (about 16.6%) to the project ⚖️.

The project was divided into tasks, which in turn were assigned to the following roles:

Design Director - Polina Mamchur

Data science - Olena Mishchenko

Backend - Ivan Shkvyr, Oleksandr Kovalenko

Frontend - Oleksii Yeromenko

Team Lead - Serhii Trush

Scrum Master - Oleksandr Kovalenko, Polina Mamchur

🎨 Desing

The project started with design development. First, a prototype of the user interface was designed:

Design prototype

The design director worked to create a visually appealing application and to prepare the presentation of the project to stakeholders. By focusing on both UI/UX and presentation design, she was able to bring the team's vision to life and effectively communicate the value of the project.

A presentation on the project has been developed and can be viewed here (in PDF format):

https://github.com/techn0man1ac/ToxicCommentClassification/blob/main/design/Presentation/ToxicCommentClassification.pdf

🛠️ Technologies

  • 🖼️ Figma: An online interface design and prototyping service with support for collaborative work
  • 🐍 Python: The application was developed in Python 3.11.8
  • 🤗 Transformers: A library that provides access to BERT and other advanced machine learning models
  • 🔥 PyTorch: A deep learning library with GPU computing support
  • 📖 BERT: A text analysis model used to produce contextualized word embeddings
  • ☁️ Kaggle: Cloud computing used to train the models and save time
  • 🌐 Streamlit: The package used to build the frontend user interface
  • 🐳 Docker: A platform for building, deploying, and managing containerized applications

🖥️ Data Science

At this stage, work was done on dataset research and data processing to prepare the data for training the machine learning models.

📊 Dataset (EDA)

To train the machine learning models, we used the Toxic Comment Classification Challenge dataset. The dataset covers the following types of toxicity:

  • Toxic
  • Severe Toxic
  • Obscene
  • Threat
  • Insult
  • Identity Hate

The primary datasets (train.csv, test.csv, and sample_submission.csv) are loaded into Pandas DataFrames. We then performed exploratory data analysis (EDA) on the DataFrames and obtained the following results:

Data toxic distribution

As you can see from the data analysis, there is a class imbalance of roughly 1 to 10 (toxic to non-toxic).

Distribution of classes:

Class                  Count     Percentage
Toxic                  15,294    9.58%
Severe Toxic           1,595     1.00%
Obscene                8,449     5.29%
Threat                 478       0.30%
Insult                 7,877     4.94%
Identity Hate          1,405     0.88%
Non-toxic              143,346   89.83%
Total comments         159,571   100%
Multi-label comments   18,873    11.8% of total

As the table shows, the data is multi-label: a single comment can belong to more than one toxicity category.

Here is a visualization of the data from the dataset research, presented as bar graphs:

Dataset in bar graph format

The graphs show basic information about the dataset, including the size and types of its columns. Such an imbalanced ratio in the data would have a very negative impact on the models' prediction accuracy.
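
For reference, here is a minimal sketch of this loading and class-distribution check with pandas (it assumes the Kaggle CSV files are in the working directory and uses the standard competition column names):

  import pandas as pd

  # Load the competition files into DataFrames
  train = pd.read_csv("train.csv")
  test = pd.read_csv("test.csv")
  sample_submission = pd.read_csv("sample_submission.csv")

  labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

  print(train.shape)                              # number of rows and columns
  print(train[labels].sum())                      # comments per toxicity class
  print((train[labels].sum(axis=1) == 0).mean())  # share of completely non-toxic comments
  print((train[labels].sum(axis=1) > 1).sum())    # comments carrying more than one label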

📅 Data processing

Data processing visualization

Because the original dataset is imbalanced, which would hurt the accuracy of the machine learning models, we applied oversampling using the scikit-learn resample function: minority-class examples are copied to balance the classes and increase their weight when the models learn to recognize them.

Class           Original dataset   After processing   Percentage vs. original
Toxic           15,294             40,216             +262%
Severe Toxic    1,595              16,889             +1058%
Obscene         8,449              38,009             +449%
Threat          478                16,829             +3520%
Insult          7,877              36,080             +458%
Identity Hate   1,405              19,744             +1405%
Non-toxic       143,346            143,346            0%
Total           178,444            269,396            +51%

As a result, after processing the data, the toxic/non-toxic balance for each class was roughly 50/50. Thanks to this data processing, classification accuracy increased by several percentage points.
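
A minimal sketch of this oversampling idea with the resample function (the target count below is illustrative; the exact per-class targets used in the notebooks differ):

  import pandas as pd
  from sklearn.utils import resample

  labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
  train = pd.read_csv("train.csv")

  def upsample_label(df, label, target_count, seed=42):
      # Duplicate rows where `label` == 1 until roughly `target_count` positives exist
      positives = df[df[label] == 1]
      extra_needed = target_count - len(positives)
      if extra_needed <= 0:
          return df
      extra = resample(positives, replace=True, n_samples=extra_needed, random_state=seed)
      return pd.concat([df, extra], ignore_index=True)

  balanced = train
  for label in labels:
      balanced = upsample_label(balanced, label, target_count=40_000)  # illustrative target

  print(balanced[labels].sum())  # per-class counts after oversampling

Because the data is multi-label, duplicating rows for one class also raises the counts of the other classes it co-occurs with, which is why every toxic class grows in the table above.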

⚙️ Machine learning (Back End)

To solve the challenge, we chose three popular architectures: BERT, DistilBERT, and ALBERT. Each link leads to the source code used to train the corresponding model.

Here is a visual representation of the main parameters of the models:

Model metrics comparison

Here is a detailed description of each of the machine learning models we trained:

BERT ֎

This project demonstrates toxic comment classification using the bert-base-uncased model from the BERT family. Optuna helped us automate the selection of hyperparameters.
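
A minimal sketch of how such a model can be set up for six-label, multi-label classification with Hugging Face Transformers (illustrative only; the training notebook linked above is the authoritative source):

  import torch
  from transformers import BertTokenizer, BertForSequenceClassification

  MODEL_NAME = "bert-base-uncased"
  LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

  tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
  model = BertForSequenceClassification.from_pretrained(
      MODEL_NAME,
      num_labels=len(LABELS),
      problem_type="multi_label_classification",  # sigmoid outputs, BCE-style loss
  )

  # Encode one comment with the 128-token limit used in this project
  batch = tokenizer("You are awesome!", truncation=True, max_length=128,
                    padding="max_length", return_tensors="pt")
  with torch.no_grad():
      probs = torch.sigmoid(model(**batch).logits)[0]
  print(dict(zip(LABELS, probs.tolist())))  # one probability per toxicity class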

1. Toxic Comment Classification with BERT

  • Utilized the bert-base-uncased model with PyTorch for flexibility and ease of use.
  • Seamlessly integrated with Hugging Face Transformers.
  • Training accelerated by ~30x using a GPU, efficiently handling BERT’s computational demands.

2. Dataset Balancing

  • Addressed dataset imbalance (90% non-toxic, 10% toxic) using oversampling with sklearn.
  • Ensured rare toxic categories received equal attention by balancing class distributions.
  • Improved model performance in recognizing rare toxic classes.

3. Key Techniques (a training-step sketch follows this list)

  • Tokenization: Preprocessed data tokenized using BertTokenizer.
  • Loss Function: Used BCEWithLogitsLoss with weighted loss for rare class emphasis.
  • Gradient Clipping: Optimized training stability with gradient clipping (max_norm).
  • Hyperparameter Tuning: Tuned batch size, learning rate, and epochs using Optuna.
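
A minimal, illustrative sketch of one training step that combines these techniques (the per-class weights and the max_norm value are placeholders, not the tuned values from the notebook):

  import torch
  from torch.nn import BCEWithLogitsLoss
  from transformers import BertTokenizer, BertForSequenceClassification

  tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
  model = BertForSequenceClassification.from_pretrained(
      "bert-base-uncased", num_labels=6, problem_type="multi_label_classification")
  optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

  # Weighted loss: placeholder per-class weights that emphasize the rare classes
  pos_weight = torch.tensor([1.0, 5.0, 1.5, 10.0, 1.5, 5.0])
  loss_fn = BCEWithLogitsLoss(pos_weight=pos_weight)

  texts = ["example comment", "another example comment"]
  targets = torch.tensor([[0, 0, 0, 0, 0, 0],
                          [1, 0, 1, 0, 1, 0]], dtype=torch.float)

  batch = tokenizer(texts, truncation=True, max_length=128, padding=True, return_tensors="pt")
  logits = model(**batch).logits                # shape: (batch_size, 6)
  loss = loss_fn(logits, targets)

  loss.backward()
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
  optimizer.step()
  optimizer.zero_grad()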

4. Threshold Optimization

  • Used itertools.product to find optimal thresholds for each class (see the sketch after this list).
  • Improved recall and F1-score (by 1-1.5%) for better multi-label classification.
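
A minimal sketch of the per-class threshold search with itertools.product (the validation probabilities and labels here are random placeholders, and the candidate grid is illustrative):

  import itertools
  import numpy as np
  from sklearn.metrics import f1_score

  # val_probs: sigmoid outputs on the validation set; val_labels: ground-truth 0/1 labels
  val_probs = np.random.rand(1000, 6)                       # placeholder data
  val_labels = (np.random.rand(1000, 6) > 0.9).astype(int)  # placeholder data

  candidates = [0.3, 0.4, 0.5, 0.6]  # illustrative per-class threshold grid
  best_f1, best_thresholds = -1.0, None

  for thresholds in itertools.product(candidates, repeat=6):
      preds = (val_probs >= np.array(thresholds)).astype(int)
      score = f1_score(val_labels, preds, average="macro", zero_division=0)
      if score > best_f1:
          best_f1, best_thresholds = score, thresholds

  print(best_f1, best_thresholds)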

5. Performance and Key Model Details

  • Validation Metrics:

    • Accuracy: 0.95 ✅
    • Precision: 0.97 ✅
    • Recall: 0.96 ✅
  • Model Specifications:

    • Vocabulary Size: 30,522
    • Hidden Size: 768
    • Attention Heads: 12
    • Hidden Layers: 12
    • Total Parameters: 110M
    • Maximum Sequence Length: 512 (128 tokens used in this project)
    • Pre-trained Tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

DistilBERT ֎

This project demonstrates toxic comment classification using the distilbert-base-uncased model, a lightweight and efficient version of BERT.

1. Using PyTorch

  • Selected for its flexibility, ease of use, and strong community support.
  • Seamlessly integrated with Hugging Face Transformers.

2. Dataset Balancing

  • Addressed dataset imbalance (90% non-toxic, 10% toxic) using sklearn.utils.resample.
  • Applied stratified splitting for training and test datasets.
  • Oversampled rare toxic classes, improving model recognition of all categories.

3. Key Techniques

  • Tokenization: Preprocessed data with DistilBertTokenizer.
  • Loss Function: Binary Cross-Entropy with Logits (BCEWithLogitsLoss).
  • Hyperparameter Tuning: Optimized batch size (16), learning rate (2e-5), and epochs (3) with Optuna (see the sketch after this list).
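
A minimal sketch of how such an Optuna study can be wired up (train_and_evaluate is a hypothetical stand-in for the real training loop, not a function from this repository):

  import optuna

  def train_and_evaluate(batch_size: int, learning_rate: float, epochs: int) -> float:
      # Hypothetical helper: train DistilBERT with these settings and
      # return a validation score such as macro F1. Placeholder value below.
      return 0.0

  def objective(trial: optuna.Trial) -> float:
      batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
      learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
      epochs = trial.suggest_int("epochs", 2, 4)
      return train_and_evaluate(batch_size, learning_rate, epochs)

  study = optuna.create_study(direction="maximize")
  study.optimize(objective, n_trials=20)
  print(study.best_params)  # e.g. {'batch_size': 16, 'learning_rate': 2e-05, 'epochs': 3}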

4. Accelerated Training

  • Utilized GPU for training, achieving a ~30x speedup over CPU.

5. Threshold Optimization

  • Used itertools.product to determine optimal thresholds for each class.
  • Improved recall and F1-score for multi-label classification.

6. Performance and Key Model Details

  • Validation Metrics:

    • Accuracy: 0.92 ✅
    • Precision: 0.79 ✅
    • Recall: 0.78 ✅
  • Model Specifications:

    • Vocabulary Size: 30522
    • Hidden Size: 768
    • Attention Heads: 12
    • Hidden Layers: 6
    • Total Parameters: 66M
    • Maximum Sequence Length: 512 (128 tokens used in this project)
    • Pre-trained Tasks: Masked Language Modeling (MLM).

ALBERT ֎

This project demonstrates toxic comment classification using the albert-base-v2 model, a lightweight and efficient version of BERT designed to reduce parameters while maintaining high performance.

1. Using PyTorch

  • Selected for its flexibility, ease of use, and strong community support.
  • Seamlessly integrated with Hugging Face Transformers.

2. Dataset Balancing

  • Addressed dataset imbalance (90% non-toxic, 10% toxic) using sklearn.utils.resample.
  • Applied stratified splitting for training and test datasets.
  • Oversampled rare toxic classes to improve model recognition of all categories.

3. Key Techniques

  • Tokenization: Preprocessed data with AlbertTokenizer.
  • Loss Function: Binary Cross-Entropy with Logits (BCEWithLogitsLoss).
  • Hyperparameter Tuning: Optimized batch size (8), learning rate (2e-5), and epochs (3) using Optuna.

4. Accelerated Training

  • Utilized GPU for training, achieving a ~30x speedup over CPU.

5. Threshold Optimization

  • Used itertools.product to determine optimal thresholds for each class.
  • Enhanced recall and F1-score for multi-label classification.

6. Performance and Key Model Details

  • Validation Metrics:

    • Accuracy: 0.92 ✅
    • Precision: 0.84 ✅
    • Recall: 0.69 ✅
  • Model Specifications (these values can be checked with the sketch after this list):

    • Vocabulary Size: 30000
    • Hidden Size: 768
    • Attention Heads: 12
    • Hidden Layers: 12
    • Intermediate Size: 4096
    • Total Parameters: 11M
    • Maximum Sequence Length: 512 (128 tokens used in this project)
    • Pre-trained Tasks: Masked Language Modeling (MLM).
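
These values can be cross-checked against the published configuration of albert-base-v2; a minimal sketch using the transformers config API:

  from transformers import AutoConfig

  config = AutoConfig.from_pretrained("albert-base-v2")
  print(config.vocab_size)               # 30000
  print(config.hidden_size)              # 768
  print(config.num_attention_heads)      # 12
  print(config.num_hidden_layers)        # 12
  print(config.max_position_embeddings)  # 512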

We used cloud computing on Kaggle to speed up model training.

💻 How to install

There are two ways to install the application on your computer:

Simple 😎

Download Docker -> Log in to your profile in the application -> Open the Docker terminal (at the bottom of the program) -> Enter the command:

docker pull techn0man1ac/toxiccommentclassificationsystem:latest

After that, all the necessary files will be downloaded from DockerHub -> Go to the Images tab -> Launch the image by clicking Run -> Click Optional settings -> Set the host port 8501

Set host port 8501

Open http://localhost:8501 in your browser.
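
If you prefer the terminal, the same container can also be started with an explicit port mapping in a single command (assuming the image serves Streamlit on its default port 8501 inside the container):

docker run -p 8501:8501 techn0man1ac/toxiccommentclassificationsystem:latest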

Like a pro 💪

This way requires some command line, GitHub, and Docker skills.

  1. Clone the repository:
git clone https://github.com/techn0man1ac/ToxicCommentClassification.git
  2. Download the model files from this link; after downloading, put the albert, bert, and distilbert directories into the frontend\saved_models directory, like this:

Catalog with models

  3. Open a command line/terminal, navigate to the ToxicCommentClassification directory, and build and start the application with the command:
docker-compose up
  4. After that, the application will start and a browser window will open at http://localhost:8501

To turn off the application, run the command:

docker-compose down

🚀 How to use (Front End)

After launching the application, you will see the project's home tab with a description of the application and the technologies used in it. The program looks like this when running:

Models test

The application interface is intuitive and user-friendly.

The structure of the tabs is as follows:

  • Home - Here you can find a description of the app, the technologies used for its operation, the mission and vision of the project, and acknowledgments
  • Team - This tab contains those without whom the app would not exist, its creators
  • Metrics - In this tab, you can choose one of the 3 models; after selecting one, the technical characteristics of that machine learning model are loaded
  • Classification - A tab where you can test the work of models.

Models test - That f@@ing awesome

The main elements of the interface:

  • Choose your model - A drop-down list where you can select one of 3 pre-trained machine learning models
  • Enter your comment here - In this field you can manually enter text to check it for toxicity and, if it turns out to be toxic, classify it by category
  • Upload your text file - Clicking here opens a dialog box for choosing a file in txt format (after a file is uploaded, the text in the text field is ignored)
  • Display detailed toxicity - A checkbox that displays a detailed classification by class if the model considers the text to be toxic

Classify tab interface

The application is also able to classify text files in txt format.

The app, written with the help of Streamlit, provides a user-friendly interface for observing and trying out the functionality of the included BERT-based models for comment toxicity classification.
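
A minimal sketch of how these interface elements can be wired together in Streamlit (the predict_toxicity helper and widget labels are illustrative, not the project's exact code):

  import streamlit as st

  LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

  def predict_toxicity(text: str, model_name: str) -> dict:
      # Hypothetical helper: run the selected BERT-family model on the text and
      # return per-class toxicity probabilities. Placeholder values below.
      return {label: 0.0 for label in LABELS}

  model_name = st.selectbox("Choose your model", ["BERT", "DistilBERT", "ALBERT"])
  comment = st.text_area("Enter your comment here")
  uploaded = st.file_uploader("Upload your text file", type=["txt"])
  show_details = st.checkbox("Display detailed toxicity")

  # An uploaded file takes priority over the text field, as in the app
  text = uploaded.read().decode("utf-8") if uploaded is not None else comment
  if text:
      scores = predict_toxicity(text, model_name)
      st.write("Toxic" if max(scores.values()) >= 0.5 else "Not toxic")
      if show_details:
          st.json(scores)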

🎯 Mission

The mission of our project is to create a reliable and accurate machine learning model that can effectively classify different levels of toxicity in online comments. We plan to use advanced technologies to analyze text and create a system that will help moderators and users create healthier and safer social media environments.

🌟 Vision

Our vision is to make online communication safe and comfortable for everyone. We want to build a system that not only can detect toxic comments, but also helps to understand the context and tries to reduce the number of such messages. We want to create a tool that will be used not only by moderators, but also by every user to provide a safe environment for the exchange of thoughts and ideas.

📜 License

This project is group work published under the MIT license, and all project contributors are listed in the license text.

👏 Acknowledgments

This project was developed by a team of professionals as a graduation thesis of the GoIT Python Data Science and Machine Learning course.

Thank you for exploring our project! Together, we can make online spaces healthier and more respectful.