Cyberbullying detection in text using machine learning and deep learning (v 1.0)
This repository contains Python code to detect cyberbullying in text using machine learning algorithms such as Support Vector Machine (SVM) and deep learning models such as GRU with GloVe embeddings and RoBERTa with a multi-layer perceptron (MLP) on top.
Manifest
- A list of the top-level files in this project with a description of what each file is.
- images/ ----> contains images that are used in README.md
- Lakehead_RM_Cyberbullying_Detection_Project.ipynb.ipynb ----> Python notebook containing the code for the project.
- README.md ----> This markdown file you are reading.
Environment
This notebook requires Python 3.0 or later. Most of the libraries come pre-installed with Google Colab. The remaining required libraries can be installed by running the first block of the notebook.
How to run the code
- Open the notebook in Google Colab. Here is a short video tutorial about how to use Google Colab; if you prefer reading, please visit this link to learn about Google Colab.
- Once the notebook is open, install the packages required for the project by running the first block of the notebook, which installs all required libraries. The second block imports all libraries required for the project.
- After this, you may need to download the data using this link. Unpack the data and upload it to your Google Drive. The unpacked data is 171 MB, which is why it cannot be uploaded to GitHub.
- You may need to change the data paths in the notebook. Replace the paths assigned to the path variables in the notebook with the location of your data.
Project Description
Workflow
The following diagram shows the complete workflow.
Introduction
This repo contains a `Lakehead_RM_Project` notebook with code for the following work.
- Data
- Data Preprocessing
- Data Analysis and plotting
- Model Building
- Model Evaluation
- Results
Dataset
The dataset used for this project can be found at this website. The zipped dataset is 64 MB and contains 8 CSV files. There are 5 columns, of which we use `Text` and `oh_label` for analysis and modelling.
Data Pre-processing
After merging all 8 files, a single data frame with 448880 rows was created. This data contains many duplicate rows and blank rows, along with other anomalies. The following list shows the data preprocessing steps.
- Converted all text to lowercase.
- Expanded contractions, such as `isn't` to `is not`.
- Removed hyperlinks from the text.
- Removed punctuation from the text.
- Removed single characters except `a`.
- Removed all non-ASCII characters from the text.
- Trimmed extra space from the text.
- Removed stopwords from the text.
- Balanced output label counts.
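The text-cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function name `preprocess`, the tiny `CONTRACTIONS` map and the `STOPWORDS` set are hypothetical stand-ins, and de-duplication and label balancing are left out because they operate on the whole data frame rather than on a single string.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables for illustration only;
# the notebook's real contraction map and stopword list are not shown here.
CONTRACTIONS = {"isn't": "is not", "can't": "cannot", "don't": "do not"}
STOPWORDS = {"the", "is", "a", "an", "to", "and", "of"}


def preprocess(text: str, remove_stopwords: bool = True) -> str:
    """Apply the listed cleaning steps to one string, in the listed order."""
    # 1. Lowercase.
    text = text.lower()
    # 2. Expand contractions (e.g. "isn't" -> "is not").
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # 3. Remove hyperlinks.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # 4. Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 5. Remove single-character tokens except "a".
    text = " ".join(t for t in text.split() if len(t) > 1 or t == "a")
    # 6. Remove non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 7. Trim extra whitespace.
    tokens = text.split()
    # 8. Remove stopwords.
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)
```

Passing `remove_stopwords=False` keeps the stopwords, which is useful when comparing text with and without them.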
Data Analysis
- Created word clouds to see the most frequently occurring words with and without stopwords.
- Checked each sentence for profanity and plotted a bar graph of how many sentences contain it.
- Analyzed the maximum and minimum sentence lengths to inform the model design.
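The profanity count and sentence-length checks above could be computed along these lines before plotting. This is only a sketch: the `PROFANITY` set is a harmless hypothetical placeholder, whitespace tokenization is an assumption, and the function names are not taken from the notebook.

```python
# Hypothetical placeholder word list; a real analysis would use a proper lexicon.
PROFANITY = {"darn", "heck"}


def contains_profanity(sentence, words=PROFANITY):
    """True if any whitespace-separated token of the sentence is in the word list."""
    return any(token in words for token in sentence.lower().split())


def profanity_counts(sentences):
    """Count sentences with and without profanity (the two bars of a bar graph)."""
    flagged = sum(contains_profanity(s) for s in sentences)
    return {"with_profanity": flagged, "without_profanity": len(sentences) - flagged}


def length_range(sentences):
    """Return (minimum, maximum) sentence length in words."""
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)
```

The resulting counts can be passed straight to a plotting library, and the length range helps when choosing a fixed input length for the sequence models.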
Model Building
1. Support Vector Machine (SVM)
Created a Linear SVM and a Kernel SVM as baseline machine learning models to check how classical machine learning performs on text data. Both models use the following approach:
- Vectorized the data using TF-IDF.
- Split the data into train, test and validation sets.
- Trained both models using the scikit-learn library.
The following images show the difference between Linear and Kernel SVM hyperplanes.
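To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting itself. It is not the notebook's code: in practice scikit-learn's `TfidfVectorizer` would normally be used, and by default it smooths the IDF term and L2-normalizes each row, both of which this sketch omits. The function name `tfidf` and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors
```

Terms that occur in every document get a weight of zero, so the classifier sees only the words that help tell documents apart.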
2. GRU with GloVe Embedding
GRU (Gated Recurrent Unit) is a type of recurrent neural network (RNN) that works well with sequences such as text.
- Its gates allow it to learn long sequences of text.
- It is well known for capturing the context of a sentence, using its gates to remember earlier information in the sentence.
- GloVe stands for Global Vectors. It is a count-based, unsupervised learning model that captures both global and local statistics of a corpus to produce vector representations of words.
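The role of the gates can be illustrated with a single GRU time step written in plain Python. This is a sketch of one common formulation (reset gate applied to the previous state before the recurrent product, new state as `z * h_prev + (1 - z) * candidate`), not the model used in the notebook, which would rely on a deep learning library. The parameter dictionary keys (`Wz`, `Uz`, `bz`, and so on) are hypothetical names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gate(W, U, b, x, h, activation):
    """Compute activation(W x + U h + b) element-wise."""
    return [activation(wx + uh + bi)
            for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h_prev, p):
    """One GRU step: x is the input vector, h_prev the previous hidden state."""
    # Update gate: how much of the previous state to keep.
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h_prev, sigmoid)
    # Reset gate: how much of the previous state feeds the candidate.
    r = gate(p["Wr"], p["Ur"], p["br"], x, h_prev, sigmoid)
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate state built from the input and the reset-scaled previous state.
    candidate = gate(p["Wh"], p["Uh"], p["bh"], x, reset_h, math.tanh)
    # Blend the previous state and the candidate using the update gate.
    return [zi * hi + (1.0 - zi) * ci
            for zi, hi, ci in zip(z, h_prev, candidate)]
```

When the update gate is close to 1, the previous state passes through almost unchanged, which is how information from early in a sentence can survive many steps.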
The following image shows a GRU unit.
3. RoBERTa and Multi-Layer Perceptron (MLP)
RoBERTa is a bidirectional encoder representation model that uses the Transformer as its base architecture.
- It helps to learn and predict hidden patterns in the text.
- It modifies key hyperparameters of BERT and removes the next-sentence prediction objective.
- To improve classification results, an MLP has been added on top of RoBERTa.
The following image shows the Transformer architecture.
Model Evaluation
- The data was divided as follows:
  - 80% training data.
  - 10% validation data, used during training.
  - 10% testing data.
- Sentences are standardized to a length of 150 words; padding is added to shorter sentences to reach this length.
- The GRU model was trained for 10 epochs and RoBERTa for 5 epochs.
- We measure the effectiveness of each model using F-1 score, accuracy, precision and recall.
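The evaluation setup above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the notebook's code: the function names are hypothetical, a pad id of 0 is assumed, truncating sentences longer than 150 tokens is an assumption (the text only mentions padding), and the metrics are computed for the positive class only, which may differ from the averaging used for the reported results.

```python
import random


def split_80_10_10(rows, seed=0):
    """Shuffle and split rows into 80% train, 10% validation, 10% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10
    n_val = len(rows) // 10
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])


def pad_or_truncate(token_ids, max_len=150, pad_id=0):
    """Right-pad with pad_id (or cut) so every sequence has max_len items."""
    return (list(token_ids) + [pad_id] * max_len)[:max_len]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Fixing the seed keeps the three subsets reproducible across runs, so every model is evaluated on the same test rows.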
Result
The following accuracy, precision, recall and F-1 scores were obtained on the test data:
| Metric | Linear SVM | Kernel SVM | GRU + GloVe | RoBERTa + MLP |
|---|---|---|---|---|
| Accuracy | 0.856 | 0.855 | 0.841 | 0.899 |
| Precision | 0.859 | 0.856 | 0.849 | 0.875 |
| Recall | 0.858 | 0.855 | 0.831 | 0.831 |
| F-1 Score | 0.858 | 0.855 | 0.836 | 0.881 |