Cyberbullying detection in text using machine learning and deep learning (v 1.0)

This repository contains Python code to detect cyberbullying in text using machine learning algorithms such as Support Vector Machine (SVM) and deep learning models such as GRU + GloVe and RoBERTa with a multi-layer perceptron (MLP) on top.

Manifest

A list of the top-level files in this project with a description of what each file is.

- images/     ----> contains images that are used in README.md

- Lakehead_RM_Cyberbullying_Detection_Project.ipynb.ipynb 
              ----> Python notebook containing the code for the project

- README.md   ----> This markdown file you are reading.

Environment

This notebook requires Python 3.0+. Most of the libraries come pre-installed with Google Colab. The remaining required libraries can be installed by running the first code block of the notebook.

How to run the code

TODO

1. Open the notebook in Google Colab. Here is a short video tutorial on how to use Google Colab; if you prefer reading a blog, please visit this link to learn about Google Colab.
2. Once the notebook is open, install the packages required for the project by running the first block of the notebook. The second block imports all libraries required for the project.
3. After this, you may need to download the data using this link. Unpack the data and upload it to your Google Drive. After unpacking, the data is 171 MB, which is why it cannot be uploaded to GitHub.
4. You may need to change the data path in the notebook. Replace the path in the notebook with your own data path. Path variables in the notebook look like the following.
5. This notebook also requires the GloVe 300d embedding; to download it, please visit this link. Download the GloVe weights from the link and upload them to your Google Drive. You may need to change the variable `glove_path`: assign the Google Drive path of your GloVe file to `glove_path`. After this, run all the blocks of the notebook to reproduce the results.
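Loading the embedding can be sketched as follows. This assumes the downloaded file is in the standard GloVe text format (one token per line, followed by its space-separated vector components). The helper name `load_glove` and its `dim` parameter are illustrative and are not taken from the notebook.

```python
def load_glove(glove_path, dim=300):
    """Parse a GloVe text file into a {token: vector} dictionary.

    `load_glove` is a hypothetical helper; the notebook may load the
    embedding differently.
    """
    embeddings = {}
    with open(glove_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            # Take the last `dim` fields as the vector so that tokens
            # containing spaces are still handled.
            token = " ".join(parts[:-dim])
            values = parts[-dim:]
            if not token or len(values) != dim:
                continue  # skip malformed or truncated lines
            embeddings[token] = [float(v) for v in values]
    return embeddings
```

With the file on Google Drive, `load_glove(glove_path)` would return a dictionary keyed by token, from which an embedding matrix for the GRU model can be built.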

Project Description

Workflow

The following diagram shows the complete workflow.

Introduction

This repo contains the Lakehead_RM_Project notebook, which contains code for the following stages:

- Data
- Data Preprocessing
- Data Analysis and plotting
- Model Building
- Model Evaluation
- Results

Dataset

The dataset used for this project can be found at this website. The zipped dataset is 64 MB and contains 8 CSV files. There are 5 columns, of which only `Text` and `oh_label` are used for analysis and modelling.

Data Pre-processing

Merging all 8 files produced a single data frame with 448880 rows. This data contains many duplicate rows and blank rows, along with other anomalies. The following list shows the data processing steps:

- Converted all text to lowercase.
- Expanded contractions, for example `isn't` to `is not`.
- Removed hyperlinks from the text.
- Removed punctuation from the text.
- Removed single characters except `a`.
- Removed all non-ASCII characters from the text.
- Trimmed extra whitespace from the text.
- Removed stopwords from the text.
- Balanced the output label counts.
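The steps above can be sketched with the standard library as follows. The contraction map and stopword list here are tiny illustrative subsets, and the names `clean_text` and `balance_labels` are hypothetical. The README does not say how the labels were balanced, so undersampling to the smallest class is an assumption.

```python
import random
import re
import string

# Illustrative subsets only; the notebook's actual lists are not shown here.
CONTRACTIONS = {"isn't": "is not", "can't": "cannot", "don't": "do not"}
STOPWORDS = {"the", "is", "an", "and", "to", "of", "you", "are"}

# Map every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Apply the listed cleaning steps to one string, in the listed order."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
    text = text.translate(PUNCT_TABLE)                  # punctuation
    text = re.sub(r"\b(?!a\b)[a-z0-9]\b", " ", text)    # single chars except "a"
    text = text.encode("ascii", errors="ignore").decode("ascii")  # non-ASCII
    # split() also collapses repeated whitespace and trims both ends.
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def balance_labels(rows, seed=0):
    """Undersample each class of (text, label) rows to the smallest class size."""
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced
```

Undersampling discards data from the larger class; oversampling the smaller class would be an equally plausible reading of "balanced".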

Data Analysis

- Created word clouds to see the most frequently occurring words with and without stopwords.
- Checked each sentence for profanity and plotted a bar graph of how many sentences contain it.
- Analyzed the maximum and minimum sentence lengths to inform the model design.
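The counts behind the bar graph and the length analysis could be computed along these lines. This is a minimal sketch: `length_stats` and `count_flagged` are hypothetical helpers, and the list of flagged words is supplied by the caller because the notebook's profanity source is not described in this README.

```python
def length_stats(sentences):
    """Return the minimum, maximum and mean sentence length in words."""
    lengths = [len(sentence.split()) for sentence in sentences]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


def count_flagged(sentences, flagged_words):
    """Count sentences that contain at least one word from `flagged_words`."""
    flagged = set(flagged_words)
    hits = sum(1 for s in sentences if flagged & set(s.lower().split()))
    return {"with": hits, "without": len(sentences) - hits}
```

The two numbers returned by `count_flagged` correspond to the two bars of such a graph.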

Model Building

1. Support Vector Machine (SVM)

A linear SVM and a kernel SVM were created as baseline machine learning models to check how classical machine learning performs on text data. Both models use the following approach:

- Vectorized the data using TF-IDF.
- Split the data into train, test and validation sets.
- Trained both models using the scikit-learn library.

The following images show the difference between the linear and kernel SVM hyperplanes.
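To illustrate what the TF-IDF vectorization step computes, here is a small standard-library sketch. It is not the notebook's code, which relies on scikit-learn. It uses raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which matches the default behaviour of scikit-learn's `TfidfVectorizer`. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for whitespace tokens."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents each token appears in.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocab, vectors
```

Tokens that appear in few documents receive larger weights than tokens that appear everywhere, which is what lets a linear classifier such as an SVM separate the classes on sparse text features.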

2. GRU with GloVe Embedding

A Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) that works well with sequences such as text.

- It can learn long sequences of text using its gates.
- It is well known for capturing the context of a sentence by using its gates to retain earlier information from the sentence.
- GloVe (Global Vectors) is a count-based, unsupervised learning model that uses both global and local statistics of a corpus to produce vector representations of words.

The following image shows a single GRU unit.
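The role of the gates can be made concrete with a single GRU time step written in plain Python. This is an illustrative sketch, not the notebook's model, which would use a deep learning framework. The parameter names (`Wz`, `Uz`, `bz`, and so on) are assumptions, and the final blend uses the convention `h = (1 - z) * h_prev + z * h_cand`; some references swap the roles of `z` and `1 - z`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Compute W @ x + U @ h + b using plain Python lists."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    """Advance the hidden state by one time step.

    x: input vector, h_prev: previous hidden state,
    p: dict of weight matrices and bias vectors (hypothetical names).
    """
    # Update gate: how much of the new candidate to let in.
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]
    # Reset gate: how much of the past state feeds the candidate.
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(v) for v in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    # Blend the previous state with the candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]
```

Calling `gru_step` once per token, with each token's GloVe vector as `x`, carries information from earlier words forward in the hidden state.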

3. RoBERTa and Multi-Layer Perceptron (MLP)

RoBERTa is a bidirectional encoder representation model that uses the Transformer as its base architecture.

- It helps to learn and predict hidden patterns in the text.
- It modifies key hyperparameters of BERT and removes the next-sentence prediction objective.
- To improve the classification results, an MLP has been added on top of RoBERTa.

The following image shows the Transformer architecture.

Model Evaluation

The data was divided as follows:

- 80% training data.
- 10% validation data, used during training.
- 10% testing data.
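A split with these proportions can be sketched as follows. The function name `split_80_10_10`, the seed, and the use of a plain shuffle are assumptions; the notebook may use a library helper or a stratified split instead.

```python
import random


def split_80_10_10(rows, seed=42):
    """Shuffle rows and split them into train, validation and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(rows) * 8 // 10
    n_val = len(rows) // 10
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # remainder, roughly 10%
    return train, val, test
```

Fixing the seed keeps the three subsets identical across runs, which makes the reported test metrics reproducible.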

- The standard sentence length is 150 words; padding is added to shorter sentences to reach this length.
- The GRU model was trained for 10 epochs and RoBERTa for 5 epochs.
- Model effectiveness is measured using F-1 score, accuracy, precision and recall.
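Bringing every sentence to the fixed length of 150 tokens can be sketched as below. The README only mentions padding, so truncating longer sentences, padding at the end rather than the start, and the `<pad>` token are all assumptions, as is the helper name `pad_or_truncate`.

```python
def pad_or_truncate(tokens, max_len=150, pad_token="<pad>"):
    """Return exactly `max_len` tokens, padding at the end when too short."""
    tokens = list(tokens)[:max_len]  # assumption: longer inputs are cut
    return tokens + [pad_token] * (max_len - len(tokens))
```

A fixed length lets sentences be stacked into equally sized batches for the GRU and RoBERTa models.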

Result

The following accuracy, precision, recall and F-1 scores were obtained on the test data:

| Model | Linear SVM | Kernel SVM | GRU + GloVe | RoBERTa + MLP |
|---|---|---|---|---|
| Accuracy | 0.856 | 0.855 | 0.841 | 0.899 |
| Precision | 0.859 | 0.856 | 0.849 | 0.875 |
| Recall | 0.858 | 0.855 | 0.831 | 0.831 |
| F-1 Score | 0.858 | 0.855 | 0.836 | 0.881 |
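For reference, the four metrics can be computed from true and predicted labels as in this sketch. It treats the task as binary with label `1` as the positive class, which is an assumption; the README does not state which averaging scheme produced the reported numbers, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F-1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Precision and recall are reported alongside accuracy because they show separately how many flagged texts were truly bullying and how many bullying texts were caught.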

girijesh97/LU_RM_Project