
EDA_for_hate_speech

Abstract - Using textual data augmentation in NLP, to improve deep learning model performance on an imbalanced hate speech dataset. Tools used Python, Keras, Scikit-learn & Tensorflow.


Key Terms

Data imbalance:

When the classes in a dataset are unequally represented, with one class outnumbering the other or others by a wide margin.

Data Augmentation:

Entails using available data to create additional synthetic data samples.

Problem

In a classification task on an imbalanced dataset where the minority class is essential to the problem being tackled, the abstractions learnt by the classification algorithm are skewed towards the majority class, so performance is poorer on the underrepresented class. As a consequence, the key performance metric, performance on the minority class, is poor.

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Dataset Used:

The dataset chosen for this task is the hatebase dataset (Davidson et al. 2017). This imbalanced dataset was annotated via CrowdFlower from a random sample of 25,000 tweets. It consists of three classes:

  • Hate - 0
  • Offensive - 1
  • Neither - 2
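A quick tally makes the imbalance concrete. The per-class counts below are the approximate figures reported for this dataset (roughly 1,430 hate, 19,190 offensive, 4,163 neither); treat them as illustrative and recompute from the actual files:

```python
from collections import Counter

# Approximate per-class counts reported for the Davidson et al. dataset;
# illustrative only -- recompute from the actual train/val/test files.
labels = [0] * 1430 + [1] * 19190 + [2] * 4163

counts = Counter(labels)
total = len(labels)
for cls, name in [(0, "Hate"), (1, "Offensive"), (2, "Neither")]:
    share = counts[cls] / total
    print(f"{name} ({cls}): {counts[cls]} tweets, {share:.1%} of the data")
```

Note that the minority class (Hate), the one that matters most here, makes up well under 10% of the data.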

Data Augmentation Method Used:

Easy Data Augmentation (EDA) is a group of 4 simple, easy-to-implement data augmentation techniques created by Jason Wei. They are listed as follows:

Synonym Replacement (SR): Goes through each sentence, selects a number of words “n” (not stop words) at random, and replaces each selected word with one of its synonyms chosen at random.

Random Swap (RS): Iterates through each sentence, randomly selecting two words and swapping their positions; this is done “n” times.

Random Deletion (RD): Goes through each sentence in the corpus and removes each word with probability “p”.

Random Insertion (RI): Goes through each sentence, selects a number of words “n” (not stop words) at random, and inserts a random synonym of each selected word at a random position in the sentence.
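Two of these techniques (RD and RS) can be sketched in plain Python with no external dependencies; the helper names and the word-level tokenization are my own, and the synonym-based techniques (SR, RI) would additionally need a synonym source such as WordNet:

```python
import random

def random_deletion(words, p):
    """Random Deletion (RD): drop each word with probability p, keeping at least one."""
    if len(words) == 1:
        return words
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words, n):
    """Random Swap (RS): swap two randomly chosen word positions, n times."""
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

random.seed(0)
sentence = "the film was not what i expected at all".split()
print(" ".join(random_swap(sentence, n=2)))
print(" ".join(random_deletion(sentence, p=0.2)))
```

Both operations preserve most of the sentence's content while perturbing its surface form, which is what makes them useful for generating extra minority-class samples.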

Model Pipeline:

The LSTM model architecture consists of 6 layers: an embedding layer, a spatial dropout layer, a 128-neuron LSTM layer with input dropout and recurrent dropout, a dense layer with a ReLU activation function, a dropout layer, and an output dense layer with a softmax activation function.
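The stack described above can be sketched in Keras as follows. The vocabulary size, embedding dimension, sequence length, dense width, and dropout rates are placeholders, not the values used in the notebook:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D,
                                     LSTM, Dense, Dropout)

VOCAB_SIZE = 20_000  # placeholder vocabulary size
EMBED_DIM = 100      # placeholder embedding dimension
MAX_LEN = 50         # placeholder padded sequence length

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, EMBED_DIM),               # 1. embedding layer
    SpatialDropout1D(0.2),                          # 2. spatial dropout layer
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),  # 3. LSTM with input + recurrent dropout
    Dense(64, activation="relu"),                   # 4. dense layer, ReLU
    Dropout(0.5),                                   # 5. dropout layer
    Dense(3, activation="softmax"),                 # 6. output: Hate / Offensive / Neither
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The softmax output has three units, one per class, matching the Hate/Offensive/Neither labelling of the dataset.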

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Running the code

To run the code, make sure you have the following:

Install NLTK

Download WordNet

Please note: SR = Synonym Replacement, RS = Random Swap, RD = Random Deletion, RI = Random Insertion, DA = Dataset, HS = Hate Speech

After that, proceed as follows:

Download the .ipynb file titled '000_DA_HS' & dataset folder

Download the .ipynb file titled '000_DA_HS.ipynb' from the code. Also download the dataset folder, which contains the test, validation, train, and augmented train files.

Open the 000_DA_HS.ipynb

Open 000_DA_HS.ipynb and proceed to the second block of code to change the file paths for the test, validation, and train data (choose the training sample of your choice).

The source test, validation and train data are titled:

Test = test_data_hs | Train = train_data_hs | Validation = val_data_hs

Run the 000_DA_HS.ipynb code file

Click run until you reach the code block for training and observe the results. To run a different dataset, all you need to do is change the path for the train data on the second line of the code block.

Running the code with augmented datasets

There are 15 concatenated combinations based on the four EDA techniques, giving 15 augmented training datasets. They are grouped by the number of augmentations performed, each separated by a comma (,). They are listed as follows:

Single Augmentation

SR, RS, RD, RI

Double Augmentation

SR_RS, SR_RD, SR_RI, RS_RD, RS_RI, RD_RI

Three or more Augmentation

SR_RS_RD, SR_RS_RI, SR_RI_RD, RI_RS_RD, SR_RS_RD_RI

//////////////////////////////////////////////////////

Please note: each augmented train data has the term "train_hs" before its corresponding name

//////////////////////////////////////////////////////

Results Obtained:

Comparing the results obtained using the baseline dataset and the augmented datasets across a variety of performance metrics (accuracy, precision, recall, F1 score, hate recall).
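These metrics can be reproduced with scikit-learn; "hate recall" is taken here to mean the recall of class 0 (Hate) on its own, the minority class the augmentation is meant to help. The labels and predictions below are made up purely for illustration:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Made-up labels/predictions (0 = Hate, 1 = Offensive, 2 = Neither).
y_true = [0, 0, 1, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 1, 1, 1, 1, 2, 0, 0, 1, 2]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
# "Hate recall": recall computed on class 0 alone.
hate_recall = recall_score(y_true, y_pred, labels=[0], average=None)[0]

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} hate_recall={hate_recall:.2f}")
```

Tracking hate recall separately matters because, on this dataset, a model can post high overall accuracy while barely detecting the Hate class at all.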

Best percentage improvement (baseline dataset versus top-performing dataset) in each metric category:

Overall table of metric performance:

Findings from metric performance:

  • Correlation between high accuracy and low hate recall
  • Correlation between high precision + high recall and high F1 score
  • Correlation between RD and high hate recall
  • Correlation between RI and low hate recall
  • Diminishing rate of improvement from one augmentation class (single, double, three or more) to the next