A Natural Language Processing model that analyzes Arabic tweets (in arbitrary dialects) and classifies them by stance (positive, negative, or neutral) and category (news, rumors, advice, unrelated, etc.).
A simple investigation revealed duplicate tweets in the dataset that have different classes. We solved this as follows:
- For each group of duplicate tweets, we kept a single instance and assigned to it the most frequent class in that group, as sketched below.
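A minimal sketch of this step with pandas; the `text` and `label` column names are hypothetical:

```python
import pandas as pd

def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    # Collapse duplicate tweets: keep one row per unique text and assign
    # it the most frequent label among its duplicates.
    majority = df.groupby('text')['label'].agg(lambda s: s.mode().iloc[0])
    return majority.reset_index()
```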
After that, we proceed to the data preprocessing step, which is done on three levels:
- Cleaning Text: Using regex and simple filtering code, we managed to do the following (a code sketch follows the table):

Processing | Example |
---|---|
Remove line break tags | `<LF>`, `<CR>` |
Reduce consecutive duplicate letters to only 2 | "سمعليكووووووو" => "سمعليكوو" |
Remove URLs and mentions | https://btly.blabla, @USER |
Remove hashtag sign | #هيئة_الصحة_العالمية => هيئة_الصحة_العالمية |
Remove numbers, brackets, and quotes | |
Remove English letters | USER |
Remove abbreviations | أ.د محمد => محمد |
Replace _ and - with space | هيئة_الصحة-العالمية => هيئة الصحة العالمية |
Dediacritize text | لَقَاحُ مَرَضِ كَوَّرُونَا => لقاح مرض كورونا |
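A rough sketch of this cleaning pass, using regex filters and camel-tools' dediacritization helper; the exact patterns and their order are illustrative:

```python
import re
from camel_tools.utils.dediac import dediac_ar

def clean(text: str) -> str:
    text = re.sub(r'<LF>|<CR>', ' ', text)          # line break tags
    text = re.sub(r'https?://\S+|@\w+', '', text)   # URLs and mentions
    text = text.replace('#', '')                    # hashtag sign
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)      # cap repeated letters at 2
    text = re.sub(r'أ\.?\s*د\.?\s', '', text)       # abbreviations like أ.د
    text = re.sub(r'[0-9()\[\]{}"\'«»]', '', text)  # numbers, brackets, quotes
    text = re.sub(r'[A-Za-z]+', '', text)           # English letters
    text = re.sub(r'[_-]', ' ', text)               # underscores and dashes
    return dediac_ar(text).strip()                  # remove diacritics
```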
- Tokenizing: With the help of camel-tools, we were able to do the following (a sketch follows this list):
  - Detect the dialect of the tweet; we used this information in the next two points.
  - Perform lemmatization.
  - Perform tokenization.
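A sketch of these camel-tools calls, assuming its pretrained models have already been downloaded (e.g. with the `camel_data` CLI):

```python
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.dialectid import DialectIdentifier
from camel_tools.disambig.mle import MLEDisambiguator

did = DialectIdentifier.pretrained()
mle = MLEDisambiguator.pretrained()

def analyze(tweet: str):
    dialect = did.predict([tweet])[0].top   # top predicted dialect label
    tokens = simple_word_tokenize(tweet)    # tokenization
    # Lemmatization: take the lemma ('lex') of the top-scoring analysis.
    lemmas = [d.analyses[0].analysis['lex'] if d.analyses else d.word
              for d in mle.disambiguate(tokens)]
    return dialect, tokens, lemmas
```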
- Normalizing Characters:
  - Normalize alef and ya' variants according to the following map:

    Replacement | Characters |
    ---|---|
    ا | أ, إ, ا, آ |
    ى | ى, ي, ئ |

  - Replace emoji with equivalent text: 😂 => face_tearing_with_joy
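A minimal sketch of this normalization, using a plain character map for the alef/ya' variants and the `emoji` package for the emoji-to-text step (note that `emoji.demojize` names may differ slightly from the example above, e.g. `:face_with_tears_of_joy:`):

```python
import emoji

# Map alef and ya' variants to their normalized forms (per the table above).
CHAR_MAP = str.maketrans({'أ': 'ا', 'إ': 'ا', 'آ': 'ا', 'ي': 'ى', 'ئ': 'ى'})

def normalize(text: str) -> str:
    text = text.translate(CHAR_MAP)   # unify alef and ya' variants
    return emoji.demojize(text)       # e.g. 😂 -> :face_with_tears_of_joy:
```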
We have extracted the following features to be used for classification:
1- TF-IDF
We extracted the Term Frequency-Inverse Document Frequency measure for unigrams, bigrams, trigrams, a combination of unigrams and bigrams, and a combination of unigrams and trigrams using sklearn.feature_extraction.text.TfidfVectorizer.
2- BOW
We represented the corpus as a Bag of Words using sklearn.feature_extraction.text.CountVectorizer. A sketch of both vectorizers follows.
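A sketch covering both feature extractors; `train_texts` is a hypothetical list of preprocessed tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# TF-IDF over unigrams + bigrams; other n-gram combinations are obtained
# by changing ngram_range (e.g. (1, 1), (2, 2), (3, 3), (1, 3)).
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(train_texts)

# Plain bag-of-words counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(train_texts)
```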
3- Dialect
- We thought there could be a correlation between whether the dialect was MSA or not and the category coming from a more formal source like "Info news" or "plans".
- We used the camel-tools dialect identifier to predict the dialect and used it as an additional feature.
- It did not have a significant effect.
4- Word Embeddings
We also used FastText word embeddings to feed the RNN and LSTM models.
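A sketch of how the FastText vectors might feed an embedding layer; the model file (`cc.ar.300.bin`, the pretrained Arabic Common Crawl vectors) and the vocabulary handling are assumptions:

```python
import numpy as np
import fasttext

ft = fasttext.load_model('cc.ar.300.bin')   # pretrained Arabic FastText model

def embedding_matrix(vocab):
    # One 300-d vector per vocabulary word; this matrix initializes the
    # embedding layer of the RNN/LSTM models.
    return np.stack([ft.get_word_vector(w) for w in vocab])
```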
Classical Models | Sequence Models | Transformer-Based Models |
---|---|---|
Multinomial Naive Bayes | RNN | MarBERT |
SVM | LSTM | AraBERT |
Random Forest | | |
- SVM hyperparameter tuning
We used sklearn's GridSearchCV for hyperparameter tuning with the following search space:
Parameter | Possible Values |
---|---|
C | [0.1, 1, 10] |
Gamma | [0.1, 1] |
Kernel | ['sigmoid', 'rbf', 'linear', 'poly'] |
Tol | [0.001, 0.0001] |
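The corresponding call, as a sketch; `X_train`/`y_train` are hypothetical TF-IDF features and labels, and scoring with macro F1 is our assumption based on the metric used elsewhere in this project:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.1, 1],
    'kernel': ['sigmoid', 'rbf', 'linear', 'poly'],
    'tol': [0.001, 0.0001],
}
search = GridSearchCV(SVC(), param_grid, scoring='f1_macro', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```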
- AraBERT Training Arguments
Argument | Value |
---|---|
learning_rate | 1e-4 |
optimizer | Adam Optimizer |
adam_epsilon | 1e-8 |
batch size | 64 |
epochs | 5 |
metric | macro F1 |
- MarBERT Training Arguments
Argument | Value |
---|---|
learning_rate | 2e-06 |
optimizer | Adam Optimizer |
adam_epsilon | 2e-06 |
batch size | 32 |
epochs | 5 |
metric | macro F1 |
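A condensed sketch of the fine-tuning setup with Hugging Face Transformers, using the AraBERT arguments from the table above; the checkpoint name and the `train_ds`/`val_ds` dataset objects are assumptions:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = 'aubmindlab/bert-base-arabertv2'   # assumed AraBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3)   # 3 labels for the stance task

args = TrainingArguments(
    output_dir='out',
    learning_rate=1e-4,
    adam_epsilon=1e-8,
    per_device_train_batch_size=64,
    num_train_epochs=5,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```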
Here we highlight some of the approaches and combinations and their results on the given data. It is worth mentioning that both the training and testing datasets contained many mislabeled instances.
For all the approaches, we tried the different output options of the preprocessing steps and found that they had a very minor impact on the results. For instance, the difference between using dialect detection and not using it was always about 0.01 in the F1 score; the same goes for keeping or removing the emojis and digits in the text.
The main metric used to evaluate the models is the F1 score; that is, over the 3 stance classes: positive, negative, and neutral.
Approach | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Classical Models (with TF-IDF): | ______ | ______ | ______ | ______ |
Random Forest (TF-IDF - no SMOTE) | 0.80 | 0.65 | 0.37 | 0.37 |
Multinomial Naive Bayes (TF-IDF - SMOTE) | 0.81 | 0.60 | 0.57 | 0.59 |
SVM (TF-IDF - SMOTE, Sigmoid Kernel) | 0.82 | 0.62 | 0.61 | 0.61 |
Sequential Models (with FastText): | ______ | ______ | ______ | ______ |
LSTM (FastText Embeddings) | 0.79 | 0.35 | 0.33 | 0.30 |
RNN (FastText Embeddings) | 0.76 | 0.47 | 0.41 | 0.40 |
Transformer-based Models: | ______ | ______ | ______ | ______ |
MarBERT | 0.83 | 0.69 | 0.60 | 0.63 |
AraBERT | 0.83 | 0.63 | 0.62 | 0.63 |
Note: Precision, Recall and F1 Score are calculated using the macro average method.
The following table shows results for the category classification task; the metrics here are computed over the 10 classes: news, celebrities, plan, request, rumor, advice, restriction, personal, unrelated, and others.
Approach | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
Classical Models (with TF-IDF): | ______ | ______ | ______ | ______ |
Random Forest (TF-IDF - no SMOTE) | 0.58 | 0.34 | 0.17 | 0.18 |
Multinomial Naive Bayes (TF-IDF - SMOTE) | 0.62 | 0.44 | 0.36 | 0.39 |
SVM (TF-IDF - SMOTE, Sigmoid Kernel) | 0.64 | 0.47 | 0.37 | 0.40 |
Sequential Models (with FastText): | ______ | ______ | ______ | ______ |
LSTM (FastText Embeddings) | 0.51 | 0.05 | 0.10 | 0.07 |
RNN (FastText Embeddings) | 0.60 | 0.17 | 0.18 | 0.17 |
Transformer-based Models: | ______ | ______ | ______ | ______ |
MarBERT | 0.69 | 0.41 | 0.35 | 0.37 |
AraBERT | 0.68 | 0.41 | 0.36 | 0.38 |
Note: Precision, Recall and F1 Score are calculated using the macro average method.
- Transformer-based models and classical models with TF-IDF both worked surprisingly well on the stance problem.
- Using SMOTE balancing on the TF-IDF features of the training data almost always improved the results (see the sketch after this list).
- Our preprocessing produced a variety of options (removing vs. keeping emojis and digits, translating them to text, appending the dialect as a feature, etc.). But upon testing them, we found that all of them had very minor effects on the results.
- Our straightforward usage of sequential models (LSTM and RNN) did not perform well. Since we did not have a lot of time to invest in them, it might be worthwhile to play around with the hyperparameters and other details of these models to see if we can get better results.
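As a reference for the SMOTE point above, a minimal sketch of the TF-IDF + SMOTE + SVM combination; `train_texts`/`train_labels` are hypothetical:

```python
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(train_texts)
# Oversample minority classes before fitting the classifier.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, train_labels)
clf = SVC(kernel='sigmoid').fit(X_bal, y_bal)
```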
- A review of sentiment analysis research in Arabic language
- Preprocessing Arabic text on social media
- ArCovidVac: A Twitter Dataset for Arabic COVID-19 Vaccine Stance Classification
- AraStance: A Multi-Country and Multi-Domain Dataset of Arabic Stance Detection for Fact Checking
- Camel Tools