Application area review

Before the implementation, I reviewed the literature to identify the state-of-the-art techniques for detecting YouTube spam comments. The findings are summarized below.

Citation	Techniques
(Alberto, Lochter and Almeida, 2015)	NB, LR, KNN, RF
(Aiyar and Shetty, 2018)	N-Gram
(Kanodia, Sasheendran and Pathari, 2018)	Markov decision process
(Selvaraj, Konatham and Anand, 2020)	LR
(Oh, 2021)	Decision Tree, LR, NB, SVM
(Ruth, Khan and Reddy, 2022)	RF, NB, SVM

Although techniques such as the Markov decision process and N-gram models have been studied for this problem (Aiyar and Shetty, 2018; Kanodia, Sasheendran and Pathari, 2018), classification methods such as Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR) and K-Nearest Neighbors (KNN) have shown promising results (Alberto, Lochter and Almeida, 2015; Kanodia, Sasheendran and Pathari, 2018; Selvaraj, Konatham and Anand, 2020; Ruth, Khan and Reddy, 2022). This suggests that classification is a well-suited approach to this problem.

Part B – Compare and evaluate AI techniques.

From the state-of-the-art techniques above, I selected Random Forest (RF), Logistic Regression (LR) and Support Vector Machine (SVM) to compare and evaluate.

Algorithm	Strength	Weakness	Advantage	Disadvantage	Input	Output
Random Forest (RF)	Good performance on large datasets, handling missing data and high dimensional spaces, ability to identify important features	Computationally expensive, not suitable for real-time applications	Ensemble method that improves overall performance, good for identifying important features	Computationally expensive, not suitable for real-time applications	Numerical or categorical features	Binary class label (spam or not spam)
Logistic Regression (LR)	Simple to implement, requires less computational resources, easily interpretable	Sensitive to outliers, not robust for non-linear problems	Good for binary classification problems, easy to implement and interpret	Sensitive to outliers, not robust for non-linear problems	Numerical or categorical features	Binary class label (spam or not spam)
Support Vector Machine (SVM)	Good for high dimensional spaces and non-linear problems	Sensitive to choice of kernel, selection of parameters	Good for classification and regression problems, useful when number of features is greater than number of samples	Sensitive to choice of kernel, selection of parameters	Numerical or categorical features	Class label, boundary that separates the two classes

In the context of YouTube comment spam detection, RFs can be used to classify comments as spam or not spam based on various features such as the text content, user information, and comment history. LR can be used to predict the probability of a comment being spam, based on a set of features, and SVM can be used to build a model that separates spam comments from non-spam comments.
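As a minimal sketch of how the three classifiers could be applied to comment text, the example below trains each one on TF-IDF features with scikit-learn. The toy comments, the labels, and the choice of TF-IDF with a linear-kernel SVM are assumptions made for illustration; they are not taken from the dataset or from the original implementation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy comments (not from the dataset); 1 = spam, 0 = ham.
train_comments = [
    "Check out my channel and subscribe for free gift cards",
    "Click this link to win a free phone now",
    "Subscribe to my channel please, visit my website",
    "Earn money online, click here http://example.com",
    "I love this song, the beat is amazing",
    "This video never gets old",
    "Her voice is incredible in this performance",
    "Who is still listening to this in the morning?",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Unseen comments to classify.
new_comments = ["Subscribe to my channel for free stuff", "Great song, love it"]

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
}

pipelines = {}
predictions = {}
for name, clf in models.items():
    # Each pipeline turns raw text into TF-IDF vectors, then fits the classifier.
    pipelines[name] = make_pipeline(TfidfVectorizer(), clf)
    pipelines[name].fit(train_comments, train_labels)
    predictions[name] = pipelines[name].predict(new_comments).tolist()
    print(name, predictions[name])

# LR also exposes the estimated probability of the spam class (column 1).
spam_prob = pipelines["LR"].predict_proba(new_comments)[:, 1]
print("LR spam probability:", spam_prob.round(2).tolist())
```

With only eight training comments the predictions mean little; the point is that all three models share the same fit/predict interface, so they can be compared on identical features.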

Part C – Implementation.

Datasets

The YouTube Spam Collection is a public set of comments collected for spam research. It consists of five datasets containing 1,956 real messages extracted from five videos that were among the 10 most viewed during the collection period.

Dataset	YouTube ID	Spam	Ham	Total
Psy	9bZkp7q19f0	175	175	350
KatyPerry	CevxZvSJLk8	175	175	350
LMFAO	KQ6zr6kCPj8	236	202	438
Eminem	uelHwf8o7_U	245	203	448
Shakira	pRpeEdMmmQ0	174	196	370

As the table shows, spam and ham are roughly balanced within each dataset. The bar plot below shows the class balance of the combined datasets.

Source: UCI Machine Learning Repository: YouTube Spam Collection Data Set
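The totals and the overall class balance can be checked with a short calculation. The sketch below hard-codes the spam and ham counts from the table above; the variable names are illustrative only.

```python
# Spam and ham counts per dataset, copied from the table above.
datasets = {
    "Psy": (175, 175),
    "KatyPerry": (175, 175),
    "LMFAO": (236, 202),
    "Eminem": (245, 203),
    "Shakira": (174, 196),
}

total_spam = sum(spam for spam, _ in datasets.values())
total_ham = sum(ham for _, ham in datasets.values())
total = total_spam + total_ham

print(total_spam, total_ham, total)            # 1005 951 1956
print(f"spam share: {total_spam / total:.1%}")  # spam share: 51.4%
```

The combined set is therefore close to an even split between spam and ham.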

Finally, a confusion matrix was used to assess the performance of the classification algorithm. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa). The resulting scores are mostly above 90, which suggests that the model is well suited to this problem.
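To make the confusion matrix concrete, here is a small sketch in plain Python. The label lists are made up for illustration and are not results from the model. The sketch uses the convention that rows are actual classes and columns are predicted classes, which is also the convention scikit-learn's `confusion_matrix` uses.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Return a matrix whose rows are actual classes and columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical labels for ten comments; 1 = spam, 0 = ham.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)  # share of predicted spam that is really spam
recall = tp / (tp + fn)     # share of real spam that was caught

print(matrix)                       # [[4, 1], [1, 4]]
print(accuracy, precision, recall)  # 0.8 0.8 0.8
```

Accuracy, precision and recall are all derived from the four cells of the matrix, so reporting the matrix alongside a single score shows which kind of error the model makes.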

Special thanks to the UCI Machine Learning Repository for the dataset and other resources.

SaadhJawwadh/YouTube-Spam-Comments-Detection-
