/YouTube-Spam-Comments-Detection-

This is a machine learning model built with the scikit-learn library and classification algorithms.

Primary language: Jupyter Notebook. License: Creative Commons Zero v1.0 Universal (CC0-1.0).

Application area review

Prior to the implementation, I conducted a literature review to identify the state-of-the-art techniques for detecting YouTube spam comments. The findings are summarized below.

| Citation | Techniques |
| --- | --- |
| (Alberto, Lochter and Almeida, 2015) | NB, LR, KNN, RF |
| (Aiyar and Shetty, 2018) | N-Gram |
| (Kanodia, Sasheendran and Pathari, 2018) | Markov decision process |
| (Selvaraj, Konatham and Anand, 2020) | LR |
| (Oh, 2021) | Decision Tree, LR, NB, SVM |
| (Ruth, Khan and Reddy, 2022) | RF, NB, SVM |

Although techniques such as the Markov decision process and N-Gram have been researched and applied to this problem (Aiyar and Shetty, 2018; Kanodia, Sasheendran and Pathari, 2018), classification methods such as Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR) and K-Nearest Neighbors (KNN) showed promising results (Alberto, Lochter and Almeida, 2015; Kanodia, Sasheendran and Pathari, 2018; Selvaraj, Konatham and Anand, 2020; Ruth, Khan and Reddy, 2022). This suggests that classification is the most suitable technique for this problem.

Part B – Compare and evaluate AI techniques.

From the state-of-the-art techniques above, I selected Random Forest (RF), Logistic Regression (LR) and Support Vector Machine (SVM) to compare and evaluate.

| Algorithm | Strength | Weakness | Advantage | Disadvantage | Input | Output |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest (RF) | Good performance on large datasets, handling missing data and high dimensional spaces, ability to identify important features | Computationally expensive, not suitable for real-time applications | Ensemble method that improves overall performance, good for identifying important features | Computationally expensive, not suitable for real-time applications | Numerical or categorical features | Binary class label (spam or not spam) |
| Logistic Regression (LR) | Simple to implement, requires less computational resources, easily interpretable | Sensitive to outliers, not robust for non-linear problems | Good for binary classification problems, easy to implement and interpret | Sensitive to outliers, not robust for non-linear problems | Numerical or categorical features | Binary class label (spam or not spam) |
| Support Vector Machine (SVM) | Good for high dimensional spaces and non-linear problems | Sensitive to choice of kernel, selection of parameters | Good for classification and regression problems, useful when number of features is greater than number of samples | Sensitive to choice of kernel, selection of parameters | Numerical or categorical features | Class label, boundary that separates the two classes |

In the context of YouTube comment spam detection, RF can be used to classify comments as spam or not spam based on features such as the text content, user information, and comment history. LR can be used to predict the probability that a comment is spam from a set of features, and SVM can be used to learn a boundary that separates spam comments from non-spam comments.
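As a rough illustration of how the three classifiers could be applied to comment text, here is a minimal sketch using scikit-learn. It is not the notebook's actual code: the toy comments, the TF-IDF features, the linear SVM kernel and all variable names are assumptions chosen for the example, and only the text content is used as a feature.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy comments for illustration only (1 = spam, 0 = ham).
comments = [
    "check out my channel and subscribe please",
    "free gift cards, visit my website now",
    "subscribe to my channel for free money",
    "click this link to win a free phone",
    "I love this song so much",
    "this video brings back great memories",
    "the chorus is stuck in my head",
    "best music video of the year",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The three classifiers compared in this section (hyperparameters are assumptions).
models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
}

# Each pipeline turns raw text into TF-IDF features, then fits the classifier.
fitted = {}
for name, clf in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(comments, labels)
    fitted[name] = pipeline

new_comment = ["subscribe to my channel for a free gift"]
for name, pipeline in fitted.items():
    print(name, "predicts class", pipeline.predict(new_comment)[0])

# LR additionally exposes a probability for the spam class (label 1).
spam_probability = fitted["LR"].predict_proba(new_comment)[0][1]
print("LR spam probability:", round(spam_probability, 3))
```

Wrapping the vectorizer and the classifier in one pipeline keeps the feature extraction identical across the three models, so any difference in predictions comes from the classifier itself.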

Part C – Implementation.


Datasets

This is a public set of comments collected for spam research. It comprises five datasets with a total of 1,956 real messages extracted from five videos that were among the 10 most viewed during the collection period.

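| Dataset | YouTube ID | Spam | Ham | Total |
| --- | --- | --- | --- | --- |
| Psy | 9bZkp7q19f0 | 175 | 175 | 350 |
| KatyPerry | CevxZvSJLk8 | 175 | 175 | 350 |
| LMFAO | KQ6zr6kCPj8 | 236 | 202 | 438 |
| Eminem | uelHwf8o7_U | 245 | 203 | 448 |
| Shakira | pRpeEdMmmQ0 | 174 | 196 | 370 |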

As the table shows, each individual dataset is roughly balanced between spam and ham. A bar plot of the class balance of the combined datasets is shown below.

Source: UCI Machine Learning Repository: YouTube Spam Collection Data Set
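The class balance of the combined dataset can also be checked directly from the per-video counts in the table above. The short sketch below only sums those published counts; the variable names are illustrative.

```python
# Spam and ham counts per video, copied from the dataset table above.
counts = {
    "Psy": (175, 175),
    "KatyPerry": (175, 175),
    "LMFAO": (236, 202),
    "Eminem": (245, 203),
    "Shakira": (174, 196),
}

total_spam = sum(spam for spam, _ in counts.values())
total_ham = sum(ham for _, ham in counts.values())
total = total_spam + total_ham

print(f"spam: {total_spam} ({total_spam / total:.1%})")  # spam: 1005 (51.4%)
print(f"ham:  {total_ham} ({total_ham / total:.1%})")    # ham:  951 (48.6%)
print(f"total: {total}")                                 # total: 1956
```

With roughly 51% spam and 49% ham, the combined dataset is close to balanced, so plain accuracy is a reasonable first metric.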

Finally, a confusion matrix was used to evaluate the performance of the classification algorithms. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa, depending on the convention). The resulting scores are high, mostly above 90, which suggests that the model is well suited to this problem.
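To make the layout concrete, here is a small sketch of computing a confusion matrix with scikit-learn. The labels are invented for the example and are not results from this project. Note that `sklearn.metrics.confusion_matrix` uses the "vice versa" convention: rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for illustration only (1 = spam, 0 = ham).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, in the order [ham, spam].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
# [[3 1]
#  [1 3]]
print("accuracy:", accuracy_score(y_true, y_pred))  # accuracy: 0.75
```

Here 3 ham comments and 3 spam comments are classified correctly, 1 ham comment is flagged as spam (false positive), and 1 spam comment is missed (false negative).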

Special thanks to UCI Machine Learning Repository for the dataset and other resources

© Saadh Jawwadh