Before the implementation, I conducted a literature review of the state-of-the-art techniques for detecting YouTube spam comments. The findings are summarized below.
Citation | Techniques |
(Alberto, Lochter and Almeida, 2015) | NB, LR, KNN, RF |
(Aiyar and Shetty, 2018) | N-Gram |
(Kanodia, Sasheendran and Pathari, 2018) | Markov decision process |
(Selvaraj, Konatham and Anand, 2020) | LR |
(Oh, 2021) | Decision Tree, LR, NB, SVM |
(Ruth, Khan and Reddy, 2022) | RF, NB, SVM |
Although techniques such as the Markov decision process and N-grams have been studied for this problem (Aiyar and Shetty, 2018; Kanodia, Sasheendran and Pathari, 2018), classification methods such as Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR) and K-Nearest Neighbors (KNN) showed promising results (Alberto, Lochter and Almeida, 2015; Kanodia, Sasheendran and Pathari, 2018; Selvaraj, Konatham and Anand, 2020; Ruth, Khan and Reddy, 2022), which suggests that classification is a well-suited approach to this problem.
From these state-of-the-art techniques, I selected Random Forest (RF), Logistic Regression (LR) and Support Vector Machine (SVM) to compare and evaluate.
Algorithm | Strength | Weakness | Advantage | Disadvantage | Input | Output |
Random Forest (RF) | Good performance on large datasets, handling missing data and high dimensional spaces, ability to identify important features | Computationally expensive, not suitable for real-time applications | Ensemble method that improves overall performance, good for identifying important features | Computationally expensive, not suitable for real-time applications | Numerical or categorical features | Binary class label (spam or not spam) |
Logistic Regression (LR) | Simple to implement, requires less computational resources, easily interpretable | Sensitive to outliers, not robust for non-linear problems | Good for binary classification problems, easy to implement and interpret | Sensitive to outliers, not robust for non-linear problems | Numerical or categorical features | Binary class label (spam or not spam) |
Support Vector Machine (SVM) | Good for high dimensional spaces and non-linear problems | Sensitive to choice of kernel, selection of parameters | Good for classification and regression problems, useful when number of features is greater than number of samples | Sensitive to choice of kernel, selection of parameters | Numerical or categorical features | Class label, boundary that separates the two classes |
In the context of YouTube comment spam detection, RFs can be used to classify comments as spam or not spam based on various features such as the text content, user information, and comment history. LR can be used to predict the probability of a comment being spam, based on a set of features, and SVM can be used to build a model that separates spam comments from non-spam comments.
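To make the logistic regression part of this concrete, here is a minimal sketch that predicts the probability of a comment being spam from word counts. It is not the implementation used in this article: the tokenizer, the `train` and `spam_probability` helpers, the hyperparameters and the toy comments are all assumptions made up for illustration, and a real project would more likely rely on a library.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def score(text, weights, bias):
    """Linear score: bias plus weight * count for every known token."""
    counts = Counter(tok for tok in tokenize(text) if tok in weights)
    return bias + sum(weights[tok] * n for tok, n in counts.items())


def spam_probability(text, weights, bias):
    """Sigmoid of the linear score, read as P(spam | comment)."""
    return 1.0 / (1.0 + math.exp(-score(text, weights, bias)))


def train(comments, labels, epochs=200, learning_rate=0.5):
    """Fit the weights with plain stochastic gradient descent on the log loss."""
    weights = {tok: 0.0 for text in comments for tok in tokenize(text)}
    bias = 0.0
    for _ in range(epochs):
        for text, label in zip(comments, labels):
            error = spam_probability(text, weights, bias) - label
            bias -= learning_rate * error
            for tok, n in Counter(tokenize(text)).items():
                weights[tok] -= learning_rate * error * n
    return weights, bias


# Hypothetical toy comments, invented for this sketch (1 = spam, 0 = ham).
comments = [
    "check out my channel",
    "subscribe to my channel please",
    "free gift cards click here",
    "i love this song",
    "great video",
    "this song is amazing",
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(comments, labels)
p_spam = spam_probability("please check out my channel", weights, bias)
p_ham = spam_probability("i love this video", weights, bias)
print(f"spam-like comment: {p_spam:.2f}, ham-like comment: {p_ham:.2f}")
```

The output of `spam_probability` is a value between 0 and 1, so a threshold (0.5 is a common default) turns it into the binary spam or not spam label listed in the table above.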
The YouTube Spam Collection is a public set of comments collected for spam research. It comprises five datasets totalling 1,956 real messages extracted from five videos that were among the 10 most viewed during the collection period.
Dataset | YouTube ID | Spam | Ham | Total |
Psy | 9bZkp7q19f0 | 175 | 175 | 350 |
KatyPerry | CevxZvSJLk8 | 175 | 175 | 350 |
LMFAO | KQ6zr6kCPj8 | 236 | 202 | 438 |
Eminem | uelHwf8o7_U | 245 | 203 | 448 |
Shakira | pRpeEdMmmQ0 | 174 | 196 | 370 |
As the table shows, each dataset is balanced or close to balanced. The bar plot below shows the class balance of the combined datasets.
Source: UCI Machine Learning Repository: YouTube Spam Collection Data Set
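The class balance can also be checked directly from the per-video counts in the table above. The short sketch below only re-uses those numbers; the variable and function names are my own and not part of any library.

```python
# Spam and ham counts per dataset, copied from the table above.
counts = {
    "Psy": (175, 175),
    "KatyPerry": (175, 175),
    "LMFAO": (236, 202),
    "Eminem": (245, 203),
    "Shakira": (174, 196),
}


def spam_share(spam, ham):
    """Fraction of comments labelled spam."""
    return spam / (spam + ham)


# Per-dataset balance.
for name, (spam, ham) in counts.items():
    print(f"{name:10s} spam share: {spam_share(spam, ham):.1%}")

# Balance of the five datasets combined.
total_spam = sum(spam for spam, _ in counts.values())
total_ham = sum(ham for _, ham in counts.values())
combined_share = spam_share(total_spam, total_ham)
print(f"Combined   spam={total_spam} ham={total_ham} share={combined_share:.1%}")
```

Combined, the five datasets contain 1,005 spam and 951 ham comments, so spam makes up roughly 51% of the 1,956 messages.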
Finally, a confusion matrix was used to assess the performance of the classification algorithm. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa). The scores it yields are mostly above 90, which suggests that the model is well suited to this problem.
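As a rough illustration of how a confusion matrix turns into scores, the sketch below counts the four cells for a binary spam classifier and derives accuracy, precision and recall from them. The labels are made-up toy values, not results from this project, and the layout (rows as actual classes, columns as predicted classes) is just one of the two conventions.

```python
def confusion_counts(actual, predicted):
    """Count the four cells of a binary confusion matrix (1 = spam, 0 = ham)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # spam caught
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # spam missed
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # ham flagged as spam
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # ham passed through
    return tp, fn, fp, tn


def scores(tp, fn, fp, tn):
    """Accuracy, precision and recall derived from the four cells."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels for eight comments, invented for this sketch.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(actual, predicted)
accuracy, precision, recall = scores(tp, fn, fp, tn)

# Rows are actual classes, columns are predicted classes.
print("            pred spam  pred ham")
print(f"actual spam {tp:9d} {fn:9d}")
print(f"actual ham  {fp:9d} {tn:9d}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Reporting precision and recall alongside accuracy is useful here because a false positive (a genuine comment flagged as spam) and a false negative (spam that slips through) are different kinds of mistake.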
Special thanks to the UCI Machine Learning Repository for the dataset and other resources.
© Saadh Jawwadh