Spam Email Classification | Adversarial Attacks

1. A comparative study of machine learning classifiers and deep neural network algorithms applied to spam email classification

2. An exploration of adversarial attacks on spam email classification algorithms

The methodology (pipeline) of the system consists of the following stages :

Data Preprocessing
Model Training
Model Testing and Evaluation
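
As an illustration of the first stage, here is a minimal preprocessing sketch. The exact cleaning steps used in the notebooks are not listed here, so the lowercasing, punctuation stripping, and the tiny `STOP_WORDS` set are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "and", "of", "in", "is", "for", "you", "your"}


def preprocess(text):
    """Lowercase, replace non-letters with spaces, tokenize, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # digits and punctuation become spaces
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(preprocess("Subject: WIN a FREE prize!!! Click here to claim your $1000."))
```

The resulting token lists can then be fed to either feature extraction route described below.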

Machine Learning Classifiers :

Multinomial Naive Bayes
Logistic Regression
Support Vector Machines (linear and radial basis function kernels)
K-Nearest Neighbours
K-means Clustering
Random Forest
Gradient Boosting
XGBoost
Decision Tree

Deep Learning Algorithms :

DNN (with trainable word embeddings)
RNN (with trainable word embeddings)
CNN (with trainable word embeddings)
DNN (with pretrained GloVe word embeddings)
RNN (with pretrained GloVe word embeddings)
CNN (with pretrained GloVe word embeddings)

Feature Extraction techniques for Machine Learning Classifiers :

TF-IDF -> Term Frequency, Inverse Document Frequency
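
To make the weighting concrete, the sketch below computes textbook TF-IDF in plain Python. It is an illustration only and not necessarily what the notebooks do: scikit-learn's `TfidfVectorizer`, for instance, uses a smoothed IDF and L2-normalizes each row by default, so its numbers would differ. The three toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / df(t)), where df(t) is the number of documents containing t
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["win", "free", "prize", "now"],
    ["meeting", "agenda", "for", "monday"],
    ["free", "meeting", "room", "now"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that appear in only one document ("win", "agenda") receive the largest weights, while a term present in every document would receive a weight of zero.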

Feature Extraction techniques for Deep Learning Classifiers :

Word embeddings: trainable, and non-trainable (pretrained GloVe)

The GloVe embedding vectors must be downloaded and placed in the root directory before use.
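
Once a GloVe file such as `glove.6B.100d.txt` is in place, it can be parsed line by line: each line holds a word followed by its vector components, separated by spaces. The sketch below is an assumption about how one might load it and build an embedding matrix. The helper names `load_glove` and `embedding_matrix` are hypothetical, and an in-memory sample with made-up 3-dimensional vectors stands in for the real file.

```python
import io


def load_glove(lines):
    """Parse GloVe text format into a {word: vector} dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def embedding_matrix(word_index, embeddings, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 is kept for padding
    for word, i in word_index.items():
        if word in embeddings:
            matrix[i] = embeddings[word]
    return matrix


# Stand-in for open("glove.6B.100d.txt", encoding="utf-8"); the vectors are made up.
sample = io.StringIO("free 0.1 -0.2 0.3\nmeeting 0.0 0.5 -0.5\n")
vectors = load_glove(sample)
matrix = embedding_matrix({"free": 1, "unknownword": 2}, vectors, dim=3)
print(matrix)
```

A matrix built this way is typically passed as the initial, frozen weights of an embedding layer for the non-trainable variant.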

Metrics for Evaluation :

Accuracy
F1-score
Precision
Recall
ROC-AUC
Error and Loss
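
For reference, the metrics above can be computed from the confusion counts as in the following sketch, which treats spam as the positive class (label 1). The labels and scores are made-up toy values, not results from this project, and the function names are chosen for the example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = spam, 0 = ham)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Probability that a random spam email outscores a random ham email (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.1, 0.3, 0.6, 0.2, 0.7]
print(classification_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

The pairwise ROC-AUC is quadratic in the number of samples; it is written this way for clarity, and library implementations are preferable on real data.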

Adversarial Attacks :

Label Flipping

Sample Poisoning (Synonym Replacement, Spam / Ham word Injection)
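
Label flipping, the first attack listed above, corrupts the training set by inverting the label of a chosen share of samples. The following is a minimal sketch that flips labels uniformly at random. The `fraction` and `seed` parameters are assumptions for the example, and the project's own selection strategy may differ.

```python
import random


def flip_labels(labels, fraction, seed=0):
    """Return a copy of binary labels with `fraction` of them flipped (0 <-> 1)."""
    rng = random.Random(seed)
    n_flip = int(round(fraction * len(labels)))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
poisoned = flip_labels(labels, fraction=0.2)
print(poisoned)
print(sum(a != b for a, b in zip(labels, poisoned)), "labels flipped")
```

Retraining a classifier on `poisoned` instead of `labels` and comparing the metrics shows how sensitive the model is to this kind of poisoning.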

Algorithm 2-1 : Synonym Replacement :
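
The algorithm figure is not reproduced here, so the following is only a rough sketch of the general idea, not the project's Algorithm 2-1: tokens that have a known synonym are swapped with some probability. The `SYNONYMS` table and the `rate` parameter are hypothetical. A thesaurus such as WordNet is a common source of synonyms in practice.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "free": ["complimentary", "gratis"],
    "win": ["earn", "gain"],
    "prize": ["reward", "award"],
}


def synonym_replace(tokens, rate, seed=0):
    """Replace each token that has a known synonym with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in SYNONYMS and rng.random() < rate:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out


email = ["win", "a", "free", "prize", "today"]
print(synonym_replace(email, rate=1.0))
```

Because the replacement keeps the meaning readable to a human while changing the tokens, a model that relies on specific spam keywords can be misled.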

Algorithm 2-2 : Spam / Ham word Injection :
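
As with the previous attack, the algorithm figure is not reproduced here. The sketch below shows one plausible form of word injection and should not be read as the project's Algorithm 2-2: words typical of the opposite class are inserted at random positions. The word lists, the label convention (1 = spam, 0 = ham), and the `n_words` parameter are assumptions.

```python
import random

# Hypothetical word lists; in practice they could come from the most frequent terms per class.
HAM_WORDS = ["meeting", "agenda", "regards", "attached"]
SPAM_WORDS = ["free", "winner", "prize", "click"]


def inject_words(tokens, label, n_words, seed=0):
    """Insert n_words drawn from the opposite class's list at random positions."""
    rng = random.Random(seed)
    pool = HAM_WORDS if label == 1 else SPAM_WORDS  # spam gets ham words and vice versa
    out = list(tokens)
    for _ in range(n_words):
        out.insert(rng.randint(0, len(out)), rng.choice(pool))
    return out


spam_email = ["claim", "your", "cash", "today"]
print(inject_words(spam_email, label=1, n_words=2))
```

The original tokens keep their relative order, so the email stays readable while its word statistics shift towards the other class.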

Algorithm 3 : Addition of new poisoned emails (To do):

Adversarial Attacks on Deep Learning Models (To do):

Fast Gradient Method (FGM)

Fast Gradient Sign Method (FGSM)

L2 Projected Gradient Descent (PGD)

Linf Projected Gradient Descent (LinfPGD)
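
These attacks are still listed as to-do, so the sketch below is not the project's implementation. It only illustrates the update rules on a logistic regression stand-in, where the gradient of the binary cross-entropy loss with respect to the input x is (p - y) * w. For text models the tokens are discrete, so such perturbations are usually applied to a continuous representation such as the embeddings. The weights and input are made-up values.

```python
import math


def predict(x, w, b):
    """Spam probability p = sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(x, y, w, b):
    """Gradient of binary cross-entropy with respect to the input x."""
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]


def fgsm(x, y, w, b, eps):
    """FGSM: x_adv = x + eps * sign(grad)."""
    grad = input_gradient(x, y, w, b)
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


def fgm(x, y, w, b, eps):
    """FGM: a step of L2 length eps along the normalized gradient."""
    grad = input_gradient(x, y, w, b)
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # avoid division by zero
    return [xi + eps * g / norm for xi, g in zip(x, grad)]


def linf_pgd(x, y, w, b, eps, alpha, steps):
    """Linf PGD: repeated FGSM steps of size alpha, clipped to the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        x_adv = fgsm(x_adv, y, w, b, alpha)
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


w, b = [2.0, -1.0, 0.5], -0.5
x, y = [1.0, 0.2, 0.4], 1  # a sample the model classifies as spam
print(predict(x, w, b))
print(predict(fgsm(x, y, w, b, eps=0.5), w, b))
print(predict(linf_pgd(x, y, w, b, eps=0.5, alpha=0.1, steps=10), w, b))
```

Each attack moves the input in the direction that increases the loss, so the predicted spam probability of the perturbed sample drops. L2 PGD follows the same loop as `linf_pgd`, with `fgm` steps and a projection onto the L2 ball instead of clipping.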

Defence mechanisms against adversarial attacks (To do):

Application of KNN as Defence
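
How KNN will be applied as a defence is not specified here. One possible reading, sketched below as an assumption, is label sanitization against label flipping: each training sample is relabelled with the majority label of its k nearest neighbours, which tends to undo isolated flips. The 2-D points are made up, and `knn_relabel` is a hypothetical helper.

```python
from collections import Counter


def knn_relabel(points, labels, k):
    """Relabel each point with the majority label of its k nearest neighbours (itself excluded)."""
    cleaned = []
    for i, p in enumerate(points):
        # Squared Euclidean distance to every other point, nearest first.
        dists = sorted(
            (sum((a - c) ** 2 for a, c in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        cleaned.append(votes.most_common(1)[0][0])
    return cleaned


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 1, 0, 1, 1, 1, 1]  # the third label has been flipped by an attacker
print(knn_relabel(points, labels, k=3))
```

In a real pipeline the points would be TF-IDF or embedding vectors, and the cleaned labels would be used to retrain the classifier.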

Directory Structure of the repo :

.
|-- Architecture_Diagrams
|   |-- ann_1.jpeg
|   |-- ann_glove_1.jpeg
|   |-- ANN_glove_1.jpeg
|   |-- ann_glove_2.jpeg
|   |-- ANN_glove_2.jpeg
|   |-- ANN_word_Embedding.jpeg
|   |-- cnn_1.jpeg
|   |-- cnn_glove_1.jpeg
|   |-- CNN_glove_1.jpeg
|   |-- cnn_glove_2.jpeg
|   |-- CNN_glove_2.jpeg
|   |-- CNN_word_Embedding.jpeg
|   |-- rnn_1.jpeg
|   |-- rnn_glove_1.jpeg
|   |-- RNN_glove_1.jpeg
|   |-- rnn_glove_2.jpeg
|   |-- RNN_glove_2.jpeg
|   `-- RNN_word_Embedding.jpeg
|-- comparison
|   |-- classifiers (All Saved Models)
|   |   |-- Decision_tree.pkl
|   |   |-- Gradient_Boosting.pkl
|   |   |-- KNN.pkl
|   |   |-- Logistic_regression.pkl
|   |   |-- MultinomialNB.pkl
|   |   |-- Random_forest.pkl
|   |   |-- SVM_linear.pkl
|   |   |-- SVM_RBF.pkl
|   |   |-- train_test_tf_idf.pkl
|   |   `-- XGBoost.pkl
|   |-- All_models_and_classifiers.csv
|   |-- Classifier_Metrics_Comparison.csv
|   |-- Classifier_Metrics_Comparison_percentage.csv
|   |-- DNN_glove_1_comparison.csv
|   |-- DNN_glove_2_comparison.csv
|   |-- DNN_Models.csv
|   |-- DNN_Models_percent.csv
|   |-- DNN_Trainable_Embeddings_comparison.csv
|   |-- Gradient_boosting_hyperparameters.csv
|   |-- KNN_hyperparameters.csv
|   |-- Label_flip_Adversarial.csv
|   |-- Metrics_Comparison.csv
|   |-- Spam_Ham_Injection_Adversarial.csv
|   |-- Spam_Ham_Injection_random_Adversarial.csv
|   |-- Synonym_Adversarial.csv
|   `-- Tuned_KNN_GB.csv
|-- Datasets
|   |-- archive
|   |   `-- emails.csv
|   `-- spam_dataset_1
|       `-- emails.csv
|-- EDA
|   |-- LEAST_Word_Counts_0.jpeg
|   |-- LEAST_Word_Counts_10.jpeg
|   |-- LEAST_Word_Counts_11.jpeg
|   |-- LEAST_Word_Counts_12.jpeg
|   |-- LEAST_Word_Counts_13.jpeg
|   |-- LEAST_Word_Counts_14.jpeg
|   |-- LEAST_Word_Counts_15.jpeg
|   |-- LEAST_Word_Counts_16.jpeg
|   |-- LEAST_Word_Counts_17.jpeg
|   |-- LEAST_Word_Counts_18.jpeg
|   |-- LEAST_Word_Counts_19.jpeg
|   |-- LEAST_Word_Counts_1.jpeg
|   |-- LEAST_Word_Counts_20.jpeg
|   |-- LEAST_Word_Counts_21.jpeg
|   |-- LEAST_Word_Counts_22.jpeg
|   |-- LEAST_Word_Counts_23.jpeg
|   |-- LEAST_Word_Counts_24.jpeg
|   |-- LEAST_Word_Counts_25.jpeg
|   |-- LEAST_Word_Counts_26.jpeg
|   |-- LEAST_Word_Counts_27.jpeg
|   |-- LEAST_Word_Counts_28.jpeg
|   |-- LEAST_Word_Counts_29.jpeg
|   |-- LEAST_Word_Counts_2.jpeg
|   |-- LEAST_Word_Counts_3.jpeg
|   |-- LEAST_Word_Counts_4.jpeg
|   |-- LEAST_Word_Counts_5.jpeg
|   |-- LEAST_Word_Counts_6.jpeg
|   |-- LEAST_Word_Counts_7.jpeg
|   |-- LEAST_Word_Counts_8.jpeg
|   |-- LEAST_Word_Counts_9.jpeg
|   |-- MOST_Word_Counts_0.jpeg
|   |-- MOST_Word_Counts_10.jpeg
|   |-- MOST_Word_Counts_11.jpeg
|   |-- MOST_Word_Counts_12.jpeg
|   |-- MOST_Word_Counts_13.jpeg
|   |-- MOST_Word_Counts_14.jpeg
|   |-- MOST_Word_Counts_15.jpeg
|   |-- MOST_Word_Counts_16.jpeg
|   |-- MOST_Word_Counts_17.jpeg
|   |-- MOST_Word_Counts_18.jpeg
|   |-- MOST_Word_Counts_19.jpeg
|   |-- MOST_Word_Counts_1.jpeg
|   |-- MOST_Word_Counts_20.jpeg
|   |-- MOST_Word_Counts_21.jpeg
|   |-- MOST_Word_Counts_22.jpeg
|   |-- MOST_Word_Counts_23.jpeg
|   |-- MOST_Word_Counts_24.jpeg
|   |-- MOST_Word_Counts_25.jpeg
|   |-- MOST_Word_Counts_26.jpeg
|   |-- MOST_Word_Counts_27.jpeg
|   |-- MOST_Word_Counts_28.jpeg
|   |-- MOST_Word_Counts_29.jpeg
|   |-- MOST_Word_Counts_2.jpeg
|   |-- MOST_Word_Counts_3.jpeg
|   |-- MOST_Word_Counts_4.jpeg
|   |-- MOST_Word_Counts_5.jpeg
|   |-- MOST_Word_Counts_6.jpeg
|   |-- MOST_Word_Counts_7.jpeg
|   |-- MOST_Word_Counts_8.jpeg
|   `-- MOST_Word_Counts_9.jpeg
|-- glove
|   |-- 6471382cdd837544bf3ac72497a38715e845897d265b2b424b4761832009c837
|   |   |-- glove.6B.100d.txt
|   |   |-- glove.6B.200d.txt
|   |   |-- glove.6B.300d.txt
|   |   `-- glove.6B.50d.txt
|   |-- 357baac33090f645e71e253b3295ee1b767c98a0336e9a1d99c77e9e33b43c4a.zip
|   |-- 6471382cdd837544bf3ac72497a38715e845897d265b2b424b4761832009c837.zip
|   `-- glove.42B.300d.txt
|-- heatmaps
|   |-- ANN_glove_2.jpeg
|   |-- ANN_glove.jpeg
|   |-- ANN.jpeg
|   |-- CNN_glove_2.jpeg
|   |-- CNN_glove.jpeg
|   |-- CNN.jpeg
|   |-- Descision_Tree.jpeg
|   |-- Gradient_boosting.jpeg
|   |-- KNN.jpeg
|   |-- Linear_SVC.jpeg
|   |-- Logistic_Regression.jpeg
|   |-- Naive_Bayes.jpeg
|   |-- Random_Forest.jpeg
|   |-- RNN_glove_2.jpeg
|   |-- RNN_glove.jpeg
|   |-- RNN.jpeg
|   |-- SVC.jpeg
|   |-- XGBoost.jpeg
|   `-- XGBoost_tuned.jpeg
|-- Model
|   |-- ANN_glove_1.h5
|   |-- ANN_glove_2.h5
|   |-- ANN.h5
|   |-- CNN_glove_1.h5
|   |-- CNN_glove_2.h5
|   |-- CNN.h5
|   |-- model.zip
|   |-- RNN_glove_1.h5
|   |-- RNN_glove_2.h5
|   `-- RNN.h5
|-- __pycache__
|   `-- utils.cpython-38.pyc
|-- Visuals
|   |-- Acc-Loss
|   |   |-- ann_accuracy_loss.jpeg
|   |   |-- ann_glove_accuracy_loss_2.jpeg
|   |   |-- ann_glove_accuracy_loss.jpeg
|   |   |-- cnn_accuracy_loss.jpeg
|   |   |-- cnn_glove_accuracy_loss_2.jpeg
|   |   |-- cnn_glove_accuracy_loss.jpeg
|   |   |-- rnn_accuracy_loss.jpeg
|   |   |-- rnn_glove_accuracy_loss_2.jpeg
|   |   `-- rnn_glove_accuracy_loss.jpeg
|   |-- AU-ROC
|   |   |-- AUC_ANN_GLOVE_2.jpeg
|   |   |-- AUC_ANN_GLOVE.jpeg
|   |   |-- AUC_ANN.jpeg
|   |   |-- AUC_CNN_GLOVE_2.jpeg
|   |   |-- AUC_CNN_GLOVE.jpeg
|   |   |-- AUC_Descision_Tree.jpeg
|   |   |-- AUC_Gradient_Boosting.jpeg
|   |   |-- AUC_KNN.jpeg
|   |   |-- AUC_KNN_tuned.jpeg
|   |   |-- AUC_Logistic_regression.jpeg
|   |   |-- AUC_NB.jpeg
|   |   |-- AUC_NB_tuned.jpeg
|   |   |-- AUC_Random_Forest.jpeg
|   |   |-- AUC_RNN_GLOVE_2.jpeg
|   |   |-- AUC_RNN_GLOVE.jpeg
|   |   |-- AUC_RNN.jpeg
|   |   |-- AUC_SVC.jpeg
|   |   |-- AUC_SVM_Linear.jpeg
|   |   |-- AUC_XGBoost.jpeg
|   |   `-- AUC_XGBoost_tuned.jpeg
|   |-- T-SNE
|   |   |-- ann_Embeddings_1.jpeg
|   |   |-- ann_glove_Embeddings_1.jpeg
|   |   |-- ann_glove_Embeddings_2.jpeg
|   |   |-- cnn_Embeddings_1.jpeg
|   |   |-- cnn_glove_Embeddings_1.jpeg
|   |   |-- cnn_glove_Embeddings_2.jpeg
|   |   |-- rnn_Embeddings_1.jpeg
|   |   |-- rnn_glove_Embeddings_1.jpeg
|   |   `-- rnn_glove_Embeddings_2.jpeg
|   |-- Ham_Overall_Frequency_Distribution.jpeg
|   |-- ham_vs_spam.jpeg
|   |-- modelComparison.png
|   |-- Overall_Frequency_Distribution.jpeg
|   |-- Spam_Overall_Frequency_Distribution.jpeg
|   |-- wordcloud_ham.jpeg
|   |-- wordcloud_overall.jpeg
|   `-- wordcloud_spam.jpeg
|-- alg-1.png
|-- alg-2-1.png
|-- alg-2-2.png
|-- alg-3.png
|-- alg-4.png
|-- Augmented_emails.csv
|-- Comparison of Models | Adversarial Attacks data Preparation.ipynb
|-- emails.csv
|-- glove.6B.100d.txt
|-- Methodology.jpg
|-- Plot Model Architectures.ipynb
|-- README.md
|-- SPAM Email Classification .ipynb
|-- Spam_Email_Classification_with_ANN,_RNN,_CNN_with_pretrained_glove_Word_embeddings_1_.ipynb
|-- Spam_Email_Classification_with_ANN,_RNN,_CNN_with_pretrained_glove_Word_embeddings_2.ipynb
|-- Spam_Email_Classification_with_ANN,_RNN,_CNN_with_word_embeddings_respectively.ipynb
|-- tfidf.csv
`-- utils.py

Citation :

@INPROCEEDINGS{9672398,
  author={Hasan, Md. Mohidul and Zaman, Syed Mahbubuz and Talukdar, Md. Asif and Siddika, Ayesha and Rabiul Alam, Md. Golam},
  booktitle={2021 IEEE International Conference on Service Operations and Logistics, and Informatics (SOLI)}, 
  title={An Analysis of Machine Learning Algorithms and Deep Neural Networks for Email Spam Classification using Natural Language Processing}, 
  year={2021},
  volume={},
  number={},
  pages={1-6},
  doi={10.1109/SOLI54607.2021.9672398}}
