/PhishingWebsiteClassifiers

Performance Evaluation of Machine Learning Classifiers for the Detection of Phishing Websites

Primary language: Jupyter Notebook. License: MIT.

Project Title

PERFORMANCE EVALUATION OF MACHINE LEARNING CLASSIFIERS FOR THE DETECTION OF PHISHING WEBSITES

Project Description

Phishing is a common form of cybercrime in which attackers obtain the personally identifiable information of their targets and use it for financial gain. Several recent studies have used machine learning to distinguish phishing websites from their legitimate counterparts, but further study is required in this area. This research experimentally compares the performance of five machine learning techniques for classifying websites. Two publicly available datasets from Mendeley were used. Each dataset was split into two variations, giving four in total. StandardScaler and Principal Component Analysis (PCA) were applied as preprocessing steps before the data were fed into the machine learning models. Among the single models, Random Forest (RF) gave the best performance on all variations of the two datasets. Among the ensemble models, the combination of Random Forest with Extremely Randomized Trees outperformed the others. A key result of this research is that the two models sharing the same base learner (Decision Tree) outperformed the other traditional machine learning models used.
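The preprocessing and modelling flow described above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: it uses synthetic data from `make_classification` in place of the Mendeley datasets, and the number of PCA components, the tree counts, and the use of soft voting to combine Random Forest with Extremely Randomized Trees are all assumptions, since the description does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a phishing dataset (assumption: the real data is not loaded here).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def build_pipeline(model, n_components=10):
    """Scale the features, reduce them with PCA, then fit the given classifier."""
    return make_pipeline(StandardScaler(), PCA(n_components=n_components), model)


def make_rf():
    return RandomForestClassifier(n_estimators=100, random_state=0)


def make_et():
    return ExtraTreesClassifier(n_estimators=100, random_state=0)


# Soft voting averages the predicted class probabilities of both tree ensembles.
# The combination strategy is an assumption made for this sketch.
models = {
    "random_forest": make_rf(),
    "rf_plus_extra_trees": VotingClassifier(
        estimators=[("rf", make_rf()), ("et", make_et())], voting="soft"
    ),
}

scores = {}
for name, model in models.items():
    pipeline = build_pipeline(model)
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)  # accuracy on the held-out split

print(scores)
```

Wrapping the scaler and PCA in a pipeline means both are fitted on the training split only, which avoids leaking information from the test split into the preprocessing.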

Language and Tools

The Python programming language was used in this project from start to end.

Python, pandas, scikit-learn

How to Install and Run the Project

The program was run on Google Colab (Ubuntu 20.04).

How to Use

Many files are created and modified by this project, so edit the path variables throughout the files to match your own file organization. The entire project was run on Google Colab with storage on Google Drive. You can also run it locally, but you will need a considerable amount of resources because some of the computations are resource-heavy.
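One way to keep the path edits in a single place is to derive every location from one base directory. The sketch below is only an illustration: the `PHISHING_PROJECT_DIR` environment variable, the `phishing` folder, and the `data` and `outputs` sub-folders are hypothetical names, not ones taken from the notebooks. `/content/drive/MyDrive` is where Google Drive usually appears once it is mounted in Colab.

```python
import os
from pathlib import Path


def project_paths(base_dir):
    """Return the project's folders, all derived from a single base directory."""
    base = Path(base_dir).expanduser()
    return {
        "base": base,
        "data": base / "data",        # hypothetical folder for the datasets
        "outputs": base / "outputs",  # hypothetical folder for generated files
    }


# Hypothetical environment variable; falls back to a Drive folder for Colab runs.
PATHS = project_paths(
    os.environ.get("PHISHING_PROJECT_DIR", "/content/drive/MyDrive/phishing")
)

print(PATHS["data"])
```

For a local run, setting the environment variable (or passing a different folder to `project_paths`) would redirect every path without touching the rest of the code.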

How to Contribute

If you would like to participate, you can fork this repo and contribute by doing any of the following:

  • Optimizing the existing solution
  • Writing documentation and comments for the notebooks
  • Writing tests
  • Opening issues to be fixed

Support

https://www.buymeacoffee.com/theedon