Phishing Webpage Detection

This project uses machine learning to detect phishing webpages. It is a work in progress at an early stage.

Environment

Development

Python 3.6
Visual Studio Code
macOS Catalina

Milestones

Functions for Features Extraction

URL and Domain based

Does the domain contain non-ASCII characters?
Does the URL use a URL shortening service?
Does the URL have a deep level of subdomains?
Does the URL have low Alexa rank?
Is the domain not indexed by Google?
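
As an illustration of the first three checks, here is a minimal sketch that uses only the standard library. The function names, the shortener list, and the depth threshold are assumptions made for this example, not the project's actual code. The Alexa rank and Google index checks depend on external services and are not sketched here.

```python
from urllib.parse import urlparse

# Hypothetical, non-exhaustive list of shortening services (an assumption).
SHORTENER_DOMAINS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly"}


def get_hostname(url):
    """Return the lowercased hostname of a URL, or an empty string."""
    return (urlparse(url).hostname or "").lower()


def has_non_ascii_domain(url):
    """True if the hostname has non-ASCII characters or punycode labels."""
    host = get_hostname(url)
    if any(ord(ch) > 127 for ch in host):
        return True
    # Internationalized names are often seen in their "xn--" encoded form.
    return any(label.startswith("xn--") for label in host.split("."))


def uses_shortener(url):
    """True if the hostname is one of the known shortening services."""
    return get_hostname(url) in SHORTENER_DOMAINS


def subdomain_depth(url):
    """Count labels left of the last two (naive: ignores suffixes like co.uk)."""
    labels = [label for label in get_hostname(url).split(".") if label]
    return max(len(labels) - 2, 0)


def has_deep_subdomain(url, threshold=3):
    """True if the subdomain depth reaches the (assumed) threshold."""
    return subdomain_depth(url) >= threshold
```

A more careful version would likely consult the Public Suffix List instead of assuming that the registered domain is always the last two labels.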

Code based

Does the URL redirect to another domain?
Does the URL use many external resources?
Does the URL open new windows?
Does the URL block right clicks?
Does the URL use an inception bar? (Ref)
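
Some of these checks can be approximated from the page's HTML alone. The sketch below is a rough, hypothetical take using the standard library: the class and function names, the set of tags counted as resources, and the regular expressions are all assumptions for illustration. Redirect detection needs a live request and is not covered.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

RESOURCE_TAGS = ("script", "img", "iframe", "link")


class ResourceCounter(HTMLParser):
    """Count resources loaded from hosts other than the page's own host."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).hostname
        self.total = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag not in RESOURCE_TAGS:
            return
        attrs = dict(attrs)
        # <link> uses href; scripts, images and iframes use src.
        ref = attrs.get("href") if tag == "link" else attrs.get("src")
        if not ref:
            return
        self.total += 1
        # Resolve relative references against the page URL before comparing.
        host = urlparse(urljoin(self.page_url, ref)).hostname
        if host != self.page_host:
            self.external += 1


def external_resource_ratio(page_url, html):
    """Fraction of counted resources served from another host."""
    counter = ResourceCounter(page_url)
    counter.feed(html)
    return counter.external / counter.total if counter.total else 0.0


def blocks_right_click(html):
    """Heuristic: the page registers a contextmenu handler."""
    pattern = r"oncontextmenu\s*=|addEventListener\(\s*[\"']contextmenu"
    return re.search(pattern, html, re.IGNORECASE) is not None


def opens_new_window(html):
    """Heuristic: the page calls window.open somewhere in its markup."""
    return re.search(r"window\.open\s*\(", html) is not None
```

These are string-level heuristics. Pages that build their behaviour dynamically in JavaScript would need a real browser to evaluate.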

Content based (Future)

Generate Small Data Set

Fetch URLs from PhishTank

A script fetches phishing URLs and non-phishing URLs. To run it, go to the project root directory and execute

python3 fetch_data.py

By default, it fetches 100 phishing URLs and 100 non-phishing URLs. This can be changed in the saveUrls() function.

Extract Features and Generate Dataset

Run this step in a virtual machine, because it opens the phishing webpages. Another script extracts the features and generates the CSV file. To run it, go to the project root directory and execute

python3 generate_dataset.py
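
To give a feel for the output, here is a minimal sketch of writing one feature row per URL to CSV with the standard csv module. The function name, the column names, and the file name dataset.csv are hypothetical. The real layout is whatever generate_dataset.py produces.

```python
import csv

# Hypothetical column layout: the URL, a few binary features, and the label.
FIELDNAMES = ["url", "non_ascii_domain", "uses_shortener", "label"]


def write_rows(rows, out, fieldnames=FIELDNAMES):
    """Write a header line plus one CSV line per feature dictionary."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    example_rows = [
        {"url": "http://a.test", "non_ascii_domain": 0,
         "uses_shortener": 1, "label": 1},
    ]
    # newline="" lets the csv module control line endings itself.
    with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
        write_rows(example_rows, f)
```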

Simple Machine Learning (In Progress)

At the current stage, the following algorithms are used for machine learning.

Logistic Regression
Decision Tree
Random Forest

python3 machine_learn.py
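
For reference, comparing the three algorithms with scikit-learn might look like the sketch below. It assumes scikit-learn is installed and uses synthetic data as a stand-in for the generated CSV. The hyperparameters and the train/test split are illustrative choices, not necessarily what machine_learn.py does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and the phish / non-phish labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy = {scores[name]:.2f}")
```

With only a couple of hundred samples, accuracy on a single split is noisy, so cross-validation would give a steadier comparison.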

Unit Tests

Unit tests cover specific modules and functions. To run them, go to the project root directory and execute

python3 -m unittest
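
A test module that this command would discover could look roughly like the sketch below. The helper has_non_ascii and the test names are hypothetical stand-ins, defined inline so the example is self-contained. In the project the function under test would be imported from its own module, and the file name would need to match the test*.py discovery pattern.

```python
import unittest


def has_non_ascii(domain):
    """Hypothetical stand-in for a feature function under test."""
    return any(ord(ch) > 127 for ch in domain)


class HasNonAsciiTest(unittest.TestCase):
    def test_plain_ascii_domain(self):
        self.assertFalse(has_non_ascii("example.com"))

    def test_domain_with_cyrillic_letters(self):
        self.assertTrue(has_non_ascii("пример.com"))

    def test_empty_domain(self):
        self.assertFalse(has_non_ascii(""))
```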