This project uses machine learning to detect phishing webpages. It is a work in progress at an early stage.
- Python 3.6
- Visual Studio Code
- macOS Catalina
- Does the domain contain non-ASCII characters?
- Does the URL use a URL shortening service?
- Does the URL have a deep level of subdomains?
- Does the URL have a low Alexa rank?
- Is the domain not indexed by Google?
- Does the URL redirect to another domain?
- Does the URL use many external resources?
- Does the URL open new windows?
- Does the URL block right clicks?
- Does the URL use an inception bar? (Ref)
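As a rough illustration of how the first few checks in this list could be computed from the URL string alone, here is a minimal sketch. It is not the project's actual feature extractor: the function names, the `SHORTENER_HOSTS` set, and the `max_depth` threshold are all hypothetical, and the subdomain count naively assumes the registered domain is the last two labels.

```python
from urllib.parse import urlsplit

# Hypothetical, illustrative list of shortener hosts (an assumption, not the project's list).
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def _hostname(url):
    """Return the lower-cased host of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def has_non_ascii_domain(url):
    """True if the host has non-ASCII characters or a punycode (xn--) label."""
    host = _hostname(url)
    if any(ord(ch) > 127 for ch in host):
        return True
    return any(label.startswith("xn--") for label in host.split("."))


def uses_url_shortener(url):
    """True if the host is one of the known shortening services."""
    return _hostname(url) in SHORTENER_HOSTS


def subdomain_depth(url):
    """Count labels in front of the last two (naive: ignores suffixes like co.uk)."""
    labels = [label for label in _hostname(url).split(".") if label]
    return max(len(labels) - 2, 0)


def extract_url_features(url, max_depth=2):
    """Collect the string-based checks into one feature dictionary."""
    return {
        "non_ascii_domain": has_non_ascii_domain(url),
        "url_shortener": uses_url_shortener(url),
        "deep_subdomain": subdomain_depth(url) > max_depth,
    }
```

The remaining checks (Alexa rank, Google indexing, redirects, external resources, new windows, right-click blocking) need network access or a rendered page, so they are left out of this sketch.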
A script fetches phishing and non-phishing URLs. To run it, go to the project root directory and execute:
python3 fetch_data.py
By default, it fetches 100 phishing URLs and 100 non-phishing URLs. This can be changed in the saveUrls() function.
This is best run in a virtual machine because it opens those phishing webpages. Another script performs the feature extraction and generates the CSV file. To run it, go to the project root directory and execute:
python3 generate_dataset.py
At the current stage, the following machine learning algorithms are used:
- Logistic Regression
- Decision Tree
- Random Forest
To run them, go to the project root directory and execute:
python3 machine_learn.py
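For orientation, the three algorithms listed above could be compared along the following lines. This is only a sketch under stated assumptions: it assumes scikit-learn is installed (the README does not say which library `machine_learn.py` uses), the `compare_models` function is hypothetical, and the demo runs on synthetic data rather than the generated CSV.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, seed=0):
    """Fit each classifier on a training split and return its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # score() returns the mean accuracy on the held-out split.
        scores[name] = model.score(X_test, y_test)
    return scores


if __name__ == "__main__":
    # Synthetic stand-in for the feature matrix and phishing labels.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    for name, accuracy in compare_models(X, y).items():
        print("{}: {:.3f}".format(name, accuracy))
```

Holding out a test split keeps the reported accuracy from reflecting only how well each model memorized its training data.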
Unit tests cover specific modules and functions. To run them, go to the project root directory and execute:
python3 -m unittest
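As an illustration of the kind of test that `python3 -m unittest` discovers, here is a minimal, self-contained sketch. The `count_subdomains` helper and the test class are hypothetical and defined inline only so the example runs on its own; the project's real tests import its own modules.

```python
import unittest


def count_subdomains(hostname):
    """Hypothetical helper: count labels in front of the last two labels."""
    labels = [part for part in hostname.split(".") if part]
    return max(len(labels) - 2, 0)


class CountSubdomainsTest(unittest.TestCase):
    def test_bare_domain_has_no_subdomains(self):
        self.assertEqual(count_subdomains("example.com"), 0)

    def test_nested_subdomains_are_counted(self):
        self.assertEqual(count_subdomains("a.b.c.example.com"), 3)

    def test_empty_hostname(self):
        self.assertEqual(count_subdomains(""), 0)
```

With default discovery, a file like this is picked up when its name matches `test*.py` and it sits in the directory where the command is run (or in an importable package below it).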