Phishing Websites Predictions

This project uses the Phishing Websites dataset from the UCI Machine Learning Repository. The objective is to identify whether a website is a phishing website or not.

Codebook

The dataset has 31 columns: 30 features and 1 target. It contains 2456 observations in total. I used 75% of the observations (1843) as the training set and the remaining 613 as the test set. Here is a list of all the attributes in the dataset, along with their possible values and the column names used:

Attributes	Values	Column Name
Having IP Address	{ 1,0 }	has_ip
Having long url	{ 1,0,-1 }	long_url
Uses Shortening Service	{ 0,1 }	short_service
Having '@' Symbol	{ 0,1 }	has_at
Double slash redirecting	{ 0,1 }	double_slash_redirect
Having Prefix Suffix	{ -1,0,1 }	pref_suf
Having Sub Domain	{ -1,0,1 }	has_sub_domain
SSLfinal State	{ -1,1,0 }	ssl_state
Domain registration length	{ 0,1,-1 }	long_domain
Favicon	{ 0,1 }	favicon
Is standard Port	{ 0,1 }	port
Uses HTTPS token	{ 0,1 }	https_token
Request_URL	{ 1,-1 }	req_url
Abnormal URL anchor	{ -1,0,1 }	url_of_anchor
Links_in_tags	{ 1,-1,0 }	tag_links
SFH	{ -1,1 }	SFH
Submitting to email	{ 1,0 }	submit_to_email
Abnormal URL	{ 1,0 }	abnormal_url
Redirect	{ 0,1 }	redirect
on mouseover	{ 0,1 }	mouseover
Right Click	{ 0,1 }	right_click
popUp Window	{ 0,1 }	popup
Iframe	{ 0,1 }	iframe
Age of domain	{ -1,0,1 }	domain_age
DNS Record	{ 1,0 }	dns_record
Web traffic	{ -1,0,1 }	traffic
Page Rank	{ -1,0,1 }	page_rank
Google Index	{ 0,1 }	google_index
Links pointing to page	{ 1,0,-1 }	links_to_page
Statistical report	{ 1,0 }	stats_report
Result	{ 1,-1 }	target

Attributes with a binary value space generally denote the absence or presence of the respective attribute. Attributes with three possible values generally represent its strength (low, medium, high).
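As a rough illustration of how the codebook can be used, the sketch below records a few of the columns and their value sets in Python and checks a row against them. The names `CODEBOOK` and `invalid_columns` are hypothetical and not part of this repository, only a subset of the columns is shown, and the sample rows are made up.

```python
# Subset of the codebook above: column name -> allowed values.
# CODEBOOK and invalid_columns are illustrative names, not part of phishing.R.
CODEBOOK = {
    "has_ip": {1, 0},
    "long_url": {1, 0, -1},
    "ssl_state": {-1, 1, 0},
    "req_url": {1, -1},
    "target": {1, -1},
}


def invalid_columns(row):
    """Return the sorted names of columns whose value is missing or not allowed."""
    return sorted(
        name for name, allowed in CODEBOOK.items()
        if row.get(name) not in allowed
    )


# Made-up rows for demonstration only.
good_row = {"has_ip": 1, "long_url": -1, "ssl_state": 0, "req_url": 1, "target": -1}
bad_row = {"has_ip": 2, "long_url": 0, "ssl_state": 1, "req_url": 0, "target": 1}

print(invalid_columns(good_row))  # []
print(invalid_columns(bad_row))   # ['has_ip', 'req_url']
```

A check like this could be run on each row of the CSV before modelling to catch values that fall outside the documented value space.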

R Script

Possible phishing websites are identified in R using the caret package.

The R script, phishing.R, first loads the required libraries and the dataset from the phishing.csv file
Column names are set using a names array (as shown in the codebook above)
The dataset is then split into training and test sets using caret's createDataPartition function
Three different models are then trained on the training set: boosted logistic regression, SVM with an RBF kernel, and tree bag
For each model, a confusionMatrix is computed after predicting the samples in the test set
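The split and evaluation steps above can be sketched in plain Python. This is not the code in phishing.R: `stratified_split` is a simplified stand-in for a class-balanced partition such as the one createDataPartition produces, `confusion_counts` and `accuracy` are hypothetical helper names, no model is trained, and all labels and predictions are made up.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx), sampling train_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train = []
    for indices in by_class.values():
        rng.shuffle(indices)
        train.extend(indices[: round(len(indices) * train_fraction)])
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) label pair."""
    return Counter(zip(actual, predicted))


def accuracy(counts):
    """Share of samples whose predicted label equals the actual label."""
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return correct / sum(counts.values())


# Made-up target labels: 12 samples of class 1 and 8 of class -1.
labels = [1] * 12 + [-1] * 8
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 15 5

# Made-up test labels and predictions, standing in for a model's output.
actual = [1, 1, -1, -1, 1]
predicted = [1, -1, -1, -1, 1]
counts = confusion_counts(actual, predicted)
print(counts[(1, -1)])   # 1 sample of class 1 predicted as -1
print(accuracy(counts))  # 0.8
```

Sampling within each class keeps the ratio of the two target values roughly the same in the training and test sets, so the accuracy is not skewed by an uneven split.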

Codebase Structure

Ipython Notebooks\ - contains the IPython notebooks used with BigML and to partition the train and test sets
Datasets\ - contains the CSV data files used in BigML and the R script
attributes.txt - contains information about the attributes in the dataset
phishing.R - R script that applies the treebag model (similar to the BigML ensemble)
Conclusion.pdf - answer to the question: do you think these predictions are good?
BigML_classification.py - Python script for calling and running the ensemble model via the BigML API
BigML_summary.txt - summary of the BigML model

Results

I was able to get 96.4% accuracy with the treebag model. Here is a plot of the variable importance in the treebag model.
