/phishing-websites

Identifies phishing websites using a treebag model


Phishing Websites Predictions

This project uses the Phishing Websites dataset from the UCI Machine Learning Repository. The objective is to identify whether or not a website is a phishing website.

Codebook

There are 31 columns in the dataset: 30 features and 1 target. In total there are 2456 observations. I used 75% of the observations (1843) as the training set and the remaining 613 as the test set. Here is a list of all the attributes in the dataset, along with their possible values and the column names used:

| Attributes | Values | Column Name |
|---|---|---|
| Having IP Address | { 1,0 } | has_ip |
| Having long url | { 1,0,-1 } | long_url |
| Uses ShortningService | { 0,1 } | short_service |
| Having '@' Symbol | { 0,1 } | has_at |
| Double slash redirecting | { 0,1 } | double_slash_redirect |
| Having Prefix Suffix | { -1,0,1 } | pref_suf |
| Having Sub Domain | { -1,0,1 } | has_sub_domain |
| SSLfinal State | { -1,1,0 } | ssl_state |
| Domain registeration length | { 0,1,-1 } | long_domain |
| Favicon | { 0,1 } | favicon |
| Is standard Port | { 0,1 } | port |
| Uses HTTPS token | { 0,1 } | https_token |
| Request_URL | { 1,-1 } | req_url |
| Abnormal URL anchor | { -1,0,1 } | url_of_anchor |
| Links_in_tags | { 1,-1,0 } | tag_links |
| SFH | { -1,1 } | SFH |
| Submitting to email | { 1,0 } | submit_to_email |
| Abnormal URL | { 1,0 } | abnormal_url |
| Redirect | { 0,1 } | redirect |
| on mouseover | { 0,1 } | mouseover |
| Right Click | { 0,1 } | right_click |
| popUp Window | { 0,1 } | popup |
| Iframe | { 0,1 } | iframe |
| Age of domain | { -1,0,1 } | domain_age |
| DNS Record | { 1,0 } | dns_record |
| Web traffic | { -1,0,1 } | traffic |
| Page Rank | { -1,0,1 } | page_rank |
| Google Index | { 0,1 } | google_index |
| Links pointing to page | { 1,0,-1 } | links_to_page |
| Statistical report | { 1,0 } | stats_report |
| Result | { 1,-1 } | target |

Attributes with a binary value space generally denote the absence or presence of the respective attribute. Attributes with three possible values generally represent its strength (low, medium, high).
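As a rough illustration of how the codebook can be used, the following Python sketch pairs a raw row with the column names and flags values that fall outside the listed value spaces. The column names and value sets are copied from the codebook; the helper names `label_row` and `invalid_columns` are hypothetical and are not taken from the repository, and the sketch does not check whether the actual CSV matches these value sets.

```python
# Column names and value spaces copied from the codebook (in column order).
CODEBOOK = {
    "has_ip": {1, 0},
    "long_url": {1, 0, -1},
    "short_service": {0, 1},
    "has_at": {0, 1},
    "double_slash_redirect": {0, 1},
    "pref_suf": {-1, 0, 1},
    "has_sub_domain": {-1, 0, 1},
    "ssl_state": {-1, 1, 0},
    "long_domain": {0, 1, -1},
    "favicon": {0, 1},
    "port": {0, 1},
    "https_token": {0, 1},
    "req_url": {1, -1},
    "url_of_anchor": {-1, 0, 1},
    "tag_links": {1, -1, 0},
    "SFH": {-1, 1},
    "submit_to_email": {1, 0},
    "abnormal_url": {1, 0},
    "redirect": {0, 1},
    "mouseover": {0, 1},
    "right_click": {0, 1},
    "popup": {0, 1},
    "iframe": {0, 1},
    "domain_age": {-1, 0, 1},
    "dns_record": {1, 0},
    "traffic": {-1, 0, 1},
    "page_rank": {-1, 0, 1},
    "google_index": {0, 1},
    "links_to_page": {1, 0, -1},
    "stats_report": {1, 0},
    "target": {1, -1},
}
COLUMN_NAMES = list(CODEBOOK)  # dicts preserve insertion order


def label_row(values):
    """Pair one raw row (31 integers, in codebook order) with its column names."""
    if len(values) != len(COLUMN_NAMES):
        raise ValueError(f"expected {len(COLUMN_NAMES)} values, got {len(values)}")
    return dict(zip(COLUMN_NAMES, values))


def invalid_columns(row):
    """Return the columns whose value is not in the codebook's value space."""
    return [name for name, value in row.items() if value not in CODEBOOK[name]]


# Made-up example row (not real data): every column set to 1.
example = label_row([1] * 31)
example["has_ip"] = -1  # -1 is not listed for has_ip in the codebook
print(invalid_columns(example))
```

A check like this is mainly useful as a sanity test right after loading the CSV, before any model is trained.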

R Script

Possible phishing websites are identified in R with the caret package.

  • The R script, phishing.R, first loads the required libraries and the dataset from the phishing.csv file
  • Column names are set using a names array (as shown in the codebook above)
  • The dataset is then split into training and test sets using caret's createDataPartition method
  • Three different models are then trained on the training set: boosted logistic regression, SVM with RBF kernel, and treebag
  • For each model, we get the confusionMatrix after predicting the samples from the test set
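The split-and-evaluate steps above can be sketched in plain Python. This is only an approximation under stated assumptions: `stratified_split` samples a fixed fraction within each class, which is the general idea behind caret's createDataPartition for a categorical target but not its exact algorithm, and `confusion_matrix` only counts (actual, predicted) pairs, whereas caret's confusionMatrix reports many more statistics. The toy labels and the majority-class "model" are made up for illustration; none of the three models from phishing.R is implemented here.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, p=0.75, seed=0):
    """Return (train_idx, test_idx), sampling a fraction p within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train = []
    for indices in by_class.values():
        k = round(p * len(indices))  # number of training rows for this class
        train.extend(rng.sample(indices, k))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == b for a, b in zip(actual, predicted))
    return correct / len(actual)


# Toy labels (made up): 60 rows of class 1 and 40 rows of class -1.
labels = [1] * 60 + [-1] * 40
train_idx, test_idx = stratified_split(labels, p=0.75, seed=42)

# Hypothetical stand-in model: always predict the training set's majority class.
majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
actual = [labels[i] for i in test_idx]
predicted = [majority] * len(actual)

matrix = confusion_matrix(actual, predicted)
print(len(train_idx), len(test_idx))
print(dict(matrix))
print(accuracy(actual, predicted))
```

Splitting within each class keeps the class balance of the training and test sets close to that of the full dataset, which makes the accuracies of different models easier to compare.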

Codebase Structure

  • Ipython Notebooks\ - contains IPython notebooks used with BigML and to partition the train and test sets
  • Datasets\ - contains CSV data files used in BigML and the R script
  • attributes.txt - contains info about the attributes in the dataset
  • phishing.R - R script that applies the treebag model (similar to the BigML ensemble)
  • Conclusion.pdf - answer to the question "Do you think these predictions are good?"
  • BigML_classification.py - Python script for calling and running the ensemble model on the BigML API
  • BigML_summary.txt - summary of the BigML model

Results

I was able to get 96.4% accuracy with the treebag model. Here is a plot of the variable importance in the treebag model.

[Figure: variable importance plot for the treebag model]