/Phishing-Dataset

Phishing dataset with more than 88,000 instances and 111 features. Web application available at. https://gregavrbancic.github.io/Phishing-Dataset/

Primary LanguageSvelte

Datasets for Phishing Websites Detection

In this repository the two variants of the phishing dataset are presented.

Web application

To preview the dataset interactively and/or tailor it to your needs, please visit a dedicated web application.

dataset_full.csv

Short description of the full variant dataset:

  • Total number of instances: 88,647
    • Number of legitimate website instances (labeled as 0): 58,000
    • Number of phishing website instances (labeled as 1): 30,647
  • Total number of features: 111 (without target)

dataset_small.csv

Short description of the small variant dataset:

  • Total number of instances: 58,645
    • Number of legitimate website instances (labeled as 0): 27,998
    • Number of phishing website instances (labeled as 1): 30,647
  • Total number of features: 111 (without target)

Extracted Features

Feature Description
qty_dot_url count (.) in URL
qty_hyphen_url count (-) in URL
qty_underline_url count (_) in URL
qty_slash_url count (/) in URL
qty_questionmark_url count (?) in URL
qty_equal_url count (=) in URL
qty_at_url count (@) in URL
qty_and_url count (&) in URL
qty_exclamation_url count (!) in URL
qty_space_url count ( ) in URL
qty_tilde_url count (~) in URL
qty_comma_url count (,) in URL
qty_plus_url count (+) in URL
qty_asterisk_url count (*) in URL
qty_hashtag_url count (#) in URL
qty_dollar_url count ($) in URL
qty_percent_url count (%) in URL
qty_tld_url top-level-domain length
length_url URL length
qty_dot_domain count (.) in domain
qty_hyphen_domain count (-) in domain
qty_underline_domain count (_) in domain
qty_slash_domain count (/) in domain
qty_questionmark_domain count (?) in domain
qty_equal_domain count (=) in domain
qty_at_domain count (@) in domain
qty_and_domain count (&) in domain
qty_exclamation_domain count (!) in domain
qty_space_domain count ( ) in domain
qty_tilde_domain count (~) in domain
qty_comma_domain count (,) in domain
qty_plus_domain count (+) in domain
qty_asterisk_domain count (*) in domain
qty_hashtag_domain count (#) in domain
qty_dollar_domain count ($) in domain
qty_percent_domain count (%) in domain
qty_vowels_domain count vowels in domain
domain_length domain length
domain_in_ip URL domain in IP address format
server_client_domain domain contains the keywords "server" or "client"
qty_dot_directory count (.) in directory
qty_hyphen_directory count (-) in directory
qty_underline_directory count (_) in directory
qty_slash_directory count (/) in directory
qty_questionmark_directory count (?) in directory
qty_equal_directory count (=) in directory
qty_at_directory count (@) in directory
qty_and_directory count (&) in directory
qty_exclamation_directory count (!) in directory
qty_space_directory count ( ) in directory
qty_tilde_directory count (~) in directory
qty_comma_directory count (,) in directory
qty_plus_directory count (+) in directory
qty_asterisk_directory count (*) in directory
qty_hashtag_directory count (#) in directory
qty_dollar_directory count ($) in directory
qty_percent_directory count (%) in directory
directory_length directory length
qty_dot_file count (.) in file
qty_hyphen_file count (-) in file
qty_underline_file count (_) in file
qty_slash_file count (/) in file
qty_questionmark_file count (?) in file
qty_equal_file count (=) in file
qty_at_file count (@) in file
qty_and_file count (&) in file
qty_exclamation_file count (!) in file
qty_space_file count ( ) in file
qty_tilde_file count (~) in file
qty_comma_file count (,) in file
qty_plus_file count (+) in file
qty_asterisk_file count (*) in file
qty_hashtag_file count (#) in file
qty_dollar_file count ($) in file
qty_percent_file count (%) in file
file_length file length
qty_dot_params count (.) in parameters
qty_hyphen_params count (-) in parameters
qty_underline_params count (_) in parameters
qty_slash_params count (/) in parameters
qty_questionmark_params count (?) in parameters
qty_equal_params count (=) in parameters
qty_at_params count (@) in parameters
qty_and_params count (&) in parameters
qty_exclamation_params count (!) in parameters
qty_space_params count ( ) in parameters
qty_tilde_params count (~) in parameters
qty_comma_params count (,) in parameters
qty_plus_params count (+) in parameters
qty_asterisk_params count (*) in parameters
qty_hashtag_params count (#) in parameters
qty_dollar_params count ($) in parameters
qty_percent_params count (%) in parameters
params_length parameters length
tld_present_params TLD presence in arguments
qty_params number of parameters
email_in_url email present in URL
time_response search time (response) domain (lookup)
domain_spf domain has SPF
asn_ip AS Number (or ASN)
time_domain_activation time (in days) of domain activation
time_domain_expiration time (in days) of domain expiration
qty_ip_resolved number of resolved IPs
qty_nameservers number of resolved name servers (NameServers - NS)
qty_mx_servers number of MX Servers
ttl_hostname time-to-live (TTL) value associated with hostname
tls_ssl_certificate valid TLS / SSL Certificate
qty_redirects number of redirects
url_google_index check if URL is indexed on Google
domain_google_index check if domain is indexed on Google
url_shortened check if URL is shortened
phishing is phishing website

Cite this dataset

G. Vrbančič, I. Jr. Fister, V. Podgorelec. Datasets for Phishing Websites Detection. Data in Brief, Vol. 33, 2020, DOI: 10.1016/j.dib.2020.106438