It is a common misconception that a padlock symbol next to the website URL means the site is safe. The padlock only indicates that communication between the user's browser and the website is encrypted, which helps protect data from eavesdropping or interception; it says nothing about who operates the site. Scammers previously didn't bother with digital certificates for fake websites, which made those sites easier to spot. They have since become more sophisticated, obtaining certificates for their fake sites or exploiting legitimate sites that already have one. The padlock is still helpful, but users should verify authenticity through other means, such as checking the URL and other trust indicators.
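Lexical features are statistical features extracted from the literal URL string, for example the length of the URL, the number of digits, the number of parameters in its query part, and whether the URL is encoded. Consider ‘amazon.com.support.info’: the string begins with a trusted brand name, but the registered domain is actually support.info, and the unusually deep subdomain structure is something a lexical feature can capture.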
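The statistics listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the project's actual extraction code; the function name `lexical_features` and the dictionary keys are hypothetical, and treating a `%` character as evidence of encoding is a simplifying assumption.

```python
from urllib.parse import urlparse, parse_qsl


def lexical_features(url: str) -> dict:
    """Compute a few string-level statistics for a URL (illustrative only)."""
    # urlparse only recognises the host when a scheme is present,
    # so prepend one for bare strings such as 'amazon.com.support.info'.
    has_scheme = url.lower().startswith(("http://", "https://"))
    parsed = urlparse(url if has_scheme else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "digit_count": sum(ch.isdigit() for ch in url),
        # Count every key=value pair in the query string, including repeats.
        "param_count": len(parse_qsl(parsed.query, keep_blank_values=True)),
        # Assumption: any '%' means percent-encoding is in use.
        "is_percent_encoded": "%" in url,
        "dot_count_in_host": host.count("."),
        "has_at_symbol": "@" in url,
        "has_hyphen_in_host": "-" in host,
    }


example = lexical_features("amazon.com.support.info")
```

For the example string, `dot_count_in_host` is 3, which hints at the nested subdomains even though the URL contains no digits or query parameters.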
Host-based features provide information about the host of the webpage, for example the country of registration, domain name properties, name servers, connection speed, and time to live from registration. The motivation for including them is that malicious and benign sites differ in their deployment tactics, how long they have existed, and their reputation.
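As a rough illustration of the longevity idea, the sketch below derives two time-based values from registration dates. It assumes the creation and expiry dates have already been obtained (for example from a WHOIS lookup, which is not shown here); the function name and both keys are hypothetical and are not taken from the project.

```python
from datetime import datetime


def registration_features(created: datetime, expires: datetime, now: datetime) -> dict:
    """Derive simple longevity values from registration dates (illustrative only)."""
    return {
        # How long the domain has existed so far.
        "age_of_domain_days": (now - created).days,
        # How much of the current registration period remains.
        "days_until_expiry": (expires - now).days,
    }


# Hypothetical dates: a domain registered two months ago for a single year.
example = registration_features(
    created=datetime(2024, 1, 1),
    expires=datetime(2025, 1, 1),
    now=datetime(2024, 3, 1),
)
```

A very young domain with a short registration period is not proof of malice, but under the reasoning above it is the kind of signal a classifier might weigh.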
Content-based features are obtained from the downloaded HTML code of the webpage. They capture the structure of the page and the content embedded in it, including information on script tags, embedded objects, executables, hidden elements, and so on. For example, in an SQL injection attack, anomalies such as malformed documents or repeated tags can show up in the raw HTML content.
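A minimal sketch of counting such structural elements follows. It uses the standard library's `html.parser` so that it runs without extra dependencies; the class name, the keys, and the rules for what counts as "hidden" are assumptions made for illustration, not the project's definitions.

```python
from html.parser import HTMLParser


class ContentFeatureParser(HTMLParser):
    """Count a few structural elements in raw HTML (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.counts = {"script": 0, "iframe": 0, "embedded_object": 0, "hidden_element": 0}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script":
            self.counts["script"] += 1
        elif tag == "iframe":
            self.counts["iframe"] += 1
        elif tag in ("object", "embed"):
            self.counts["embedded_object"] += 1
        # Assumption: an element is "hidden" if it has the `hidden` attribute,
        # type="hidden", or an inline style containing display:none.
        style = (attr.get("style") or "").replace(" ", "").lower()
        input_type = (attr.get("type") or "").lower()
        if "hidden" in attr or input_type == "hidden" or "display:none" in style:
            self.counts["hidden_element"] += 1


def content_features(html: str) -> dict:
    parser = ContentFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.counts


sample = """
<html><body>
<script src="a.js"></script>
<iframe src="http://example.test" style="display: none"></iframe>
<form><input type="hidden" name="token" value="x"></form>
<embed src="movie.swf">
</body></html>
"""
example = content_features(sample)
```

In the sample page the invisible iframe and the hidden form field are both counted as hidden elements, alongside one script and one embedded object.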
Lexical Features | Host-based Features | Content-based Features |
---|---|---|
url_of_anchor | registration_length | web_traffic |
sub_domain | age_of_domain | favicon |
having_- | having_ip | redirect |
links_in_tags | google_index | submitting_to_email |
sfh | dns_record | statistical_report |
request_url | mouse_over | |
url_length | iframe | |
https_token | rightclick | |
shortening_service | | |
having_@ | | |
abnormal_url | | |
having_// | | |
UI: HTML5, CSS3.
Backend: Python3.
Libraries: beautifulsoup4, googlesearch-python, scikit-learn, pandas, requests, whois.