URL Feature extractor

Feature extractor from the paper Federated Learning For Cyber Security: SOC Collaboration For Malicious URL Detection

Code to extract lexicographical features from URLs. It takes a CSV file of URLs as input and generates 72 features per URL.
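To illustrate the general idea, here is a minimal sketch of a lexical feature extraction over a CSV of URLs. It is not the extractor from this repository: the function names, the `url` column name, and the handful of features shown are assumptions chosen for illustration, and the actual 72 features are defined by the code and the paper.

```python
import csv
import io
from urllib.parse import urlparse


def extract_features(url: str) -> dict:
    """Compute a few illustrative lexical features for one URL (hypothetical set)."""
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_query_params": len(parsed.query.split("&")) if parsed.query else 0,
        # Crude check: the host consists only of digits and dots.
        "has_ip_host": host.replace(".", "").isdigit(),
    }


def extract_from_csv(handle, url_column: str = "url") -> list:
    """Read URLs from a CSV file object and return one feature dict per row."""
    reader = csv.DictReader(handle)
    return [
        {"url": row[url_column], **extract_features(row[url_column])}
        for row in reader
    ]


# Usage with an in-memory CSV; the column name "url" is an assumption.
sample = io.StringIO("url\nhttp://example.com/index.php?id=1&x=2\n192.168.0.1/login\n")
rows = extract_from_csv(sample)
for row in rows:
    print(row)
```

Every feature here is computed from the URL string alone, with no network lookups, which is what makes this kind of extraction cheap enough to run over hundreds of thousands of URLs.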

The features extracted from the dataset of more than 700K malicious and benign URLs can be found in the archive urls_final_complete.tar.xz.

The initial URL dataset is a collection drawn from several sources. The URLs are distributed among the malware, defacement, phishing, spam and benign classes. They are taken in particular from ISCX-URL-2016, which was further augmented with:

  1. Benign: Hacker News, PhishStorm, Ebbu2017 Dataset
  2. Malware: URLHaus
  3. Phishing: Openphish, PhishTank

The resulting collection of URLs can be found here

For more details on the class distribution, as well as our other experiments, please consult the paper.