This is an implementation of the paper: TransURL: Improving Malicious URL Detection with Multi-layer Transformer Encoding and Multi-scale Pyramid Features
The proposed framework is built on a character-aware pre-trained language model and combines a novel feature concatenation method with a multi-scale learning module that uses a pyramid spatial attention mechanism. It demonstrates strong performance in the detection of malicious URLs.
Folder/File Name | Purpose |
---|---|
/Data | Where the dataset files are stored under their respective dataset names. Contains all the datasets used in our paper. |
/character_bert_wiki | Where the pre-trained CharBERT model (based on BERT) is stored. You can also download it using this link. |
Multiple_attention.py | Contains the proposed multi-level attention module. |
data_processing.py | Contains functions for preprocessing the dataset, including converting URL strings into BERT's input format and splitting them into training and validation sets (a sketch of this flow follows the table). |
Model_MMA.py | Where the architecture of the proposed malicious URL detection model, PyraTrans, is stored. |
Model_CharBERT.py | Contains the original CharBERT model components. |
Train.py | Trains the model according to the given parameters. You need to modify the "pre-trained model path" and "dataset path". |
Test_binary.py | Tests the trained model in the binary classification experiments according to the given parameters. |
Test_Multiple.py | Tests the trained model in the multi-class classification experiment according to the given parameters. |
bert_utils.py | Basic source code from the original BERT that is used in the CharBERT structure. |
adverserial_generator.py | Where the functions for generating adversarial examples from specified samples are stored. |
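The preprocessing handled by data_processing.py converts raw URL strings into BERT-style inputs and splits them into training and validation sets. Below is a minimal sketch of that flow; it uses the Hugging Face BertTokenizer and scikit-learn's train_test_split purely as illustrative stand-ins, since the repository's own tokenization (CharBERT via bert_utils.py) and split logic may differ.

```python
# Illustrative sketch only; the repository's actual preprocessing lives in
# data_processing.py and relies on CharBERT's tokenization via bert_utils.py.
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split

PAD_SIZE = 200  # maximum URL length; shorter URLs are padded, longer ones truncated

def encode_urls(urls, tokenizer, pad_size=PAD_SIZE):
    """Turn raw URL strings into padded/truncated BERT-style input tensors."""
    return tokenizer(urls, padding="max_length", truncation=True,
                     max_length=pad_size, return_tensors="pt")

def make_splits(urls, labels, tokenizer, val_ratio=0.2):
    """Split parallel URL/label lists and encode each split."""
    train_u, val_u, train_y, val_y = train_test_split(
        urls, labels, test_size=val_ratio, stratify=labels, random_state=42)
    return encode_urls(train_u, tokenizer), train_y, encode_urls(val_u, tokenizer), val_y
```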
Python 3.8
PyTorch 2.0.0
In all datasets used for training or testing, each line contains the URL text string followed by its label, in the template:
<URL String> <label>
The label is either 1 (Malicious) or 0 (Benign).
Example:
http://www.exampleforbenign.com/urlpath/ 0
http://www.exampleformalicious.com/urlpath/ 1
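As a minimal sketch (not part of the repository), lines in this format can be parsed as follows; the dataset path shown is hypothetical:

```python
def read_url_file(path):
    """Yield (url, label) pairs from a file of '<URL String> <label>' lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            url, label = line.rsplit(" ", 1)  # the label is the last whitespace-separated field
            yield url, int(label)             # 1 = malicious, 0 = benign

pairs = list(read_url_file("Data/malware_urls.txt"))  # hypothetical dataset path
```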
You can adjust the hyperparameters for your specific computing device; in particular, batch_size, learning_rate, and train_epoch can be tuned further.
To train the model, use the Train.py file. Descriptions of the training parameters that influence the training phase are given below.
Parameter | Description | Default |
---|---|---|
dataset | An enumeration value that specifies the dataset used for training the model. | malware_urls.txt |
model_path | Output path for saving the trained model | /model.pth |
batch_size | Number of URLs in each training batch | 64 |
num_epoch | Number of training epochs | 3 |
lr | Learning rate | 2e-5 |
pad_size | The maximum number of characters in a URL. The URL is either truncated or padded with a <PADDING> character to reach this length. (Based on preliminary analysis, the maximum text length is 38, taking 32 to cover 99%.) | 200 |
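For orientation, the sketch below is a minimal training loop (not the repository's actual Train.py) showing where num_epoch, lr, and model_path come into play; the model's forward signature and the batch layout of the DataLoader are assumptions and may not match the actual PyraTrans implementation.

```python
import torch
from torch.optim import AdamW

def train(model, train_loader, num_epoch=3, lr=2e-5, model_path="model.pth", device="cuda"):
    """Hypothetical training loop; train_loader is assumed to be built with batch_size=64."""
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(num_epoch):
        for input_ids, attention_mask, labels in train_loader:  # assumed batch layout
            optimizer.zero_grad()
            logits = model(input_ids.to(device), attention_mask.to(device))  # assumed signature
            loss = loss_fn(logits, labels.to(device))
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), model_path)  # corresponds to the model_path parameter
```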