Implementation of the paper "Epitome"
To use Epitome, the following tools need to be installed:
- IDA Pro 7.3 - for generating the control dependency instructions and data dependency instructions of functions
- Python 2.7 - the scripts used by IDA Pro are written in Python 2.7
- miasm - for converting assembly programs to LLVM IR. We extended it to support more assembly instructions. Please copy the `miasm` package we provide directly into the Python directory of IDA Pro.
- Python 3.8 - the scripts used for training the models are written in Python 3.8
- NLTK and SentencePiece - for function name preprocessing
- Gensim - for label relationship identification
- We built the other components of Epitome with PyTorch and Transformers.
The repository is organized as follows:
- `pre_train/dataset_generation`: the scripts to generate the dataset for the assembly language model.
- `pre_train/pre_train_model`: the scripts for training the assembly language model.
- `dataset_generation`: the scripts to generate the dataset for multi-task learning.
- `training_evaluation`: the scripts for training and evaluating Epitome.
- `label_relationship_identify`: the scripts for function name processing.
To test the model, please first download the processed dataset and put it under the `dataset` directory. Because the training data is very large, it is difficult to upload for now; it will be made public later.
- We need to modify the `pre_train/dataset_generation/data_gen_config.py` file. A simple modification is listed below, but it needs to follow the directory structure we defined:
IDA32_DIR = "installation directory of 32-bit IDA Pro program"
IDA64_DIR = "installation directory of 64-bit IDA Pro program"
- We should set `ROOT_DIR` to the root path of the code and `DATA_ROOT_DIR` to the path of the binaries. An illustrative, fully filled-in config is sketched below.
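For illustration only, a filled-in `data_gen_config.py` might look like this (all paths here are hypothetical examples, not required values):

```python
# Hypothetical example values for pre_train/dataset_generation/data_gen_config.py
IDA32_DIR = "/opt/idapro-7.3/idat"     # 32-bit IDA Pro executable (example path)
IDA64_DIR = "/opt/idapro-7.3/idat64"   # 64-bit IDA Pro executable (example path)
ROOT_DIR = "/home/user/Epitome"        # root path of this code repository
DATA_ROOT_DIR = "/home/user/binaries"  # path of the binaries to analyze
```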
- We run the `pre_train/dataset_generation/data_gen_command.py` file to generate the control dependency instructions and data dependency instructions for functions:
```bash
cd pre_train/dataset_generation/
python2 data_gen_command.py
python3 4_merge_dataset.py
```
Note: all steps can be executed on a Linux system.
The script for training the model is `pre_train/pre_train_model/run_pretrain.sh`, in which you have to set the following parameters:
```
num_train_epochs = 20             # number of epochs for training
train_cfg_dataset = "ROOT_DIR/feature/single_cfg_train_X64.txt"  # training pairs of control dependency instructions
train_dfg_dataset = "ROOT_DIR/feature/single_dfg_train_X64.txt"  # training pairs of data dependency instructions
test_cfg_dataset = "ROOT_DIR/feature/single_cfg_test_X64.txt"    # validation pairs of control dependency instructions
test_dfg_dataset = "ROOT_DIR/feature/single_dfg_test_X64.txt"    # validation pairs of data dependency instructions
vocab_path = "./modelout/vocab"   # path where the vocab is saved
per_device_train_batch_size = 256 # training batch size
per_device_eval_batch_size = 16   # validation batch size
learning_rate = 5e-5              # learning rate
max_seq_length = 32               # length of instruction pairs
output_dir = "./modelout/"        # model save path
warmup_steps = 10000              # warm up the learning rate over this many updates
```
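The parameter names suggest the standard linear warmup-and-decay schedule from the HuggingFace Transformers library; whether `run_pretrain.sh` uses exactly this scheduler is our assumption. A minimal sketch of how `learning_rate` and `warmup_steps` would be wired up:

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Toy stand-in for the assembly language model; run_pretrain.sh builds the real one.
model = torch.nn.Linear(128, 128)
total_steps = 100_000  # hypothetical total number of parameter updates

optimizer = AdamW(model.parameters(), lr=5e-5)      # learning_rate
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10_000,         # warmup_steps: ramp the LR up first
    num_training_steps=total_steps,  # then decay it linearly to zero
)
```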
To train the model, run `run_pretrain.sh`:
```bash
cd pre_train/pre_train_model/
bash run_pretrain.sh
```
The pre-trained model can be obtained from pretrained model and the assembly instruction vocab from vocab. You can put them under the `pre_train/pre_train_model/modelout` directory.
- We need to modify the `dataset_generation/config.py` file. A simple modification is listed below, but it needs to follow the directory structure we defined:
```python
IDA32_DIR = "installation directory of 32-bit IDA Pro program"
IDA64_DIR = "installation directory of 64-bit IDA Pro program"
```
- We should set `ROOT_DIR` to the root path of the code and `DATA_ROOT_DIR` to the path of the binaries.
- We run the `dataset_generation/command.py` file to generate the CFG for functions.
- We run the `dataset_generation/2.3_process_function_name.py` file for function name preprocessing (a sketch of what this step does follows the commands below).
Note: all steps can be executed on a Linux system.
```bash
cd dataset_generation/
python2 command.py
python3 2.3_process_function_name.py
```
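For context, function name preprocessing typically splits identifiers into words and normalizes them (NLTK for lemmatization, with SentencePiece used for subword tokenization afterwards). A minimal sketch of the idea; the helper names below are illustrative, and the actual logic lives in `2.3_process_function_name.py`:

```python
import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time corpus download for the lemmatizer
lemmatizer = WordNetLemmatizer()

def split_function_name(name):
    """Split an identifier on underscores and camelCase boundaries."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)  # camelCase -> camel Case
    return [w.lower() for w in re.split(r"[_\s]+", name) if w]

def normalize_words(words):
    """Reduce inflected forms to a base form, e.g. 'names' -> 'name'."""
    return [lemmatizer.lemmatize(w) for w in words]

print(normalize_words(split_function_name("getUserNames")))  # ['get', 'user', 'name']
```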
In order to speed up the training process, we binarize the dataset. You have to set the following parameters:
```python
datapre = ''         # location of the data corpus and test dataset
inst_vocab_path = '../pre_train/pre_train_model/modelout/vocab'  # assembly instruction vocab generated by the pre-training model
label_vocab_path = './modelout/label_vocab'  # function name vocab path
max_label_num = 10   # function name length
node_len = 16        # instruction length
node_num = 256       # number of nodes in the fine-grained CFG
```
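To make these size parameters concrete, here is an illustrative sketch of the padding and truncation they imply; the helper below is our own, and the actual binarization logic is in `generate_dataset.py`:

```python
# Illustrative only: how max_label_num, node_len, and node_num constrain shapes.
PAD = 0  # hypothetical padding token id

def pad_or_truncate(seq, length, pad=PAD):
    """Force a token-id sequence to a fixed length."""
    return (seq + [pad] * length)[:length]

def binarize_function(node_token_ids, name_token_ids,
                      node_len=16, node_num=256, max_label_num=10):
    # Each node (instruction sequence) is clipped to node_len tokens,
    # the CFG is clipped to node_num nodes, and each function name
    # is clipped to max_label_num words.
    nodes = [pad_or_truncate(n, node_len) for n in node_token_ids[:node_num]]
    nodes += [[PAD] * node_len] * (node_num - len(nodes))
    name = pad_or_truncate(name_token_ids, max_label_num)
    return nodes, name
```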
To binarize the dataset, run `generate_dataset.py`:
```bash
cd training_evaluation
python3 generate_dataset.py
```
The script for training the model is `training_evaluation/train.py`; you have to set the following parameters in `training_evaluation/model_config.py`:
```python
datapre = '../dataset/X64_dataset'  # location of the data corpus and test dataset
node_vocab_path = '../pre_train/pre_train_model/modelout/vocab'  # assembly instruction vocab generated by the pre-training model
graph_label_vocab_path = './modelout/label_vocab'  # function name vocab path
pre_train_model = '../pre_train/pre_train_model/modelout/best_ceckpoint'  # pre-trained model path
save_path = './modelout/'
lr = 1e-4               # initial learning rate
min_lr = 1e-6           # minimum learning rate
dropout = 0.1           # dropout rate (1 - keep probability)
target_len = 10         # function name length
node_len = 16           # instruction length
node_num = 256          # number of nodes in the fine-grained CFG
num_blocks = 6          # number of Transformer blocks
num_heads = 8           # number of Transformer attention heads
batch_size = 64         # input batch size for training
epochs = 30             # number of epochs to train
emb_dim = 128           # size of embeddings
conv_feature_dim = 128  # size of the conv layer in node embedding
hidden_dim = 256        # hidden size
radius = 2              # diameter of ego networks
beam = 0                # beam size
factor = 0.5            # factor in the ReduceLROnPlateau learning rate scheduler
patience = 3            # patience in the ReduceLROnPlateau learning rate scheduler
```
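The comments above tie `factor`, `patience`, and `min_lr` to PyTorch's `ReduceLROnPlateau`; a minimal sketch of how the training loop presumably wires them up (the optimizer choice here is our assumption):

```python
import torch

model = torch.nn.Linear(128, 128)  # toy stand-in for Epitome's model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="min",   # watch a validation loss that should decrease
    factor=0.5,   # factor: halve the LR on a plateau
    patience=3,   # patience: wait 3 epochs without improvement
    min_lr=1e-6,  # min_lr: never go below this
)

for epoch in range(30):      # epochs
    val_loss = 1.0           # placeholder for the real validation loss
    scheduler.step(val_loss) # the LR is reduced when val_loss stalls
```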
To train the model, run `train.py`:
```bash
cd training_evaluation
python3 train.py
```
The pre-trained model can be obtained from pretrained model and the label vocab from vocab. You can put them under the `training_evaluation/modelout` directory.
The script for testing the model is `training_evaluation/test.py`; you have to set the following parameters in `training_evaluation/model_config.py`:
```python
test_datapre = '../dataset/test_X64_dataset'  # location of the test dataset
node_vocab_path = '../pre_train/pre_train_model/modelout/vocab'  # assembly instruction vocab generated by the pre-training model
graph_label_vocab_path = './modelout/label_vocab'  # function name vocab path
pre_train_model = '../pre_train/pre_train_model/modelout/best_ceckpoint'  # pre-trained model path
word_net_path = 'X64_wordnet.json'
save_path = './modelout/'
test_opt = ['O0', 'O1', 'O2', 'O3', 'Os']  # compiler optimization levels for testing
target_len = 10         # function name length
node_len = 16           # instruction length
node_num = 256          # number of nodes in the fine-grained CFG
num_blocks = 6          # number of Transformer blocks
num_heads = 8           # number of Transformer attention heads
batch_size = 64         # input batch size
epochs = 30             # number of epochs
emb_dim = 128           # size of embeddings
conv_feature_dim = 128  # size of the conv layer in node embedding
hidden_dim = 256        # hidden size
radius = 2              # diameter of ego networks
beam = 0                # beam size
```
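By analogy with the training step (the README does not show the command explicitly), testing is presumably run as:
```bash
cd training_evaluation
python3 test.py
```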
The predicted names are saved in the `training_evaluation/modelout/prediction` directory. Note that the evaluation results are printed to the console.
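The README does not define the printed evaluation metrics; a common choice for function name prediction is token-level precision/recall/F1 over the predicted name words. A self-contained sketch of that metric (our assumption, not necessarily the repository's exact scoring code):

```python
def prf1(predicted, reference):
    """Token-level precision/recall/F1 between predicted and true name words."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(prf1(["get", "user", "name"], ["get", "username"]))
```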
To address the noisy nature of natural language, we propose to generate distributed representations of function name words and calculate their semantic distance to identify the relationships between labels. We provide the script `label_relationship_identify/train_model.py` for training the label embedding model.
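Since Gensim is the listed dependency for this step, the idea can be sketched as follows (the toy corpus and skip-gram settings are our assumptions; the real training logic is in `train_model.py`):

```python
from gensim.models import Word2Vec

# Toy corpus of preprocessed function-name word sequences (made up for illustration).
sentences = [
    ["get", "user", "name"],
    ["fetch", "user", "name"],
    ["set", "buffer", "size"],
    ["get", "buffer", "len"],
]

# Train a small skip-gram model over the name words.
model = Word2Vec(sentences, vector_size=64, window=3, min_count=1, sg=1)

# Cosine similarity between word embeddings approximates the semantic
# distance used to identify related labels (e.g. synonyms like get/fetch).
print(model.wv.similarity("get", "fetch"))
```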