This directory contains a demo of a neural binary analysis tool. We test the framework on multiple binary analysis tasks: (i) vulnerability detection, (ii) code similarity measurement, (iii) decompilation, and (iv) malware analysis (coming later).
- Python 3.7.6
- Python packages
  - dgl 0.6.0
  - numpy 1.18.1
  - pandas 1.2.0
  - scipy 1.4.1
  - sklearn 0.0
  - tensorboard 2.2.1
  - torch 1.5.0
  - torchtext 0.2.0
  - tqdm 4.42.1
  - wget 3.2
- C++14 compatible compiler
- Clang++ 3.7.1
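
The pinned versions above can be checked before running anything. The following is a minimal sketch and not part of the repository: the `PINNED` table copies the list above, and `installed_version` and `find_mismatches` are hypothetical helpers. It assumes `importlib.metadata`, which is in the standard library from Python 3.8; on Python 3.7 the `importlib_metadata` backport would be needed.

```python
"""Compare installed package versions against the pinned list (illustrative sketch)."""

try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Versions copied from the requirements list above.
PINNED = {
    "dgl": "0.6.0",
    "numpy": "1.18.1",
    "pandas": "1.2.0",
    "scipy": "1.4.1",
    "sklearn": "0.0",
    "tensorboard": "2.2.1",
    "torch": "1.5.0",
    "torchtext": "0.2.0",
    "tqdm": "4.42.1",
    "wget": "3.2",
}


def installed_version(name):
    """Return the installed version of `name`, or None if it is not installed."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (wanted, found)} for every package whose version differs."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in sorted(find_mismatches(PINNED).items()):
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.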
- Download dataset
  - Download the POJ-104 dataset from here and extract it into `data/`.
- Compile and preprocess
  - Run `python preprocess/extract_obj.py -asm data/obj` (clang++-3.7.1 required).
  - Run `python preprocess/split_dataset.py -i data/obj -m p -o data/split.pkl` to split the dataset into train/valid/test sets.
  - Run `python preprocess/sim_preprocess.py` to compile the binary code into graph data.
  - *(Part of the preprocessing code is from [1].)
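
To illustrate the splitting step, here is a minimal sketch of a reproducible train/valid/test split written to a pickle file. It is not the logic of `preprocess/split_dataset.py`, whose internals (including the meaning of `-m p`) are not described here. The 60/20/20 ratio, the seed, and the `split_items` and `save_split` helpers are all assumptions made for illustration.

```python
import pickle
import random


def split_items(items, valid_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle `items` reproducibly and cut them into train/valid/test lists."""
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "valid": shuffled[:n_valid],
        "test": shuffled[n_valid:n_valid + n_test],
        "train": shuffled[n_valid + n_test:],
    }


def save_split(split, path):
    """Write the split dictionary to `path` with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(split, handle)
```

Seeding a private `random.Random` instance keeps the split stable across runs without affecting the global random state.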
- Cramming the binary dataset
  - The dataset is built on top of Devign [2]. We compile the entire library based on the commit ID and dump the binary code of the vulnerable functions. The cramming code is given in `preprocess/cram_vul_dataset`.
- Download Preprocessed data
  - Run `./preprocess.sh` (clang++-3.7.1 required), or
  - download the preprocessed datasets directly from here and extract them into `data/`.
  - Run `python preprocess/vul_preprocess.py` to compile the binary code into graph data.
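
As a rough picture of what "compiling binary code into graph data" can mean, the sketch below turns a toy control-flow description into integer node ids and edge lists. This is an assumption for illustration only: the graph type actually built by `preprocess/vul_preprocess.py` is not described here, and `blocks_to_edges` and `toy_blocks` are hypothetical names.

```python
def blocks_to_edges(blocks):
    """Turn {block_label: [successor_label, ...]} into node ids and edge lists."""
    # Assign consecutive integer ids in the order the blocks are listed.
    node_ids = {label: index for index, label in enumerate(blocks)}
    src, dst = [], []
    for label, successors in blocks.items():
        for successor in successors:
            if successor not in node_ids:
                raise KeyError(f"unknown successor block: {successor}")
            src.append(node_ids[label])
            dst.append(node_ids[successor])
    return node_ids, src, dst


# A hypothetical function with a loop: entry -> loop -> (loop | exit).
toy_blocks = {
    "entry": ["loop"],
    "loop": ["loop", "exit"],
    "exit": [],
}
node_ids, src, dst = blocks_to_edges(toy_blocks)
```

Parallel source and destination lists are a common input format for graph libraries, which is why the edges are returned in that shape.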
- Download dataset
  - Download the demo datasets (raw and preprocessed data) from here and extract them into `data/`. (More datasets to come.)
  - There is no need to compile the code into graphs again, as the data has already been preprocessed.
- Run `cd baseline_model && python run_similarity_check.py`
- Run `cd baseline_model && python run_vulnerability_detection.py`
- Dump the trace of tree expansion:
  - To accelerate the online processing of the tree output, dump the trace of the tree data by running `python -m preprocess.dump_trace`.
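
The idea of precomputing an expansion trace can be sketched as follows. The trace format used by `preprocess.dump_trace` is not described here, so this is a toy under stated assumptions: a tree is a `(label, children)` tuple, and the trace records each node in pre-order together with the step at which its parent was expanded. `expansion_trace` and `toy_tree` are hypothetical names.

```python
def expansion_trace(tree):
    """Return the pre-order expansion steps of `tree` as (parent_step, label) pairs.

    `tree` is a (label, children) tuple; the root has parent_step -1.
    """
    trace = []
    stack = [(tree, -1)]
    while stack:
        (label, children), parent_step = stack.pop()
        step = len(trace)
        trace.append((parent_step, label))
        # Push children in reverse so the leftmost child is expanded first.
        for child in reversed(children):
            stack.append((child, step))
    return trace


# Toy expression tree for x + (y * 2).
toy_tree = ("+", [("x", []), ("*", [("y", []), ("2", [])])])
trace = expansion_trace(toy_tree)
```

Computing such a flat list once and storing it means the tree does not have to be walked again every time a sample is loaded.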
- Training scripts:
  - First, `cd baseline_model`.
  - To train the model using torch parallel, run `python run_tree_transformer.py`.
  - To train it on multiple GPUs using distributed PyTorch, run `python run_tree_transformer_multi_gpu.py`.
  - To evaluate, run `python run_tree_transformer.py --eval`.
  - To evaluate a model trained on multiple GPUs, run `python run_tree_transformer_multi_gpu.py --eval`.
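
The four invocations above differ only in the script name and the `--eval` flag, so they can be wrapped in a small launcher. This is a convenience sketch, not something shipped with the repository: `tree_transformer_command` and `run` are hypothetical helpers, and the working directory is assumed to be `baseline_model`, as in the other run commands.

```python
import subprocess
import sys


def tree_transformer_command(multi_gpu=False, evaluate=False):
    """Build the command line for one of the four invocations listed above."""
    script = "run_tree_transformer_multi_gpu.py" if multi_gpu else "run_tree_transformer.py"
    command = [sys.executable, script]
    if evaluate:
        command.append("--eval")
    return command


def run(multi_gpu=False, evaluate=False, workdir="baseline_model"):
    """Run the chosen script from `workdir`, raising if it exits with an error."""
    subprocess.run(tree_transformer_command(multi_gpu, evaluate), cwd=workdir, check=True)
```

For example, `run(multi_gpu=True, evaluate=True)` corresponds to the last command in the list.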
[1] Ye, Fangke, et al. "MISIM: An End-to-End Neural Code Similarity System." arXiv preprint arXiv:2006.05265 (2020).
[2] Zhou, Yaqin, et al. "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks." Advances in Neural Information Processing Systems. 2019.
[3] Shi, Zhan, et al. "Learning Execution through Neural Code Fusion." ICLR (2019).
This repo is CC-BY-NC licensed, as found in the LICENSE file.