This is the official repository for BinShot (ACSAC '22), a practical binary code similarity detection tool with BERT-based transferable similarity learning.
Install the Python dependencies:
pip install -r requirements.txt
In addition, you need to install the following:
- python3 (tested on 3.8)
- IDA Pro (tested on 7.6)
- pytorch (tested on 1.11)
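The version requirements above can be checked before running anything else. The following is a minimal sketch, not part of the repository: the helper names `major_minor` and `installed_versions` are hypothetical, and it assumes PyTorch is installed under the distribution name `torch`. IDA Pro is not checked because it is installed outside of Python.

```python
import sys
from importlib import metadata

# Versions this README reports as tested; other versions may also work.
TESTED = {"python": "3.8", "torch": "1.11"}


def major_minor(version: str) -> str:
    """Reduce a version string such as '1.11.0+cu113' to '1.11'."""
    return ".".join(version.split("+")[0].split(".")[:2])


def installed_versions() -> dict:
    """Return major.minor versions of Python and PyTorch (None if absent)."""
    found = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    try:
        found["torch"] = major_minor(metadata.version("torch"))
    except metadata.PackageNotFoundError:
        found["torch"] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        note = "" if version == TESTED[name] else f" (tested with {TESTED[name]})"
        print(f"{name}: {version}{note}")
```

A mismatch is only a hint that results may differ from the tested setup, so the sketch prints a note rather than failing.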
The preprocessed data used in our paper is available at the following Google Drive link:
https://bit.ly/3xov03n
Download the data and move it into the "corpus" directory.
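After moving the data, it may help to confirm that the files referenced by the training commands in this README are in place. The sketch below is an illustration only: the file list is assembled from those commands (the download itself may contain more files or be organized differently), and `missing_corpus_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# File names taken from the pre-training and training commands in this README.
EXPECTED = [
    "pretrain.all.corpus.txt",
    "pretrain.all.corpus.voca",
    "binsim.all.train.corpus.txt",
    "binsim.all.valid.corpus.txt",
    "binsim.all.test.corpus.txt",
]


def missing_corpus_files(corpus_dir="corpus", expected=EXPECTED):
    """Return the expected file names that are absent from corpus_dir."""
    root = Path(corpus_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_corpus_files()
    print("all files present" if not missing else f"missing: {missing}")
```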
python3 bert_mlm.py \
-cd corpus/pretrain.all.corpus.txt \
-vp corpus/pretrain.all.corpus.voca \
-op models/pretrain
python3 binshot.py \
-bm models/pretrain/model_bert/bert_ep19.model \
-vp corpus/pretrain.all.corpus.voca \
-op models/downstream \
-r all \
-tn corpus/binsim.all.train.corpus.txt \
-vd corpus/binsim.all.valid.corpus.txt \
-tt corpus/binsim.all.test.corpus.txt
To get metrics across different compilers and optimization levels (e.g., clangO0 & gccO2), run the following command:
python3 result.py -s models/downstream/pred.test.all_all -v models/downstream
The result is written to a JSON file in models/downstream (the directory passed with the -v option).
python3 binshot.py \
-bm models/pretrain/model_bert/bert_ep19.model \
-vp corpus/pretrain.all.corpus.voca \
-op models/spec06 \
-r spec06 \
-tn corpus/binsim.spec06.train.corpus.txt \
-vd corpus/binsim.spec06.valid.corpus.txt \
-tt corpus/binsim.spec06.test.corpus.txt
python3 binshot.py \
-bm models/pretrain/model_bert/bert_ep19.model \
-vp corpus/pretrain.all.corpus.voca \
-op models/spec17 \
-r spec17 \
-tn corpus/binsim.spec17.train.corpus.txt \
-vd corpus/binsim.spec17.valid.corpus.txt \
-tt corpus/binsim.spec17.test.corpus.txt
python3 binshot.py \
-bm models/[spec06,spec17]/model_sim/bert_ep19.model \
-fm models/[spec06,spec17]/model_sim/sim_ep19.model \
-vp corpus/pretrain.all.corpus.voca \
-op models/[spec06,spec17] \
-r [gnu,spec06,spec17,rwp] \
-tt corpus/binsim.[gnu,spec06,spec17,rwp].test.corpus.txt
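The bracketed lists in the command above appear to be placeholders ("pick one of these") rather than literal shell syntax. Under that assumption, the sketch below expands them into concrete argument lists, one per model/dataset pair. The helper names are hypothetical, and whether every pairing is meaningful (for example, testing a spec06 model on spec06) is not stated here, so the sketch simply enumerates all of them.

```python
import shlex
from itertools import product

MODELS = ["spec06", "spec17"]                  # choices for models/[...]
DATASETS = ["gnu", "spec06", "spec17", "rwp"]  # choices for -r and -tt


def build_test_command(model: str, dataset: str) -> list:
    """Fill the placeholders of the test command for one model/dataset pair."""
    return [
        "python3", "binshot.py",
        "-bm", f"models/{model}/model_sim/bert_ep19.model",
        "-fm", f"models/{model}/model_sim/sim_ep19.model",
        "-vp", "corpus/pretrain.all.corpus.voca",
        "-op", f"models/{model}",
        "-r", dataset,
        "-tt", f"corpus/binsim.{dataset}.test.corpus.txt",
    ]


def all_test_commands() -> list:
    """Enumerate every model/dataset combination (an assumption, see above)."""
    return [build_test_command(m, d) for m, d in product(MODELS, DATASETS)]


if __name__ == "__main__":
    # Print the commands for inspection instead of executing them.
    for cmd in all_test_commands():
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run; each printed line can be pasted into a shell once the corresponding models have been trained.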
python3 binshot.py \
-bm models/downstream/model_sim/bert_ep19.model \
-fm models/downstream/model_sim/sim_ep19.model \
-vp corpus/pretrain.all.corpus.voca \
-op models/downstream \
-r cve \
-tt corpus/cve.corpus.txt
In this evaluation, a target should be predicted as positive if any of the functions of interest is similar to it.
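The "any match" rule can be sketched as follows. This is an illustration, not the logic of result_cve.py: the `(target, function, score)` tuple layout, the score range of 0 to 1, the 0.5 threshold, and the name `predict_targets` are all assumptions made for the example.

```python
from collections import defaultdict


def predict_targets(pair_scores, threshold=0.5):
    """Label each target positive if any function of interest matches it.

    pair_scores: iterable of (target_id, function_id, similarity_score),
    with scores assumed to lie in [0, 1].
    """
    best = defaultdict(float)  # highest score seen per target, starts at 0.0
    for target, _function, score in pair_scores:
        best[target] = max(best[target], score)
    # A target is positive as soon as its best match reaches the threshold.
    return {target: score >= threshold for target, score in best.items()}
```

For instance, a target compared against two functions of interest with scores 0.2 and 0.9 is labeled positive, because a single match above the threshold is enough.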
To get metrics corresponding to our realistic scenario (see the paper), run the following command:
python3 result_cve.py -s models/downstream/pred.test.cve_all -v models/downstream
The result is written to a text file in models/downstream (the directory passed with the -v option).
The following commands run on the sample binaries in our repo.
- Each binary must have execute permission.
- Binary names must follow the format "binname-IA-compiler-optlv" (e.g., find-amd64-gcc-O2).
- Run the following commands:
mkdir -p norm/findutils norm/cve norm/cve_strip
bash gen_ida.sh binary/findutils/
bash gen_ida.sh binary/cve/
bash gen_ida.sh binary/cve_strip/
bash gen_norm.sh binary/findutils/ norm/findutils/
bash gen_norm.sh binary/cve/ norm/cve/
bash gen_norm.sh binary/cve_strip/ norm/cve_strip/
python3 corpusgen.py -d binary/findutils/ -pkl norm/findutils/ -o corpus/ -t
python3 voca.py corpus/pretrain.findutils.corpus.voca.txt
python3 corpusgen.py -d binary/findutils/ -pkl norm/findutils/ -o corpus/ -b
python3 corpusgen.py -f corpus/binsim.findutils.corpus.txt -y binsimtask -p
python3 corpusgen.py -d binary/cve/ -pkl norm/cve/ -o corpus/ -c
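Before running the pipeline on your own binaries, the naming and permission requirements listed above can be checked up front. The sketch below is a hypothetical helper, not part of the repository. It assumes the last three hyphen-separated fields (architecture, compiler, optimization level) contain no hyphens themselves, so a binary name such as "my-tool" still parses.

```python
import os
from typing import NamedTuple


class BinaryName(NamedTuple):
    binname: str
    arch: str
    compiler: str
    optlv: str


def parse_binary_name(path: str) -> BinaryName:
    """Split 'binname-IA-compiler-optlv' into its four fields."""
    # Split from the right so that hyphens inside binname are preserved.
    parts = os.path.basename(path).rsplit("-", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"unexpected binary name: {path!r}")
    return BinaryName(*parts)


def is_ready(path: str) -> bool:
    """True when the name parses and the file has execute permission."""
    try:
        parse_binary_name(path)
    except ValueError:
        return False
    return os.access(path, os.X_OK)
```

For example, `parse_binary_name("binary/findutils/find-amd64-gcc-O2")` yields the fields `find`, `amd64`, `gcc`, and `O2`.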
This project is licensed under the MIT License - see the LICENSE file for details.
If your research employs BinShot, please cite the following paper:
@INPROCEEDINGS{binshot,
author = {Sunwoo Ahn and Seonggwan Ahn and Hyungjoon Koo and Yunheung Paek},
title = {Practical Binary Code Similarity Detection with BERT-based
Transferable Similarity Learning},
booktitle = {Proceedings of the 38th Annual Computer Security
Applications Conference (ACSAC)},
month = {Dec.},
year = {2022},
location = {Austin, Texas}
}