binary_tpl_detection
Dataset url: https://figshare.com/s/4a007e78f29243531b8c
Feature Extractor
- The extractor extracts features from all binary files under a given directory and saves them to a JSON file.
- Input: directory
- Output: two files, stored in a given target directory.
  - Information such as running time is stored in the status file.
  - Extracted features are stored in the features file, such as 9760608.json. The format of this JSON is a list of BinaryFile entities.
- It is recommended to put your task code under the consumer directory (in featureExtractor/bcat_client/src/main/java/thusca/bcat/client/consumer). See the example in consumer/BinFileFeatureExtractTest.java.
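As a quick sanity check on the extractor output, the features file can be loaded with any JSON library. The sketch below assumes only what is stated above, namely that the top-level value is a list with one entry per BinaryFile. The helper name load_binary_files is hypothetical, and the fields inside each entry are deliberately not inspected because they are not described here.

```python
import json
from pathlib import Path


def load_binary_files(features_path):
    """Load a features file and return its list of BinaryFile entries."""
    with Path(features_path).open("r", encoding="utf-8") as fh:
        entries = json.load(fh)
    # The features file is described as a list of BinaryFile entities;
    # fail early if the top-level value is anything else.
    if not isinstance(entries, list):
        raise ValueError(
            f"{features_path}: expected a JSON list, got {type(entries).__name__}"
        )
    return entries
```

For example, len(load_binary_files("9760608.json")) would report how many binaries were processed in that run.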
Pre-requisites
Basic knowledge of Java development, Spring Boot, and annotation-based development is assumed.
If you use an IDE such as VS Code or IntelliJ IDEA, a basic Java development environment needs to be installed, for example Java Extension Pack and Maven for Java. Note that the code uses Lombok annotations and Spring Boot, so the IDE may need the extensions Lombok Annotations Support and Spring Boot Tools to debug or run it. In addition, LibmagicJnaWrapper depends on libmagic to get the file type; please install this library and modify the paths in LibmagicJnaWrapper.java. It can be installed with apt on Linux or brew on macOS.
Build Artifact
Env:
- Java: Java 11.
- IntelliJ IDEA. (We have found that the extractor artifact builds reliably only with IntelliJ IDEA. Tested successfully with IntelliJ IDEA 2021.2 on Windows.)
Steps:
- Ghidra: 9.1.2. The file ghidra.jar is stored under /user/lib/ghidra.jar; put it under /featureExtractor/bcat_client/lib first.
- Open IDEA and open the project "binary_lib_detection-main\featureExtractor". Wait until indexing finishes; if an error occurs, try reopening or cleaning the project.
- File -> Project Structure -> Project SDK, select Java SDK 11.
- File -> Project Structure -> Artifacts -> "+" -> jar -> from modules with dependencies -> Module ("bcat_client") -> Main Class ("ClientApplication") -> JAR files from libraries (select "copy to the output directory and link via manifest").
- The jars will be generated at the path featureExtractor\out\artifacts\bcat_client_jar, with bcat_client.jar inside.
Task
Methods for all tasks are stored under the directory /consumer.
Building database: Code: Task2ExtractCoreFedora.java, Data: FedoraLib_Dataset. Set the save path and extract all features to build the TPL feature database. We use the directory ../data/CoreFedoraFeatureJson0505 as the save path.
Run
Zip the bcat_client_jar folder, upload it to a Linux server, unzip it, and run:
java -jar bcat_client.jar
Note: Java 11 required.
Func similarity Model
This model, based on the Gemini network, determines whether two functions are similar.
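Gemini-style models embed each function as a vector and compare two functions by the cosine similarity of their embeddings. The sketch below illustrates only that comparison step in plain Python. The helper names and the 0.8 threshold are assumptions for illustration and are not taken from this repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def is_similar(u, v, threshold=0.8):
    # threshold is a hypothetical value, not the one used by the trained model
    return cosine_similarity(u, v) >= threshold
```

Two embeddings pointing in the same direction score close to 1, while orthogonal embeddings score 0, so the threshold controls how strict the match is.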
Preparation and Data
Data is stored in ../data/vector_deduplicate_gemini_format_less_compilation_cases, or in Cross-5C_Dataset.7z on figshare.
By default, we use the path ../data under main/torch to store the data. Please copy the data there.
Environment Step
The network is written with PyTorch 1.8 in Python 3.8. The PyTorch installation below targets CUDA 11 (the cu111 wheels).
conda create -n tpldetection python=3.8 ipykernel
bash
conda activate tpldetection
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
pip install -r requirements.txt
Milvus v1.1.1 (a vector search engine) is required for function retrieval. It requires Docker 19.03 or higher. Reference: https://milvus.io/docs/v1.1.1/milvus_docker-gpu.md
sudo docker pull milvusdb/milvus:1.1.1-gpu-d061621-330cc6
mkdir -p /home/$USER/milvus/conf
cd /home/$USER/milvus/conf
wget https://raw.githubusercontent.com/milvus-io/milvus/v1.1.1/core/conf/demo/server_config.yaml
sudo docker run -d --name milvus_gpu_1.1.1 --gpus all \
-p 19530:19530 \
-p 19121:19121 \
-v /home/$USER/milvus/db:/var/lib/milvus/db \
-v /home/$USER/milvus/conf:/var/lib/milvus/conf \
-v /home/$USER/milvus/logs:/var/lib/milvus/logs \
-v /home/$USER/milvus/wal:/var/lib/milvus/wal \
milvusdb/milvus:1.1.1-gpu-d061621-330cc6
Run
Run the following command to train the model:
# train/validation dataset: /data/func_comparison/vector_deduplicate_our_format_less_compilation_cases/train_test
# test dataset: /data/func_comparison/vector_deduplicate_our_format_less_compilation_cases/valid
cd main/torch
bash run.sh
A trained model is saved under ../data/7fea_contra_torch_b128/saved_model/
Library detection
Database
Embedding
Raw feature database: ../data/CoreFedoraFeatureJson0505
Embeddings:
Set the path ../data/CoreFedoraFeatureJson0505 as args.fedora_js.
You can use multiprocessing to speed this up; the code in core_fedora_embeddings.py is as follows:
with Pool(10) as p:
    p.starmap(core_fedora_embedding, [(i, True) for i in range(10)])
All embeddings are saved under args.save_path.
We use the path ../data/7fea_contra_torch_b128/core_funcs as the save path.
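The Pool(10) call above passes a worker index from 0 to 9 to each process. One common way to use such an index is to give each worker every tenth input file. Whether core_fedora_embedding splits its input this way is an assumption, and shard_for_worker below is a hypothetical helper shown only to illustrate the pattern.

```python
def shard_for_worker(items, worker_id, num_workers=10):
    """Return the items that worker `worker_id` would process.

    Items are sorted first so every worker sees the same order, then
    assigned round-robin by position.
    """
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    ordered = sorted(items)
    return ordered[worker_id::num_workers]
```

With this scheme the shards are disjoint and together cover every input, so no feature file is embedded twice or skipped.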
Indexing and Building Milvus dataset
Run build_milvus_database.py to build the function vector database using Milvus.
The function get_bin_fcg generates an index file that maps each binary to its functions, which accelerates later steps.
get_bin2func_num generates an index from each binary to the number of functions in it.
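To make the relationship between the two indexes concrete: the second can be derived from the first by counting. The sketch below assumes the binary-to-functions index is a dictionary from binary name to a list of function identifiers. That layout, the helper name count_functions, and the example names are assumptions, not the actual format produced by get_bin_fcg or get_bin2func_num.

```python
def count_functions(bin2funcs):
    """Map each binary name to the number of functions listed for it."""
    return {binary: len(funcs) for binary, funcs in bin2funcs.items()}


# Hypothetical index contents, for illustration only.
example_index = {
    "libfoo.so": ["foo_init", "foo_read", "foo_close"],
    "libbar.so": ["bar_main"],
}
example_counts = count_functions(example_index)
```

Keeping the counts in a separate, small index avoids loading every function list when only the size of a binary is needed.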
Detection
Data: detection_targets. First, extract features from the APKs. See the method localExtractOSSPoliceApks in TaskProcessTargets.java under the directory consumer. We use the directory ../data/detection_targets/feature_json to save all extracted features.
See the function detect_v2 in function_vector_channel.
Other methods + FCG Filter can be found in the files xxx_afcg.py.
Baselines are under the directory /related_work.
We combine the basic feature channel (B2SFinder (basic features) + FCG Filter) and the function vector channel to report the final results.
All files named analyze_results.py are used to calculate precision and recall.
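For reference, precision and recall over detected libraries follow the usual definitions: precision is the fraction of reported libraries that are correct, and recall is the fraction of ground-truth libraries that were reported. The sketch below is a minimal set-based version. It is not the code in analyze_results.py, and the function name is illustrative.

```python
def precision_recall(detected, ground_truth):
    """Return (precision, recall) for a set of detected library names."""
    detected, ground_truth = set(detected), set(ground_truth)
    true_positives = len(detected & ground_truth)
    # Define both metrics as 0.0 when their denominator is empty.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For instance, reporting three libraries of which two are correct, against four true libraries, gives a precision of 2/3 and a recall of 1/2.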