Sample data & train output (checkpoint):
https://drive.google.com/drive/folders/1mr8SYoCvkJEcT0YWV1UMAsZQg1h3Qqei?usp=sharing
Full data (binaries):
https://drive.google.com/drive/folders/1_FzZUIXfC652Z7GdlT9lyYUikx9Q4MiR?usp=sharing
## Prepare data, create graphs

```
python prepare_data.py
```

Modify `config.prepare.json`:

```json
{
  "dir_data_report": "data/reports/TuTu_sm",
  "dir_data_json": "data/json/TuTu_sm",
  "dir_data_embedding": "data/embeddings/TuTu_sm",
  "dir_data_graph": "data/graphs/TuTu_sm",
  "dir_data_graphviz": "data/graphviz/TuTu_sm",
  "dir_data_networkx": "data/nx/TuTu_sm",
  "dir_data_pickle": "data/pickle/TuTu_sm",
  "mapping_labels": {"benign": 0, "malware": 1},
  "do_draw": true,
  "split_train_test": false,
  "train_ratio": 0,
  "train_list_file": "data/TuTu_sm_train_list.txt",
  "test_list_file": "data/TuTu_sm_test_list.txt",
  "process_from": "report",
  "train_embedder": true,
  "edge_embedder": "tfidf",
  "node_embedder": "tfidf",
  "preprocess_level": "word",
  "max_ft": 100000,
  "top_k": 3,
  "vector_size": 10,
  "dm": 0
}
```
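Several of these fields constrain each other (for example, `train_ratio` only matters when `split_train_test` is true). A small sanity check before running `prepare_data.py` can catch inconsistent configs early. The checks below are a sketch based on this README's field descriptions, not code from the repository:

```python
def check_prepare_config(cfg: dict) -> list:
    """Return a list of human-readable problems found in a prepare config.

    Illustrative only: the rules encode the constraints described in this
    README, not the repository's actual validation (if any).
    """
    problems = []
    if cfg.get("split_train_test"):
        # train_ratio must be > 0 when the script does the splitting itself
        if not cfg.get("train_ratio", 0) > 0:
            problems.append("train_ratio must be > 0 when split_train_test is true")
    if cfg.get("train_embedder") and cfg.get("process_from") != "report":
        problems.append("train_embedder can be true only when process_from is 'report'")
    for key in ("node_embedder", "edge_embedder"):
        if cfg.get(key) not in ("tfidf", "doc2vec"):
            problems.append(f"{key} must be 'tfidf' or 'doc2vec'")
    if cfg.get("process_from") not in ("report", "json", "graph"):
        problems.append("process_from must be 'report', 'json' or 'graph'")
    return problems

cfg = {
    "split_train_test": False,
    "train_ratio": 0,
    "process_from": "report",
    "train_embedder": True,
    "node_embedder": "tfidf",
    "edge_embedder": "tfidf",
}
print(check_prepare_config(cfg))  # → []  (the config shown above is consistent)
```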
- `dir_data_report`: path to the directory that contains Cuckoo reports, divided into N sub-folders (N = number of classes/labels). E.g., `dir_data_report` has 2 folders: `benign`, `malware`.
- `dir_data_json`: path to the directory that contains the JSON file of processed behaviors for each report. Some reports may have empty behaviors (no API calls): the binary may require user interaction to execute, so it failed to run in the Cuckoo environment and no API calls were generated.
- `dir_data_embedding`: for node/edge embeddings.
- `dir_data_graph`: stores the graph generated from each report individually (DGL graph).
- `dir_data_graphviz`: stores `.dot` (and `.svg`) files for visualization (graphviz).
- `dir_data_networkx`: stores `.dot` (and `.svg`) files for visualization (networkx).
- `dir_data_pickle`: path to the directory that stores everything needed for training (the final output of `PrepareData`).
- `mapping_labels`: a dict that maps each label to an int64 code. The labels are the sub-folder names under each `dir_data...` directory; the int64 codes are used for training.
- `do_draw`: if `true`, generate graphviz and networkx graphs for visualization and save them to `dir_data_graphviz` and `dir_data_networkx`.
- `split_train_test`: `true` if we want the code to split the set into `train` and `test` sets; `false` if we already have lists of files to use for train and test.
  - NOTE 1: Even if `split_train_test = false`, the two list files can still be overwritten: filepaths listed in them that have no API calls (so no graph can be constructed) are removed from the lists.
  - NOTE 2: The `test` set here is used only for testing inference. During training, the `train` set here is further divided into `train` and `val` sets (refer to `app.py`).
- `train_ratio`: used only if `split_train_test = true` (must be `> 0` if used).
- `train_list_file`: path to save the list of files we want to use for training. (If `split_train_test = false`, this file must exist and must not be empty.)
- `test_list_file`: path to save the list of files we want to use for testing. (If `split_train_test = false`, this file must exist and must not be empty.)
- `process_from`: accepts 3 values:
  - `report`:
    - (a) Read inputs from `dir_data_report`.
    - (b) Process behaviors and save the processed behaviors to `dir_data_json`.
    - (c) (optional) Save graphviz/networkx graphs to `dir_data_graphviz` and `dir_data_networkx`.
    - (d) Run node/edge embedding.
    - (e) Create DGL graphs and save them to `dir_data_graph`.
    - (f) Pack everything needed for training and save it to `dir_data_pickle`.
  - `json`:
    - Read inputs from `dir_data_json`.
    - Perform steps (c)–(f) above (as in the `report` option).
  - `graph`:
    - Read inputs from `dir_data_graph`.
    - Pack everything needed for training and save it to `dir_data_pickle`.
- `train_embedder`: train the node/edge embedder. Can be `true` only if `process_from = "report"`.
- `node_embedder`, `edge_embedder`: accept `tfidf` or `doc2vec`.
- `max_ft`, `top_k`: parameters for the `tfidf` embedder.
- `vector_size`, `dm`: parameters for the `doc2vec` embedder.
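To make the `tfidf` parameters concrete: `max_ft` caps the vocabulary size and `top_k` keeps only the highest-scoring terms per document (here, the API-call sequence for a node/edge). A stdlib-only sketch of that idea, using sklearn-style smoothed idf; the function name and the exact scoring are illustrative assumptions, not the repository's embedder:

```python
import math
from collections import Counter

def tfidf_top_k(docs, max_ft=100000, top_k=3):
    """Return, for each document (a list of tokens), its top_k terms by tf-idf.

    Illustrative: mirrors the role of `max_ft` (vocabulary cap) and `top_k`
    from config.prepare.json, not the repo's actual implementation.
    """
    # Vocabulary capped at the max_ft most frequent terms overall.
    total = Counter(t for d in docs for t in d)
    vocab = {t for t, _ in total.most_common(max_ft)}
    n = len(docs)
    # Document frequency per term (sklearn-style smoothing below avoids idf = 0).
    df = Counter(t for d in docs for t in set(d) if t in vocab)
    out = []
    for d in docs:
        tf = Counter(t for t in d if t in vocab)
        scores = {t: (c / len(d)) * (math.log((1 + n) / (1 + df[t])) + 1)
                  for t, c in tf.items()}
        # Highest score first; ties broken alphabetically for determinism.
        out.append(sorted(scores, key=lambda t: (-scores[t], t))[:top_k])
    return out

docs = [
    ["CreateFileW", "ReadFile", "ReadFile", "CloseHandle"],
    ["CreateFileW", "RegOpenKeyExW", "RegSetValueExW", "CloseHandle"],
]
print(tfidf_top_k(docs, top_k=2))
# → [['ReadFile', 'CloseHandle'], ['RegOpenKeyExW', 'RegSetValueExW']]
```

The real embedder most likely relies on a library such as scikit-learn's `TfidfVectorizer` (whose `max_features` parameter plays the same role as `max_ft`); this sketch only shows what the two knobs control.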
## Train

```
python app.py
```

Modify `config.model.json`:

```json
{
  "dir_data_pickle": "data/pickle/TuTu_sm",
  "mapping_labels": {"benign": 0, "malware": 1},
  "model_config": {
    "layer_type": "edGNNLayer",
    "layer_params": {
      "n_units": [8, 8],
      "activation": ["relu", "relu", "relu"]
    }
  },
  "learning_config": {
    "lr": 0.001,
    "epochs": 100,
    "weight_decay": 0.001,
    "batch_size": 16,
    "gpu": -1
  },
  "early_stopping": {
    "patience": 100
  }
}
```
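`early_stopping.patience` controls when training halts: if the validation metric has not improved for `patience` consecutive epochs, training stops even before `epochs` is reached. A minimal sketch of the mechanism (illustrative; the actual logic lives in the training loop driven by `app.py` and may differ):

```python
class EarlyStopping:
    """Stop training once val loss hasn't improved for `patience` epochs.

    Sketch of the `early_stopping.patience` setting in config.model.json;
    not the repository's implementation.
    """
    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss   # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1   # no improvement this epoch
        return self.bad_epochs >= self.patience

es = EarlyStopping(patience=3)
for epoch, loss in enumerate([1.0, 0.8, 0.9, 0.85, 0.81, 0.82]):
    if es.step(loss):
        print(f"stop at epoch {epoch}")  # prints "stop at epoch 4"
        break
```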