A framework for large-scale static program analysis. The master branch is at version 1.0-SNAPSHOT.
BigSpa is built on the big data processing platform Spark, the distributed file system HDFS, and the distributed in-memory database Redis, so you need to install these first.
We use Maven to build our project. To build BigSpa you need:
- Unix-like environment
- Git
- Maven
- Java 8 or 11
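
Before cloning, it can help to confirm that the build tools listed above are on your `PATH`. The following is a minimal sketch, not part of BigSpa: `check_tool` is a hypothetical helper name, and the check only tests for presence, so it does not verify that the Java version is 8 or 11.

```shell
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper, not something BigSpa provides.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# Check the tools needed to fetch and build the project.
for tool in git mvn java; do
  check_tool "$tool"
done
```
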
git clone https://github.com/PasaLab/BigSpa.git
cd BigSpa
mvn clean install -DskipTests
To perform offline batch or online incremental static program analysis using BigSpa, you can write scripts in the following format.
# Notes on the command below. A comment cannot follow a line-continuation
# backslash, so the annotations are collected here:
#   --class                                         main function
#   BigSpa-1.0-SNAPSHOT-jar-with-dependencies.jar   the JAR package to run
#   input_graph                                     the input graph data
#   input_grammar                                   the grammar
#   file_index_f, file_index_b                      for Linux data input
#   Split_Threshold                                 threshold of node split
spark-submit \
--master yarn \
--deploy-mode client \
--name TASK_NAME \
--class Redis_pt \
--num-executors 16 \
--executor-cores 24 \
--executor-memory 16G \
--conf spark.storage.unrollMemoryThreshold=10000000 \
--conf spark.locality.wait.time=1ms \
--conf spark.locality.wait.node=1ms \
--files data/pasa.conf.prop \
--driver-memory 16G \
--driver-class-path /home/user/class/path \
\
BigSpa-1.0-SNAPSHOT-jar-with-dependencies.jar \
\
islocal,false \
master,hdfs://master:9001/ \
input_graph,/path/to/graph/data/InputGraph/hdfs_pointsto \
input_grammar,/path/to/grammar/data/alias-complete.grammar \
output,/BigSpa_Output/result/hdfs_pt_Redis \
checkpoint_output,hdfs://master:9001/BigSpa/checkpoint1 \
updateRedis_interval,500000 \
queryRedis_interval,50000 \
\
defaultpar,1152 \
clusterpar,384 \
newnum_interval,100000000 \
checkpoint_interval,20 \
\
file_index_f,0 \
file_index_b,12 \
\
check_edge,false \
outputdetails,false \
output_Par_INFO,false \
\
Split_Threshold,1000000
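
In a shell, a `#` comment placed after a line-continuation backslash ends the command, so per-parameter notes cannot sit inside one long continued command. One way to keep a note next to each `key,value` argument is to build the argument list one line at a time. The sketch below is an illustration only: `bigspa_args` is a hypothetical helper name, and it covers just a few of the parameters from the script above.

```shell
# Print the BigSpa application arguments, one per line.
# bigspa_args is a hypothetical helper; extend it with the remaining parameters.
bigspa_args() {
  set --                                                                  # start from an empty argument list
  set -- "$@" islocal,false                                               # not local debugging
  set -- "$@" master,hdfs://master:9001/                                  # HDFS address
  set -- "$@" input_graph,/path/to/graph/data/InputGraph/hdfs_pointsto    # the input graph data
  set -- "$@" input_grammar,/path/to/grammar/data/alias-complete.grammar  # grammar
  set -- "$@" defaultpar,1152                                             # default number of partitions
  set -- "$@" clusterpar,384                                              # partitions in the cluster
  printf '%s\n' "$@"
}

# Show the arguments that would be appended after the JAR name.
bigspa_args
```

The output can then be placed after the JAR name with `$(bigspa_args)`. This relies on the values containing no whitespace, which holds for the parameters shown here.
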
Run the script from the directory that contains the JAR package. The important parameters are described in the table below.
| Category | Parameter | Description | Value for Reference |
|---|---|---|---|
| Spark script params | master | Spark master (cluster manager) | yarn |
| | deploy-mode | running mode | client |
| | name | Spark application name | BigSpa.offline.psql.df |
| | class | main function | OFFLINE.Redis_pt |
| | num-executors | number of executors | 16 |
| | executor-cores | number of cores per executor | 24 |
| | executor-memory | memory per executor | 16G |
| | conf | Spark or Java parameters | |
| | files | file path for Redis | data/pasa.conf.prop |
| | driver-memory | | 16G |
| | driver-class-path | | |
| General parameters | islocal | whether to perform local debugging | FALSE |
| | master | HDFS address | hdfs://master:9001/ |
| | input_graph | file path of the input graph | /data/linux.pt |
| | input_grammar | file path of the grammar | /data/grammar |
| | output | output path | /BigSpa/result/hdfs_pt_Redis |
| | checkpoint_output | save path for checkpoint files | hdfs://master:9001/BigSpa/checkpoint |
| | updateRedis_interval | batch size for updating Redis | 500000 |
| | queryRedis_interval | batch size for querying Redis | 50000 |
| | defaultpar | default number of partitions | 384/768/1152 |
| | clusterpar | number of partitions in the cluster (`num-executors * executor-cores`) | |
| | newnum_interval | threshold for automatically adding partitions | |
| | checkpoint_interval | number of iterations after which the lineage is cut off | |
| | file_index_f | for the Linux database; used to decide which files to merge and execute | 0 |
| | file_index_b | for the Linux database; used to decide which files to merge and execute | 12 |
| | check_edge | whether to output edge information | FALSE |
| | output_Par_INFO | whether to perform automatic partition adjustment | TRUE |
| param for node split | Split_Threshold | when the number of predicted collars exceeds `$ * 16`, split the node | |
| params for computation closure | input_e | E as described in the paper | |
| | input_n | N as described in the paper | |
| | is_complete_loop | whether to perform local closure operations | true |
| | original_loop_turn | number of small local closure execution rounds | 5 |
| | max_loop_turn | number of large local closure execution rounds | 100 |
| | convergence_threshold | condition for executing large local closures: when fewer than `$` new edges are generated per round, large local closures are performed | 10000 |
| params for incremental updates | changemode_interval | calculation mode update threshold: when the number of new edges per round reaches `$`, switch from stand-alone to distributed | 50000 |
| | add | path of input batches | data/httpd.pt.batch/batches |

In the descriptions, `$` stands for the value of the parameter in that row.
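
The table describes `clusterpar` as `num-executors * executor-cores`. With the example script's 16 executors and 24 cores each, that gives 384, which matches `clusterpar,384`. The reference values for `defaultpar` (384/768/1152) are 1, 2 and 3 times that number, so the sketch below assumes `defaultpar` is chosen as a small multiple of `clusterpar`. That relationship is an inference from the reference values, not something the documentation states, and the names `cluster_par`, `default_par` and `par_multiplier` are hypothetical.

```shell
# Executor layout taken from the example spark-submit script.
num_executors=16
executor_cores=24
# Assumption: defaultpar is a small multiple of clusterpar (3 in the example script).
par_multiplier=3

# clusterpar = num-executors * executor-cores
cluster_par() { echo $(( $1 * $2 )); }
# defaultpar = clusterpar * multiplier (assumed relationship)
default_par() { echo $(( $1 * $2 * $3 )); }

clusterpar=$(cluster_par "$num_executors" "$executor_cores")
defaultpar=$(default_par "$num_executors" "$executor_cores" "$par_multiplier")

echo "clusterpar,$clusterpar"   # prints clusterpar,384
echo "defaultpar,$defaultpar"   # prints defaultpar,1152
```

If you change `--num-executors` or `--executor-cores`, recomputing both values this way keeps the application arguments consistent with the Spark options.
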