benchpress: A Python repository from ncherric

Benchpress [1] is a Snakemake workflow where structure learning algorithms, implemented in possibly different languages, can be executed and compared. The computations scale seamlessly on multiple cores or "... to server, cluster, grid and cloud environments, without the need to modify the workflow definition" - Snakemake. The documentation is found at https://benchpressdocs.readthedocs.io.

Benchpress provides the following main functionalities:

Benchmarks - Benchmark publicly available structure learning algorithms.
Algorithm development - Benchmark your own algorithm along with the existing ones while developing.
Data analysis - Estimate the underlying graph structure for your own dataset(s).

Requirements

Linux

Notes

Some systems require explicit installation of squashfs-tools. Using conda, it can be installed as

$ conda install -c conda-forge squashfs-tools

macOS/Windows

Benchpress cannot run directly on macOS/Windows as it requires Singularity, which is only supported on Linux. However, Linux and the requirements above can be installed on a virtual machine using, e.g., VirtualBox.

VirtualBox (instructions for installing Ubuntu)

Installation

Clone and install

As Benchpress is a Snakemake workflow, once the requirements are installed no further installation is needed beyond cloning the repository:

$ git clone https://github.com/felixleopoldo/benchpress.git
$ cd benchpress

Notes

If you are using a virtualiser such as VirtualBox, this folder should still be located on macOS/Windows and shared to the virtual machine. In this way, all the files used by Benchpress are reachable from macOS/Windows.

Usage

Benchpress supports five different data scenarios, obtained by combining different sources of graphs, parameters and data.

Scenario	Graph	Parameters	Data
I	-	-	Fixed
II	Fixed	-	Fixed
III	Fixed	Fixed	Generated
IV	Fixed	Generated	Generated
V	Generated	Generated	Generated
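The table can also be read as a lookup from the three sources to a scenario label. The following is a minimal Python sketch of that reading; the names SCENARIOS and data_scenario are hypothetical and not part of Benchpress, and None stands for the "-" entries.

```python
from typing import Optional

# Scenario table from above: (graph, parameters, data) -> scenario label.
# None stands for "-" (not used), "fixed" for an existing file,
# and "generated" for an object sampled by a module.
SCENARIOS = {
    (None, None, "fixed"): "I",
    ("fixed", None, "fixed"): "II",
    ("fixed", "fixed", "generated"): "III",
    ("fixed", "generated", "generated"): "IV",
    ("generated", "generated", "generated"): "V",
}


def data_scenario(graph: Optional[str], parameters: Optional[str], data: str) -> str:
    """Return the scenario label (I-V) for a combination of sources."""
    key = (graph, parameters, data)
    if key not in SCENARIOS:
        raise ValueError(f"unsupported combination: {key}")
    return SCENARIOS[key]


print(data_scenario("fixed", "generated", "generated"))  # IV
```

Combinations that do not appear in the table, such as a generated graph with fixed parameters, raise a ValueError in this sketch.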

The directory resources/ contains the fixed graphs, parameters, and datasets that are already available. It contains, e.g., all the graphs (and corresponding parameters) from the Bayesian networks repository, downloaded from bnlearn's homepage. You can also place your own files in the corresponding directories and use them in the same way as the existing ones. The methods to generate graphs, parameters and data are listed further down.

Example study

This study is an example of data scenario V based on three continuous datasets corresponding to three realisations of a random linear Gaussian structural equation model (SEM) with random DAG. The DAGs are sampled from a restricted Erdős–Rényi distribution using the pcalg_randdag module and the weight parameters are sampled uniformly using the sem_params module. For simplicity we use only a few structure learning modules here (bidag_itsearch, tetrad_fges, bnlearn_tabu, pcalg_pc) with different parameter settings. The full setup is found in config/config.json.
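To make the data-generating model of this study concrete, here is a minimal Python sketch of a random linear Gaussian SEM on a random DAG. It is a simplified illustration and not the implementation used by pcalg_randdag or sem_params: edges are drawn independently (a plain Erdős–Rényi model without the restriction mentioned above), and the edge probability, weight range, unit noise variance and all function names are assumptions chosen for the example.

```python
import random


def sample_dag(p, edge_prob, rng):
    """Sample a DAG by drawing each edge i -> j (i < j) independently.

    Only allowing edges from a lower to a higher index guarantees acyclicity.
    """
    adj = [[0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            if rng.random() < edge_prob:
                adj[i][j] = 1
    return adj


def sample_weights(adj, low, high, rng):
    """Give every edge a weight drawn uniformly from [low, high]."""
    return [[rng.uniform(low, high) if a else 0.0 for a in row] for row in adj]


def sample_sem_data(weights, n, rng):
    """Draw n i.i.d. samples from X_j = sum_i W[i][j] * X_i + N(0, 1) noise."""
    p = len(weights)
    rows = []
    for _ in range(n):
        x = [0.0] * p
        for j in range(p):  # index order is a topological order of the DAG
            mean = sum(weights[i][j] * x[i] for i in range(j))
            x[j] = mean + rng.gauss(0.0, 1.0)
        rows.append(x)
    return rows


rng = random.Random(1)  # fixed seed so the realisation is reproducible
dag = sample_dag(p=5, edge_prob=0.4, rng=rng)
weights = sample_weights(dag, low=0.25, high=1.0, rng=rng)
data = sample_sem_data(weights, n=100, rng=rng)
print(len(data), len(data[0]))  # 100 5
```

Repeating the three calls with different seeds would give several realisations of graph, parameters and data, which is the sense in which everything is "generated" in scenario V.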

To run this study (333 jobs, about 10 minutes on a 2-core laptop), type

$ snakemake --cores all --use-singularity --configfile config/config.json

The plots for this study are generated by the benchmarks module and by the graph_plots module.

Getting started with your own structure learning algorithm

You can easily use your own algorithm in Benchpress regardless of the programming language. Perhaps the best way to get started is to use and alter the template R script mylib_myalg.R as you like. It currently looks like this:

myparam1 <- snakemake@wildcards[["myparam1"]] 
myparam2 <- snakemake@wildcards[["myparam2"]]
data <- read.csv(snakemake@input[["data"]], check.names = FALSE)

# This is a very fast way to estimate an undirected graph.
p <- ncol(data)
set.seed(as.integer(snakemake@wildcards[["replicate"]]))
start <- proc.time()[1]
adjmat <- matrix(runif(p * p), nrow = p, ncol = p) > 0.8 
adjmat <- 1 * (adjmat | t(adjmat)) # Make it symmetric (undirected)
diag(adjmat) <- 0 # No self loops
colnames(adjmat) <- names(data) # Get labels from the data
totaltime <- proc.time()[1] - start

write.csv(adjmat, file = snakemake@output[["adjmat"]], row.names = FALSE, quote = FALSE)
write(totaltime, file = snakemake@output[["time"]])
write("None", file = snakemake@output[["ntests"]]) # Number of c.i. tests

The parameters used in the first two lines above are automatically generated from the JSON object in the mylib_myalg section of config/config.json. Feel free to add or change these keys or values. To test it, add testing_myalg to, e.g., the list of ids in the benchmarks section.

{
    "id": "testing_myalg",
    "myparam1": "somevalue",
    "myparam2": [
        1,
        2
    ]
}
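One way to read the JSON object above is as a small grid of settings, assuming that a list-valued key such as myparam2 stands for several parameter settings that are run separately. That behaviour is an assumption made for illustration and is not stated above. A minimal Python sketch of such an expansion, where expand_settings is a hypothetical helper and not part of Benchpress:

```python
import itertools
import json


def expand_settings(alg_conf):
    """Expand list-valued keys into one flat dict per value combination."""
    keys = [k for k in alg_conf if k != "id"]
    # Wrap scalars in a one-element list so every key can enter the product.
    value_lists = [
        alg_conf[k] if isinstance(alg_conf[k], list) else [alg_conf[k]]
        for k in keys
    ]
    return [
        dict(zip(keys, combo), id=alg_conf["id"])
        for combo in itertools.product(*value_lists)
    ]


conf = json.loads("""
{
    "id": "testing_myalg",
    "myparam1": "somevalue",
    "myparam2": [1, 2]
}
""")

runs = expand_settings(conf)
for run in runs:
    print(run)
```

Under this assumption the object yields two settings that differ only in myparam2, both carrying the id testing_myalg.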

If you want to use another programming language or link to some of your own scripts, you can edit mylib_myalg.smk to suit your algorithm.

if "mylib_myalg" in pattern_strings:
    rule mylib_myalg:
        input:                        
            justatrigger="workflow/scripts/structure_learning_algorithms/mylib_myalg.R",
            data = alg_input_data()        
        output:
            adjmat = alg_output_adjmat_path("mylib_myalg"),
            time = alg_output_time_path("mylib_myalg"),
            ntests = alg_output_ntests_path("mylib_myalg")
        container:
            None 
        script:            
            "../scripts/structure_learning_algorithms/mylib_myalg.R"

If R is not installed on your system, you may change the container from None to "docker://r-base" in order to run the script in a Singularity container based on the r-base Docker image.

To upload your algorithm to Benchpress, you should install it in a Docker image, push it to Docker Hub, and align the algorithm with the existing ones following CONTRIBUTING.md.
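As an illustration of using another language, the core of the template R script above could be ported to Python roughly as follows. This is a sketch under assumptions, not a script shipped with Benchpress: the function names are hypothetical, and the variable labels, seed and output are hard-coded so that the example runs on its own. In a Snakemake Python script they would instead come from the snakemake object (for example snakemake.wildcards.replicate and snakemake.output.adjmat).

```python
import csv
import io
import random


def random_undirected_adjmat(p, seed, threshold=0.8):
    """Return a random symmetric 0/1 adjacency matrix without self loops."""
    rng = random.Random(seed)
    draws = [[rng.random() > threshold for _ in range(p)] for _ in range(p)]
    # Keep an edge if it was drawn in either direction, and drop the diagonal.
    return [
        [int(i != j and (draws[i][j] or draws[j][i])) for j in range(p)]
        for i in range(p)
    ]


def write_adjmat(adjmat, labels, handle):
    """Write the matrix as CSV with the variable labels as header row."""
    writer = csv.writer(handle)
    writer.writerow(labels)
    writer.writerows(adjmat)


labels = ["X1", "X2", "X3", "X4"]  # would be the column names of the data
adjmat = random_undirected_adjmat(len(labels), seed=1)

buffer = io.StringIO()  # stands in for the adjmat output file
write_adjmat(adjmat, labels, buffer)
print(buffer.getvalue())
```

Like the R template, the sketch ignores the data apart from its column labels, so it is only useful for checking that the workflow plumbing works.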

Available modules

Graph modules

Method	Graph	Language	Library	Version	Module id
randDAG	DAG,UG	R	pcalg	2.7-3	pcalg_randdag
graph.sim	DG,UG	R	BDgraph	2.64	bdgraph_graphsim
CTA [24]	DG	Python	trilearn	1.2.3	trilearn_cta
AR	DG	Python	trilearn	1.2.3	bandmat
AR random lag	DG	Python	trilearn	1.2.3	rand_bandmat
Fixed adjacency matrix	*	.csv	-	-	-

Parameter modules

Distribution	Method	Graph	Language	Library	Version	Module id
Graph Wishart	rgwish	DG, UG	R	BDgraph	2.64	bdgraph_rgwish
Hyper Dirichlet [2]	-	DG	Python	trilearn	1.2.3	trilearn_hyper-dir
Graph intra-class	-	DG	Python	trilearn	1.2.3	trilearn_intra-class
Random SEM parameters	-	DAG	R	-	-	sem_params
Random probability tables	-	DAG	R	-	-	bin_bn
Fixed bn.fit object	-	DAG	.rds	bnlearn	-	-
Fixed SEM parameter matrix	-	DAG	.csv	-	-	-

Data modules

Method	Language	Module id
I.I.D. data samples	-	iid
SEM I.I.D. data samples	Python	gcastle_iidsimulation
Fixed data file	.csv	-

Structure learning algorithms

Algorithm	Graph	Language	Library	Version	Module id
GOBNILP [3][33][34]	DAG	C	GOBNILP	#4347c64	gobnilp
ASOBS [15]	DAG	R/Java	r.blip	1.1	rblip_asobs
FGES [9]	CPDAG	Java	TETRAD	1.1.3	tetrad_fges
FCI [5]	DAG	Java	TETRAD	1.1.3	tetrad_fci
RFCI [22]	CPDAG	Java	TETRAD	1.1.3	tetrad_rfci
GFCI [21]	DAG	Java	TETRAD	1.1.3	tetrad_gfci
PC [4][5]	CPDAG	R	pcalg	2.7-3	pcalg_pc
Dual PC [31]	CPDAG	R	dualPC	4a5175d	dualpc
No tears [17]	DAG	Python	jmoss20/notears	#0c032a0	notears
No tears	DAG	Python	gCastle	1.0.3rc3	gcastle_notears
PC	CPDAG	Python	gCastle	1.0.3rc3	gcastle_pc
ANM	DAG	Python	gCastle	1.0.3rc3	gcastle_anm
Direct LiNGAM	DAG	Python	gCastle	1.0.3rc3	gcastle_direct_lingam
ICALiNGAM	DAG	Python	gCastle	1.0.3rc3	gcastle_ica_lingam
NOTEARS-MLP	DAG	Python	gCastle	1.0.3rc3	gcastle_notears_nonlinear
NOTEARS-SOM	DAG	Python	gCastle	1.0.3rc3	gcastle_notears_nonlinear
NOTEARS-LOW-RANK	DAG	Python	gCastle	1.0.3rc3	gcastle_notears_low_rank
GOLEM	DAG	Python	gCastle	1.0.3rc3	gcastle_golem
GraNDAG	DAG	Python	gCastle	1.0.3rc3	gcastle_grandag
MCSL	DAG	Python	gCastle	1.0.3rc3	gcastle_mcsl
RL	DAG	Python	gCastle	1.0.3rc3	gcastle_rl
CORL	DAG	Python	gCastle	1.0.3rc3	gcastle_corl
HC [6]	DAG	R	bnlearn	4.7	bnlearn_hc
MMHC [23]	DAG	R	bnlearn	4.7	bnlearn_mmhc
Inter-IAMB [27]	CPDAG	R	bnlearn	4.7	bnlearn_interiamb
GS [26]	DAG	R	bnlearn	4.7	bnlearn_gs
Tabu [25]	DAG	R	bnlearn	4.7	bnlearn_tabu
PC stable [4][5]	CPDAG	R	bnlearn	4.7	bnlearn_pcstable
IAMB [27]	DAG	R	bnlearn	4.7	bnlearn_iamb
Fast IAMB	DAG	R	bnlearn	4.7	bnlearn_fastiamb
IAMB FDR	DAG	R	bnlearn	4.7	bnlearn_iambfdr
MMPC	DAG	R	bnlearn	4.7	bnlearn_mmpc
SI HITON-PC	DAG	R	bnlearn	4.7	bnlearn_sihitonpc
Hybrid PC	DAG	R	bnlearn	4.7	bnlearn_hpc
H2PC	DAG	R	bnlearn	4.7	bnlearn_h2pc
RSMAX2	DAG	R	bnlearn	4.7	bnlearn_rsmax2
Iterative MCMC [28]	DAG	R	BiDAG	2.0.3	bidag_itsearch
Order MCMC [28][29]	DAG	R	BiDAG	2.0.3	bidag_order_mcmc
Partition MCMC [30]	DAG	R	BiDAG	2.0.3	bidag_partition_mcmc
PGibbs [20]	DG	Python	trilearn	1.2.3	trilearn_pgibbs
GG99 single pair [18]	DG	Java	A. Thomas	-	gg99_singlepair
GT13 multi pair [19]	DG	Java	A. Thomas	-	gt13_multipair
Parallel DG	DG	Python	parallelDG	0.3	parallelDG
GLasso [31]	UG	Python	scikit-learn	0.22.1	sklearn_glasso

Evaluation modules

Function	Language	Library	Module id
Plot data with ggpairs	R	GGally	ggally_ggpairs
Plot true graphs	-	graphviz	graph_true_plots
Plot true graphs properties	R	ggplot2	graph_true_stats
Plot estimated graphs	-	graphviz	graph_plots
Timing and ROC curves for TPR,FPR,FNR,...	R	ggplot2	benchmarks
MCMC mean graph	Python	seaborn	mcmc_heatmaps
MCMC auto-correlation	Python	pandas	mcmc_autocorr_plots
MCMC trajectory	Python	pandas	mcmc_traj_plots

Acronyms are used for Directed Acyclic Graphs (DAGs), Undirected Graphs (UGs), Decomposable Graphs (DGs), and Completed Partially DAGs (CPDAGs).
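The TPR and FPR mentioned for the benchmarks module compare an estimated graph with the true one. A minimal Python sketch of these two rates for undirected graphs given as adjacency matrices is shown below. The function names are hypothetical, and the benchmarks module may define or normalise the metrics differently, for instance when comparing directed graphs.

```python
def edge_set(adjmat):
    """Return the undirected edges (i, j), i < j, present in an adjacency matrix."""
    p = len(adjmat)
    return {
        (i, j)
        for i in range(p)
        for j in range(i + 1, p)
        if adjmat[i][j] or adjmat[j][i]
    }


def tpr_fpr(true_adjmat, est_adjmat):
    """Compute true and false positive rates over all unordered node pairs."""
    p = len(true_adjmat)
    true_edges = edge_set(true_adjmat)
    est_edges = edge_set(est_adjmat)
    positives = len(true_edges)
    negatives = p * (p - 1) // 2 - positives
    tp = len(true_edges & est_edges)  # edges found that really exist
    fp = len(est_edges - true_edges)  # edges found that do not exist
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


# True graph: chain 0 - 1 - 2. Estimate: edges 0 - 1 and 0 - 2.
true_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
estimate = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
print(tpr_fpr(true_graph, estimate))  # (0.5, 1.0)
```

Evaluating an algorithm over a range of parameter values and plotting the resulting (FPR, TPR) pairs gives a ROC-type curve.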

Citing

@misc{rios2021benchpress,
      title={Benchpress: a scalable and platform-independent workflow for benchmarking structure learning algorithms for graphical models}, 
      author={Felix L. Rios and Giusi Moffa and Jack Kuipers},
      year={2021},
      eprint={2107.03863},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}

Contact

Send an email to felix leopoldo rios at gmail com for questions.

Contributing

Contributions are very welcome. See CONTRIBUTING.md for instructions.

Fork it!
Create your feature branch: git checkout -b my-new-feature
Commit your changes: git commit -am 'Add some feature'
Push to the branch: git push origin my-new-feature
Open a pull request

License

This project is licensed under the GPL-2.0 License. See the LICENSE file for details.
