Benchpress

A Snakemake workflow to run and benchmark structure learning algorithms for probabilistic graphical models.



Benchpress [1] is a Snakemake workflow where structure learning algorithms, implemented in possibly different languages, can be executed and compared. The computations scale seamlessly on multiple cores or "... to server, cluster, grid and cloud environments, without the need to modify the workflow definition" - Snakemake. The documentation is found at https://benchpressdocs.readthedocs.io.

Benchpress provides the following main functionalities:

  • Benchmarks - Benchmark publicly available structure learning algorithms.
  • Algorithm development - Benchmark your own algorithm along with the existing ones while developing.
  • Data analysis - Estimate the underlying graph structure for your own dataset(s).

Requirements

Linux

Notes

Some systems require explicit installation of squash-tools. Using conda it can be installed as

$ conda install -c conda-forge squash-tools

macOS/Windows

Benchpress cannot run directly on macOS/Windows, as it requires Singularity, which is only supported on Linux. However, Linux (and the requirements above) can be installed on a virtual machine via e.g. VirtualBox.

Installation

Clone and install

As Benchpress is a Snakemake workflow, once the requirements are installed it needs no further installation beyond cloning the repository:

$ git clone https://github.com/felixleopoldo/benchpress.git
$ cd benchpress

Notes

If you are using a virtualiser such as VirtualBox, this folder should still be located on macOS/Windows and shared with the virtual machine. That way, all files used by Benchpress remain reachable from macOS/Windows.

Usage

Benchpress supports five different data scenarios, built by combining different sources of graphs, parameters, and data.

| Scenario | Graph | Parameters | Data |
|----------|-----------|------------|-----------|
| I | - | - | Fixed |
| II | Fixed | - | Fixed |
| III | Fixed | Fixed | Generated |
| IV | Fixed | Generated | Generated |
| V | Generated | Generated | Generated |

The directory resources/ contains the fixed graphs, parameters, and datasets that are already available. It contains, e.g., all the graphs (and corresponding parameters) from the Bayesian network repository, downloaded from the bnlearn homepage. You can also place your own files in the corresponding directories and use them in the same way as the existing ones. The methods to generate graphs, parameters, and data are listed further down.
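
A fixed graph of this kind is typically a CSV adjacency matrix whose header row carries the variable labels (the same format the template script shown further down writes). As a hedged illustration, not Benchpress code, such a file can be parsed in plain Python:

```python
import csv
import io

# A toy 3-node undirected graph in CSV adjacency format:
# header row = variable labels, body = 0/1 adjacency entries.
csv_text = "A,B,C\n0,1,0\n1,0,1\n0,1,0\n"

reader = csv.reader(io.StringIO(csv_text))
labels = next(reader)
adjmat = [[int(x) for x in row] for row in reader]

# Sanity checks: square, symmetric (undirected), no self loops.
n = len(labels)
assert all(len(row) == n for row in adjmat)
assert all(adjmat[i][j] == adjmat[j][i] for i in range(n) for j in range(n))
assert all(adjmat[i][i] == 0 for i in range(n))
```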

Example study

This study is an example of data scenario V based on three continuous datasets corresponding to three realisations of a random linear Gaussian structural equation model (SEM) with random DAG. The DAGs are sampled from a restricted Erdős–Rényi distribution using the pcalg_randdag module and the weight parameters are sampled uniformly using the sem_params module. For simplicity we use only a few structure learning modules here (bidag_itsearch, tetrad_fges, bnlearn_tabu, pcalg_pc) with different parameter settings. The full setup is found in config/config.json.

To run this study (333 jobs, roughly 10 minutes on a two-core laptop), type

$ snakemake --cores all --use-singularity --configfile config/config.json

The following plots are generated by the benchmarks module:

(Benchmark plots: ROC-type curves and an F1 plot)

These plots are generated by the graph_plots module:

(True adjacency matrix, estimated adjacency matrix, adjacency difference matrix, and bnlearn graphviz.compare plots)

Getting started with your own structure learning algorithm

You can easily use your own algorithm in Benchpress regardless of the programming language. Perhaps the best way to get started is to use and alter the template R script mylib_myalg.R as you like. It currently looks like this:

# Parameters and data are passed from Snakemake via wildcards and inputs
myparam1 <- snakemake@wildcards[["myparam1"]]
myparam2 <- snakemake@wildcards[["myparam2"]]
data <- read.csv(snakemake@input[["data"]], check.names = FALSE)

# This is a very fast way to estimate an undirected graph.
p <- ncol(data)
set.seed(as.integer(snakemake@wildcards[["replicate"]]))
start <- proc.time()[1]
adjmat <- matrix(runif(p * p), nrow = p, ncol = p) > 0.8 
adjmat <- 1 * (adjmat | t(adjmat)) # Make it symmetric (undirected)
diag(adjmat) <- 0 # No self loops
colnames(adjmat) <- names(data) # Get labels from the data
totaltime <- proc.time()[1] - start

write.csv(adjmat, file = snakemake@output[["adjmat"]], row.names = FALSE, quote = FALSE)
write(totaltime, file = snakemake@output[["time"]])
write("None", file = snakemake@output[["ntests"]]) # Number of c.i. tests
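
Since algorithms may be written in any language, the same template logic could equally be ported to, say, Python. The sketch below mirrors the R script's random undirected "estimator" (a hypothetical standalone version without the Snakemake glue; function name and threshold are illustrative):

```python
import random
import time


def random_undirected_adjmat(labels, seed, threshold=0.8):
    """Mirror of the template's fast 'estimator': a random symmetric
    0/1 adjacency matrix with no self loops."""
    rng = random.Random(seed)
    p = len(labels)
    a = [[1 if rng.random() > threshold else 0 for _ in range(p)] for _ in range(p)]
    # Symmetrize (undirected) and clear the diagonal, as in the R script.
    for i in range(p):
        a[i][i] = 0
        for j in range(i + 1, p):
            a[i][j] = a[j][i] = 1 if (a[i][j] or a[j][i]) else 0
    return a


start = time.process_time()
adjmat = random_undirected_adjmat(["A", "B", "C", "D"], seed=1)
totaltime = time.process_time() - start
```

In a real module, adjmat and totaltime would then be written to the paths Snakemake provides, just as the R template does.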

The parameters used in the first two lines above are automatically generated from the JSON object in the mylib_myalg section of config/config.json. Feel free to add or change these keys or values. To test it, add testing_myalg to the list of ids in the benchmarks section, for example.

{
    "id": "testing_myalg",
    "myparam1": "somevalue",
    "myparam2": [
        1,
        2
    ]
}
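
List-valued parameters such as myparam2 above yield one run per value: every combination of parameter values is expanded into Snakemake wildcards. A rough sketch of that expansion (illustrative only, not Benchpress's actual code):

```python
from itertools import product

conf = {"id": "testing_myalg", "myparam1": "somevalue", "myparam2": [1, 2]}

# Normalize scalars to one-element lists, then take the Cartesian product
# of all non-id parameters -- one structure-learning run per combination.
params = {k: v if isinstance(v, list) else [v] for k, v in conf.items() if k != "id"}
runs = [dict(zip(params, combo)) for combo in product(*params.values())]
# Two runs here: one with myparam2 = 1 and one with myparam2 = 2.
```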

If you want to use another programming language or link to some of your own scripts, you can edit mylib_myalg.smk to suit your algorithm.

if "mylib_myalg" in pattern_strings:
    rule mylib_myalg:
        input:                        
            justatrigger="workflow/scripts/structure_learning_algorithms/mylib_myalg.R",
            data = alg_input_data()        
        output:
            adjmat = alg_output_adjmat_path("mylib_myalg"),
            time = alg_output_time_path("mylib_myalg"),
            ntests = alg_output_ntests_path("mylib_myalg")
        container:
            None 
        script:            
            "../scripts/structure_learning_algorithms/mylib_myalg.R"

If R is not installed on your system, you may change the container from None to "docker://r-base" in order to run the script in a Singularity container based on the r-base Docker image.

To contribute your algorithm to Benchpress, you should install it in a Docker image, push the image to Docker Hub, and align the algorithm with the existing ones following CONTRIBUTING.md.

Available modules

Graph modules

| Method | Graph | Language | Library | Version | Module id |
|--------|-------|----------|---------|---------|-----------|
| randDAG | DAG, UG | R | pcalg | 2.7-3 | pcalg_randdag |
| graph.sim | DG, UG | R | BDgraph | 2.64 | bdgraph_graphsim |
| CTA [24] | DG | Python | trilearn | 1.2.3 | trilearn_cta |
| AR | DG | Python | trilearn | 1.2.3 | bandmat |
| AR random lag | DG | Python | trilearn | 1.2.3 | rand_bandmat |
| Fixed adjacency matrix | * | .csv | - | - | - |

Parameter modules

| Distribution | Method | Graph | Language | Library | Version | Module id |
|--------------|--------|-------|----------|---------|---------|-----------|
| Graph Wishart | rgwish | DG, UG | R | BDgraph | 2.64 | bdgraph_rgwish |
| Hyper Dirichlet [2] | - | DG | Python | trilearn | 1.2.3 | trilearn_hyper-dir |
| Graph intra-class | - | DG | Python | trilearn | 1.2.3 | trilearn_intra-class |
| Random SEM parameters | - | DAG | R | - | - | sem_params |
| Random probability tables | - | DAG | R | - | - | bin_bn |
| Fixed bn.fit object | - | DAG | .rds | bnlearn | - | - |
| Fixed SEM parameter matrix | - | DAG | .csv | - | - | - |

Data modules

| Method | Language | Module id |
|--------|----------|-----------|
| I.I.D. data samples | - | iid |
| SEM I.I.D. data samples | Python | gcastle_iidsimulation |
| Fixed data file | .csv | - |

Structure learning algorithms

| Algorithm | Graph | Language | Library | Version | Module id |
|-----------|-------|----------|---------|---------|-----------|
| GOBNILP [3][33][34] | DAG | C | GOBNILP | #4347c64 | gobnilp |
| ASOBS [15] | DAG | R/Java | r.blip | 1.1 | rblip_asobs |
| FGES [9] | CPDAG | Java | TETRAD | 1.1.3 | tetrad_fges |
| FCI [5] | DAG | Java | TETRAD | 1.1.3 | tetrad_fci |
| RFCI [22] | CPDAG | Java | TETRAD | 1.1.3 | tetrad_rfci |
| GFCI [21] | DAG | Java | TETRAD | 1.1.3 | tetrad_gfci |
| PC [4][5] | CPDAG | R | pcalg | 2.7-3 | pcalg_pc |
| Dual PC [31] | CPDAG | R | dualPC | 4a5175d | dualpc |
| No tears [17] | DAG | Python | jmoss20/notears | #0c032a0 | notears |
| No tears | DAG | Python | gCastle | 1.0.3rc3 | gcastle_notears |
| PC | CPDAG | Python | gCastle | 1.0.3rc3 | gcastle_pc |
| ANM | DAG | Python | gCastle | 1.0.3rc3 | gcastle_anm |
| Direct LiNGAM | DAG | Python | gCastle | 1.0.3rc3 | gcastle_direct_lingam |
| ICALiNGAM | DAG | Python | gCastle | 1.0.3rc3 | gcastle_ica_lingam |
| NOTEARS-MLP | DAG | Python | gCastle | 1.0.3rc3 | gcastle_notears_nonlinear |
| NOTEARS-SOM | DAG | Python | gCastle | 1.0.3rc3 | gcastle_notears_nonlinear |
| NOTEARS-LOW-RANK | DAG | Python | gCastle | 1.0.3rc3 | gcastle_notears_low_rank |
| GOLEM | DAG | Python | gCastle | 1.0.3rc3 | gcastle_golem |
| GraNDAG | DAG | Python | gCastle | 1.0.3rc3 | gcastle_grandag |
| MCSL | DAG | Python | gCastle | 1.0.3rc3 | gcastle_mcsl |
| RL | DAG | Python | gCastle | 1.0.3rc3 | gcastle_rl |
| CORL | DAG | Python | gCastle | 1.0.3rc3 | gcastle_corl |
| HC [6] | DAG | R | bnlearn | 4.7 | bnlearn_hc |
| MMHC [23] | DAG | R | bnlearn | 4.7 | bnlearn_mmhc |
| Inter-IAMB [27] | CPDAG | R | bnlearn | 4.7 | bnlearn_interiamb |
| GS [26] | DAG | R | bnlearn | 4.7 | bnlearn_gs |
| Tabu [25] | DAG | R | bnlearn | 4.7 | bnlearn_tabu |
| PC stable [4][5] | CPDAG | R | bnlearn | 4.7 | bnlearn_pcstable |
| IAMB [27] | DAG | R | bnlearn | 4.7 | bnlearn_iamb |
| Fast IAMB | DAG | R | bnlearn | 4.7 | bnlearn_fastiamb |
| IAMB FDR | DAG | R | bnlearn | 4.7 | bnlearn_iambfdr |
| MMPC | DAG | R | bnlearn | 4.7 | bnlearn_mmpc |
| SI HITON-PC | DAG | R | bnlearn | 4.7 | bnlearn_sihitonpc |
| Hybrid PC | DAG | R | bnlearn | 4.7 | bnlearn_hpc |
| H2PC | DAG | R | bnlearn | 4.7 | bnlearn_h2pc |
| RSMAX2 | DAG | R | bnlearn | 4.7 | bnlearn_rsmax2 |
| Iterative MCMC [28] | DAG | R | BiDAG | 2.0.3 | bidag_itsearch |
| Order MCMC [28][29] | DAG | R | BiDAG | 2.0.3 | bidag_order_mcmc |
| Partition MCMC [30] | DAG | R | BiDAG | 2.0.3 | bidag_partition_mcmc |
| PGibbs [20] | DG | Python | trilearn | 1.2.3 | trilearn_pgibbs |
| GG99 single pair [18] | DG | Java | A. Thomas | - | gg99_singlepair |
| GT13 multi pair [19] | DG | Java | A. Thomas | - | gt13_multipair |
| Parallel DG | DG | Python | parallelDG | 0.3 | parallelDG |
| GLasso [31] | UG | Python | scikit-learn | 0.22.1 | sklearn_glasso |

Evaluation modules

| Function | Language | Library | Module id |
|----------|----------|---------|-----------|
| Plot data with ggpairs | R | GGally | ggally_ggpairs |
| Plot true graphs | - | graphviz | graph_true_plots |
| Plot true graph properties | R | ggplot2 | graph_true_stats |
| Plot estimated graphs | - | graphviz | graph_plots |
| Timing and ROC curves for TPR, FPR, FNR, ... | R | ggplot2 | benchmarks |
| MCMC mean graph | Python | seaborn | mcmc_heatmaps |
| MCMC auto-correlation | Python | pandas | mcmc_autocorr_plots |
| MCMC trajectory | Python | pandas | mcmc_traj_plots |

Acronyms are used for Directed Acyclic Graphs (DAGs), Undirected Graphs (UGs), Decomposable Graphs (DGs), and Completed Partially DAGs (CPDAGs).
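
The benchmarks module reports rates such as TPR and FPR. As a simplified illustration of what such metrics mean (not the module's implementation, which must also handle e.g. CPDAGs and undirected graphs), they can be computed by comparing true and estimated adjacency matrices edgewise:

```python
def edge_rates(true_adj, est_adj):
    """True/false positive rates of estimated edges vs. the true graph,
    counting directed entries of the adjacency matrices."""
    p = len(true_adj)
    tp = fp = fn = tn = 0
    for i in range(p):
        for j in range(p):
            if i == j:
                continue  # ignore the diagonal (no self loops)
            t, e = true_adj[i][j], est_adj[i][j]
            tp += t and e
            fp += (not t) and e
            fn += t and (not e)
            tn += (not t) and (not e)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


true_adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # true DAG: 0->1, 1->2
est_adj = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]   # estimate: 0->1, 2->0
# One true positive, one missed edge, one spurious edge.
```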

Citing

@misc{rios2021benchpress,
      title={Benchpress: a scalable and platform-independent workflow for benchmarking structure learning algorithms for graphical models}, 
      author={Felix L. Rios and Giusi Moffa and Jack Kuipers},
      year={2021},
      eprint={2107.03863},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}

Contact

Send an email to felix leopoldo rios at gmail com for questions.

Contributing

Contributions are very welcome. See CONTRIBUTING.md for instructions.

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Open a pull request

License

This project is licensed under the GPL-2.0 License - see the LICENSE file for details.

References