parSMURF

This package contains parSMURF, a High Performance Computing imbalance-aware machine learning tool for the genome-wide detection of pathogenic variants.


Table of Contents

Overview
Requirements
Downloading and compiling
General architecture
Running parSMURF
	Command line options
	Running parSMURF1
	Running parSMURFn
	Running the Bayesian optimizer
	Configuration file
		name
		exec
		data
		simulate
		folds
		params
		autogp_params
Data Format
	Data file format
	Label file format
	Fold file format
	Output file format
Random dataset generation
Examples
License

Overview

parSMURF is a fast and scalable C++ implementation of the HyperSMURF algorithm - hyper-ensemble of SMOTE Undersampled Random Forests - an ensemble approach explicitly designed to deal with the huge imbalance between deleterious and neutral variants.

The algorithm is outlined in the following papers:
A. Petrini, M. Mesiti, M. Schubach, M. Frasca, D. Danis, M. Re, G.Grossi, L. Cappelletti, T. Castrignanò, P. N. Robinson, and G. Valentini, "parSMURF, a high-performance computing tool for the genome-wide detection of pathogenic variants", GigaScience, vol. 9, 05 2020. giaa052. https://doi.org/10.1093/gigascience/giaa052

M. Schubach, M. Re, P. N. Robinson, and G. Valentini, "Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants", Scientific Reports, 2017/06/07.
https://www.nature.com/articles/s41598-017-03011-5

Two variants of parSMURF are currently available in this repository:

  • "parSMURF1" is a fast multi-threaded implementation of the algorithm and is meant to be run on a single machine
  • "parSMURFn" is a multi-threaded and parallel implementation (under the MPI programming paradigm) and is meant to be run on a single machine or on a cluster

Both versions share the same design and functionalities outlined in the paper, in particular:

  • fast, optimized and scalable C++ implementation
  • auto tuning of the learning parameters by grid search or by means of a Bayesian optimizer

Requirements

parSMURF is designed for x86-64 and Intel Xeon Phi architectures running Linux OSes.
This software is distributed as source code.

A compiler which supports the C++11 language specification is required. It has been tested with GCC (vers. >= 5) and Intel CC (2015, 2017 and 2019).
Code is also optimized for Intel Xeon Phi architectures, and it has been successfully tested on Knights Landing family processors.

Multithreading and multiprocessing are managed differently in parSMURF1 and parSMURFn: the former is a multithread-only implementation and thread management is performed through OpenMP APIs. Any reasonably recent compiler has its specification already built-in, hence this requirement is usually met. parSMURFn, instead, is a multiprocess and multithread implementation of the algorithm. Thread management is performed by the Linux built-in pthread library and multiprocessing is performed through the MPI APIs. Hence, for compilation and running, parSMURFn requires an implementation of the MPI standard. It has been tested with OpenMPI 1.10.3, OpenMPI 2.0, IntelMPI 2016, IntelMPI 2017 and IntelMPI 2019.

Notice that MPI is not required for parSMURF1, hence if no MPI libraries are found on the target system, it is still possible to compile and run this version of the software.

On Ubuntu, it is possible to install the OpenMPI library via apt package manager:

sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev

Makefiles are generated by the cmake (vers. >= 2.8) utility. On Ubuntu it is possible to install this package via apt:

sudo apt-get install cmake

Bayesian optimization is done by the Spearmint package. This package requires Python 2 and depends on several Python packages. The best way to use this feature is to create and configure a Python virtual environment and install the required Python packages there. On Ubuntu:

sudo apt-get install virtualenv
<move to an appropriate folder>
virtualenv parSMURFvenv -p /usr/bin/python2	#This command creates a parSMURFvenv directory
source parSMURFvenv/bin/activate		#This command activates the virtual environment
pip install numpy==1.13.0			#The following commands install the required packages in the virtual environment
pip install scipy==1.2.1
pip install weave==0.17.0
pip install six==1.12.0
pip install protobuf==3.7.1
deactivate					#Deactivate the virtual environment

parSMURF uses several external libraries that are included as source code in this repository or are automatically downloaded and compiled. In particular, the following libraries are included:

  • ANN: A Library for Approximate Nearest Neighbor Searching, by David M. Mount and Sunil Arya, Version 1.1.2. The modified version is supplied in the src/ann_1.1.2 directory. This version has been adapted for multi-thread execution, since the original package available at https://www.cs.umd.edu/~mount/ANN/ is not thread safe and is not compatible with this package.
  • Ranger: A Fast Implementation of Random Forests, by Marvin N. Wright, Version 0.11.2. The modified version, stripped of the R code, is supplied in the src/ranger directory. The main codebase is located at https://github.com/imbs-hl/ranger
  • Spearmint, a Python package to perform Bayesian optimization, by Jasper Snoek. The original version at https://github.com/JasperSnoek/spearmint seems no longer maintained and needed a few updates to run on parSMURF.

The following libraries are not included in this code repository, but are automatically downloaded during the compilation process:

  • easylogging++: A single header C++ logging library, by Zuhd Web Services. Automatically cloned in src/easyloggingpp and compiled from https://github.com/zuhd-org/easyloggingpp
  • jsoncons: A C++, header-only library for constructing JSON and JSON-like text and binary data formats, by Daniel Parker. Automatically cloned in src/jsoncons and compiled from https://github.com/danielaparker/jsoncons
  • zlib: A massively spiffy yet delicately unobtrusive compression library, by Jean-loup Gailly and Mark Adler. Automatically cloned in src/zlib and compiled from https://github.com/madler/zlib

All the libraries have been modified and redistributed according to their own licenses. For each included library, a copy of the associated license is contained in each library folder.


Downloading and compiling

Download the latest version from this page or clone the git repository altogether:

git clone https://github.com/anacletolab/parSMURF

Once the package has been downloaded, move to the main directory, create a build dir, invoke cmake and build the software (-j n make option enables multithread compilation over n threads):

cd parSMURF
mkdir build
cd build
cmake ../src
make -j 4

This will generate two executables: "parSMURF1" and "parSMURFn".

For a quick test, launch the following command from the build directory:

./parSMURF1 --cfg ../cfgEx/simulCV.json

General architecture

While both versions strictly follow the paper and its original R implementation (available on the CRAN repository https://cran.r-project.org/web/packages/hyperSMURF/index.html), the novelties of this package reside in the fast C++ code and in the parallel execution, which lead to a dramatic decrease in computing time while keeping the same prediction quality as the original implementation. The package also features two different approaches for automatically finding the best learning parameters.

Hence, execution roughly follows this scheme:

- data reading from file(s) (or random dataset generation)
- folds and partitions generation [by index!]
- for each fold
---- for each partition in the current fold
---- ---- over-sampling of the minority class and under-sampling of the majority class
---- ---- random forest training
---- ---- random forest test
---- prediction accumulation
- prediction averaging

Results are evaluated according to an n-fold validation process. Folds can be randomly generated (the user is free to specify the number of folds) or can be read from a file. When randomly generated, folds are stratified, i.e. the generation algorithm tries to evenly distribute the number of positive examples amongst the folds.
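
The stratified fold generation described above can be illustrated with a short Python sketch. This is not the code used by parSMURF: the function name stratified_folds and the round-robin dealing strategy are assumptions made for illustration, and the C++ implementation may distribute examples differently.

```python
import random

def stratified_folds(labels, n_folds, seed=None):
    """Assign each example a fold index in [0, n_folds).

    Positives (label 1) and negatives (label 0) are shuffled separately and
    dealt round-robin, so every fold receives a similar share of each class.
    """
    rng = random.Random(seed)
    folds = [0] * len(labels)
    for cls in (1, 0):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Dealing the shuffled indices like cards keeps per-class fold sizes
        # within one example of each other.
        for pos, i in enumerate(idx):
            folds[i] = pos % n_folds
    return folds

# Hypothetical, highly unbalanced toy dataset: 6 positives, 94 negatives.
labels = [1] * 6 + [0] * 94
folds = stratified_folds(labels, n_folds=3, seed=1)
positives_per_fold = [
    sum(1 for f, y in zip(folds, labels) if f == k and y == 1) for k in range(3)
]
print(positives_per_fold)  # [2, 2, 2]
```

With very few positives, this kind of stratification guarantees that no fold is left without minority examples as long as the number of positives is at least the number of folds.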

Parallelization happens at partition level: since the SMOTE algorithm and the subsequent RF train and test stages are almost embarrassingly parallel inside each fold, (i.e. they require the same operations to be performed on different data, with no synchronization points or data communication involved) these steps can be executed concurrently for each partition belonging to the same fold.

In parSMURF1, this process is parallelized by means of multi-threading. As an example, if the user specifies x partitions and y processing threads, each thread is assigned x/y partitions, which it processes sequentially. If enough CPU cores are available, the threads execute concurrently, leading to an almost linear speed-up, especially on CPUs with a high number of cores, like the Intel Xeon Phi family of processors.

Parallelization in parSMURFn follows the same model, which is further expanded to exploit the computational power of several processing nodes in a cluster. The execution scheme follows a simple master-slave model, where a single master MPI process reads the data from file and delegates the processing of each partition (SMOTE and RF steps) to k working MPI processes. The master process also manages the collection and accumulation of the predictions from the working processes. Moreover, as in parSMURF1, processing of the partitions in each working process is parallelized by means of multi-threading.

As an example, suppose that the user specifies x partitions, k working processes and y processing threads for each working process. The master process assigns "chunks" of x/k partitions to each working process and sends them the relevant data for the computation. Inside each working process, each chunk is further divided amongst the thread pool, and each thread is assigned (x/k)/y partitions. Predictions for each chunk are locally accumulated inside each working process and are sent back to the master process only once the work for the chunk is finished.
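
The two-level subdivision (partitions to working processes, then to threads) can be sketched in Python as follows. The helper names split_evenly and assign_partitions are hypothetical, and the static, contiguous assignment shown here is an assumption: the actual scheduler in parSMURFn may hand out work differently.

```python
def split_evenly(items, n_groups):
    """Split items into n_groups contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups

def assign_partitions(n_parts, n_workers, n_threads):
    """Return plan[worker][thread] = list of partition indices."""
    partitions = list(range(n_parts))
    return [
        split_evenly(chunk, n_threads)
        for chunk in split_evenly(partitions, n_workers)
    ]

# x = 100 partitions, k = 4 working processes, y = 6 threads per process.
plan = assign_partitions(n_parts=100, n_workers=4, n_threads=6)
sizes = [[len(thread) for thread in worker] for worker in plan]
print(sizes[0])  # [5, 4, 4, 4, 4, 4]
```

When x is not a multiple of k, or x/k is not a multiple of y, some threads necessarily process one partition more than others, which is why the speed-up is described as almost linear.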

Several strategies have been used to minimize latencies due to data transmission or broadcasting between the master and working processes, including:

  • the master process sends only the data strictly needed for the computation of each partition; moreover, it is sent as a single big array with a header, instead of several small chunks.
  • sends and receives in the master process are managed in two different threads, hence interleaving data preparation and transmission with data reception.
  • sends in the master process can be single- or multi-threaded: in the latter case, the master process spawns a number of threads equal to the number of working processes, and each of these threads prepares the data and sends it to the corresponding worker, concurrently. This is the default operation mode, but it may be memory consuming, therefore the "noMtSender" configuration option is provided to disable it.

parSMURF features two subsystems for the automatic fine tuning of the learning parameters, aimed at maximizing the prediction performance of the algorithm. The first strategy is an exhaustive grid search: given a set of values for each hyper-parameter, the set of all the possible combinations of hyper-parameters is calculated, and each combination is evaluated through internal cross validation. The other strategy is Bayesian optimization: given a range for each hyper-parameter, the Bayesian optimizer generates a sequence of candidates that tends towards a probable global maximum. A high-level view of the execution is given by this pseudo-code snippet:

iter = 0
- while (iter < maxIter) and (error > tolerance):
-- BO generates a new possible candidate of hyper-parameters h
-- evaluation of h in a context of internal cross validation
-- submit (h, AUPRC(h)) to the BO
-- iter <- iter + 1

Both strategies are performed in a context of internal cross validation, hence the tuning is repeated for each fold of the external CV. The output of the procedure is the set of best learning parameters for each fold of the external cross validation.
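
As an illustration of the grid search strategy, the following Python sketch enumerates all the hyper-parameter combinations from a "params"-like dictionary and picks the best one according to a scoring function. The values in the dictionary are hypothetical, and evaluate is only a stand-in for the internal cross validation (which in parSMURF returns the AUPRC of a combination); none of this is the package's actual code.

```python
from itertools import product

def grid_combinations(params):
    """Return every combination of the listed values, one dict per combination."""
    names = list(params)
    return [
        dict(zip(names, values))
        for values in product(*(params[name] for name in names))
    ]

def best_combination(combos, evaluate):
    """Return the combination with the highest score given by evaluate(combo)."""
    return max(combos, key=evaluate)

# Hypothetical search grid: three parameters are tuned, three are fixed.
params = {
    "nParts": [10, 20, 30],
    "fp": [1, 2, 3],
    "ratio": [1],
    "k": [5],
    "nTrees": [10],
    "mtry": [2, 5],
}
combos = grid_combinations(params)
print(len(combos))  # 18

# Toy scoring function standing in for the internal cross validation.
best = best_combination(combos, lambda h: -abs(h["nParts"] - 20) - h["fp"])
print(best["nParts"], best["fp"])  # 20 1
```

The number of combinations is the product of the lengths of the arrays, so the cost of an exhaustive search grows quickly with the number of tuned parameters; this is the main motivation for the Bayesian alternative.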


Running parSMURF

parSMURF is a command line executable.
All the options are submitted to the main executable through a configuration file written in JSON format.

Command line options

Only two command line options are available, since every other parameter or option is defined by json configuration files.
--cfg <filename> specifies the configuration file for the run
--help prints a brief help screen

Running parSMURF1

parSMURF1 does not require anything special to run, besides a proper configuration file. Hence, it can be launched as follows:

./parSMURF1 --cfg <configFile.json>

Running parSMURFn

parSMURFn requires MPI to be installed on the target system or in all the nodes of a cluster. It must be invoked with mpirun or, depending on the scheduling system installed on the cluster, with a proper mpirun wrapper.
The -n option of mpirun also specifies how many processes have to be launched. parSMURFn requires at least two processes, one as master and one as worker. As an example:

mpirun -n 5 ./parSMURFn --cfg <configFile.json>

launches an instance of parSMURFn over 5 processes (one master and four workers).
As of now, the number of master processes is limited to one.
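
When runs are launched from a script, the relation between the number of workers and the -n value can be made explicit. The following Python sketch only builds the command line; the function name parsmurfn_command is hypothetical, and clusters that use a scheduler-specific mpirun wrapper would need a different launcher.

```python
import shlex

def parsmurfn_command(n_workers, cfg_path, executable="./parSMURFn"):
    """Build the mpirun command line for one master plus n_workers workers."""
    if n_workers < 1:
        raise ValueError("parSMURFn needs at least one worker process")
    # One extra process is reserved for the single master.
    return ["mpirun", "-n", str(n_workers + 1), executable, "--cfg", cfg_path]

cmd = parsmurfn_command(4, "simulCVn.json")
print(" ".join(shlex.quote(part) for part in cmd))
# mpirun -n 5 ./parSMURFn --cfg simulCVn.json
```

The returned list can be passed as-is to subprocess.run.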

Running the Bayesian optimizer

Using the Bayesian optimizer requires some additional setup; we are working on making the whole procedure more user friendly.
As noted in the "Requirements" section, it is preferable to set up a Python virtual environment and launch parSMURF1 or parSMURFn from there.
Also, the entire src/spearmint folder must be copied to the same directory as the parSMURF executable.
As a final requirement, the PYTHONPATH environment variable must contain the path to the Spearmint folder.

As an example, assume that the git repository has been copied to /home/user01/git/parSMURF and the package has been successfully compiled in the /home/user01/git/parSMURF/build directory. Also, assume that a Python virtual environment has been created and is located at /home/user01/pythonVenvs/parSMURFvenv.
To prepare a folder containing everything parSMURF needs to run, do the following:

mkdir /home/user01/parSMURFexp
cd /home/user01/parSMURFexp
cp /home/user01/git/parSMURF/build/parSMURF1 .
cp /home/user01/git/parSMURF/build/parSMURFn .
cp -r /home/user01/git/parSMURF/src/spearmint .

Now for launching an experiment with the Bayesian optimizer, do the following:

cd /home/user01/parSMURFexp
export PYTHONPATH=$PYTHONPATH:/home/user01/parSMURFexp/spearmint/spearmint:/home/user01/parSMURFexp/spearmint/spearmint/spearmint
source /home/user01/pythonVenvs/parSMURFvenv/bin/activate
<launch parSMURF1 or parSMURFn as stated earlier>
deactivate

Configuration file

parSMURF1 and parSMURFn use configuration files in json format for setting the parameters of each run.
Examples of configuration files are available in the cfgEx folder of the repository.

A configuration file is composed of seven top-level fields:

{
	"name": ...,
	"exec": {...},
	"data": {...},
	"folds": {...},
	"simulate": {...},
	"params": {...},
	"autogp_params": {...}
}

Depending on the configuration itself, some dictionaries are not mandatory and can be left out.
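
As a concrete illustration, the following Python sketch writes out a small configuration for a simulated cross validation run, leaving out the fields that are not needed in that mode. The values are loosely based on the simulCV.json example described in the "Examples" section, but this is not a copy of that file: the function name, the "name" string and the output file name are hypothetical, and placing "optimizer" inside "exec" is an assumption based on where that field is documented.

```python
import json

def simulation_config(n, m, prob, n_folds, out_file,
                      ens_threads=4, rf_threads=1, seed=1):
    """Build a configuration dictionary for a simulated cross validation run."""
    return {
        "name": "simulated cross validation",
        "exec": {
            "ensThrd": ens_threads,   # threads for partition processing
            "rfThrd": rf_threads,     # threads for random forest train/test
            "seed": seed,
            "saveTime": True,
            "timeFile": "timeout.txt",
            "mode": "cv",
            "optimizer": "no",        # assumed to live in "exec"
        },
        # dataFile and labelFile are not needed when simulation is enabled.
        "data": {"outFile": out_file},
        "simulate": {"simulation": True, "prob": prob, "n": n, "m": m},
        "folds": {"nFolds": n_folds},
        # Every learning parameter is passed as an array, even when it holds
        # a single value.
        "params": {
            "nParts": [10], "fp": [1], "ratio": [1],
            "k": [5], "nTrees": [10], "mtry": [5],
        },
    }

cfg = simulation_config(n=1200, m=25, prob=0.02, n_folds=10,
                        out_file="predictions.txt")
text = json.dumps(cfg, indent=4)
print(text)
```

The "autogp_params" field is omitted because it is only required when the Bayesian optimizer is selected.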

"name"
	"name": string

Mandatory: no
Exec: parSMURF1 / parSMURFn
A string for labeling the name of the experiment

"exec"
	"exec": {
		"name": string,
		"nProcs": int,
		"ensThrd": int,
		"rfThrd": int,
		"noMtSender": bool,
		"seed": int,
		"verboseLevel": int,
		"verboseMPI": bool,
		"saveTime": bool,
		"timeFile": string,
		"printCfg": bool,
		"mode": string
},

Mandatory: yes
Exec: parSMURF1 / parSMURFn
General configuration of the run.

	"name": string

Mandatory: No
Exec: parSMURF1 / parSMURFn
Label used for marking the name of the executable (parSMURF1 or parSMURFn). It does not affect the computation itself, since this field is ignored by the json parser

	"nProcs": int

Mandatory: No
Exec: parSMURFn
Label used for marking the number of processes for a run of parSMURFn. It does not affect the computation itself, since the total number of processes is detected at runtime by the MPI APIs.

	"ensThrd": int

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
Number of threads assigned to perform the partition processing.

	"rfThrd": int

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
Number of threads assigned to perform the random forest train and test.

	"noMtSender": bool

Mandatory: No
Exec: parSMURFn
This option disables multithreading in the master process. It may affect performance, but it may be necessary when processing particularly large datasets.

	"seed": int

Mandatory: No
Exec: parSMURF1 / parSMURFn
Optional seed for the random number generators. If unspecified, a random seed is generated.

	"verboseLevel": int

Mandatory: No
Exec: parSMURF1 / parSMURFn
Level of verbosity on stdout and on the logfile of the computational task. Range is 0-3 (default: 0).

	"verboseMPI": bool

Mandatory: No
Exec: parSMURFn
Logs the calls to the MPI APIs on stdout and in the logfile. (Default: false)

	"saveTime": bool

Mandatory: No
Exec: parSMURF1 / parSMURFn
Option for saving a report of the computation time of the run (Default: false)

	"timeFile": string

Mandatory: Yes, if "saveTime" is set to true
Exec: parSMURF1 / parSMURFn
File name for saving the execution time report

	"printCfg": bool

Mandatory: No
Exec: parSMURF1 / parSMURFn
Option for printing a detailed description of the run before it starts (Default: false)

	"mode": string

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
Execution mode. Allowed strings are:
"cv": Dataset is split in folds, and evaluated in a process of k-fold cross validation. The run returns a set of predictions (default).
"train": The whole dataset is treated as training set. The run returns a folder of trained models for later usage.
"test": The whole dataset is treated as test set. It is mandatory to submit a directory of trained models to perform the evaluation. The run returns a set of predictions.
Note that the autotuning of the learning parameters is available only for "cv" mode

	"optimizer": string

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
Hyper-parameter optimization mode. Allowed strings are:
"no": external cross-validation only (default)
"grid": automatic tuning of the learning parameters by grid search in the internal cross validation loop
"autogp": automatic tuning of the learning parameters by Bayesian optimization (Gaussian process) in the internal cross validation loop

"data"
"data": {
	"dataFile": string,
	"foldFile": string,
	"labelFile": string,
	"outFile": string,
	"forestDir": string
}

Mandatory: yes
Exec: parSMURF1 / parSMURFn
This field contains all the required information for accessing data from and to the system.

	"dataFile": string

Mandatory: Yes (No if simulation mode is enabled)
Exec: parSMURF1 / parSMURFn
Input data file

	"foldFile": string

Mandatory: No
Exec: parSMURF1 / parSMURFn
Optional input file containing the fold division of the dataset

	"labelFile": string

Mandatory: Yes (No, if simulation mode is enabled)
Exec: parSMURF1 / parSMURFn
Input file containing the labels of the examples of the dataset

	"outFile": string

Mandatory: Yes (No, if in train mode)
Exec: parSMURF1 / parSMURFn
Output file containing the output predictions

	"forestDir": string

Mandatory: No (Yes, if in train mode)
Exec: parSMURF1 / parSMURFn
Output directory for saving the trained models. Must be a valid directory on the filesystem.

"simulate"
"simulate": {
	"simulation": bool,
	"prob": float,
	"n": int,
	"m": int
},

Mandatory: no
Exec: parSMURF1 / parSMURFn
This field contains all the required information for enabling the internal dataset generator

	"simulation": bool

Mandatory: No
Exec: parSMURF1 / parSMURFn
On true, it enables the internal dataset generator. The fields "dataFile", "foldFile" and "labelFile" are ignored and a random dataset is generated.

	"prob": float

Mandatory: Yes if simulation mode is enabled
Exec: parSMURF1 / parSMURFn
This field represents the probability of generating a positive example. Must be a float in the [0,1] range, possibly very small for simulating highly unbalanced datasets

	"n": int

Mandatory: Yes if simulation mode is enabled
Exec: parSMURF1 / parSMURFn
Number of examples to be generated

	"m": int

Mandatory: Yes if simulation mode is enabled
Exec: parSMURF1 / parSMURFn
Number of features to be generated

"folds"
"folds": {
	"nFolds": int,
	"startingFold": int,
	"endingFold": int
}

Mandatory: Yes (No, if "foldFile" specified)
Exec: parSMURF1 / parSMURFn
This section specifies the fold subdivision and the folds on which the run is executed.

	"nFolds": int

Mandatory: Yes (No, if "foldFile" specified)
Exec: parSMURF1 / parSMURFn
This field specifies the number of folds into which the dataset is subdivided. Ignored if "foldFile" has been declared.

	"startingFold": int,
	"endingFold": int

Mandatory: No
Exec: parSMURF1 / parSMURFn
These fields specify the starting and ending fold that parSMURF has to evaluate. This is useful for parallelizing runs across different folds. If unspecified, parSMURF performs the evaluation of the predictions on all folds.

"params"
"params": {
	"nParts": array of int,
	"fp": array of int,
	"ratio": array of int,
	"k": array of int,
	"nTrees": array of int,
	"mtry": array of int
},

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
This field contains the learning parameters for the run. All values must be passed as arrays.
When "optimizer" is set to "no", only one combination is used for the run.
When "optimizer" is set to "grid", parSMURF generates all the possible hyper-parameter combinations and evaluates them in the internal CV loop.
For a deeper explanation of each parameter, please refer to the article.

	"nParts": array of int

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
Number of partitions (ensembles)
Default: 10

	"fp": array of int

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
Over-sampling factor (0 disables over-sampling)
Default: 1

	"ratio": array of int

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
Under-sampling factor (0 disables under-sampling)
Default: 1

	"k": array of int

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
Number of the nearest neighbors for SMOTE oversampling of the minority class
Default: 5

	"nTrees": array of int

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
Number of trees in each ensemble
Default: 10

	"mtry": array of int

Mandatory: Yes
Exec: parSMURF1 / parSMURFn
mtry random forest parameter
Default: sqrt(m)

"autogp_params"
	"autogp_params": {
		"nParts" : {
			"name": "nParts",
			"type": "int",
			"min": int,
			"max": int,
			"size": 1
		},
		"fp" : {
			"name": "fp",
			"type": "int",
			"min": int,
			"max": int,
			"size": 1
		},
		"ratio" : {
			"name": "ratio",
			"type": "int",
			"min": int,
			"max": int,
			"size": 1
		},
		"k" : {
			"name": "k",
			"type": "int",
			"min": int,
			"max": int,
			"size": 1
		},
		"numTrees" : {
			"name": "numTrees",
			"type": "int",
			"min": int,
			"max": int,
			"size": 1
		},
		"mtry" : {
			"name": "mtry",
			"type": "int",
			"min": int,
			"max": int,
			"size": 1
		}
	}

Mandatory: No (Yes, if "optimizer" is set to "autogp")
Exec: parSMURF1 / parSMURFn
This section is used for defining the search space of the Bayesian optimizer. It is composed of six sub-fields, each one defining the search space of one learning parameter. The only parts that can be modified are the "min" and "max" fields of each parameter.
Every sub-field is mandatory. If the user needs to perform a partial search (i.e. tuning only some of the six parameters), please set the "min" and "max" values of the fixed parameters to the same value.
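
Since only the "min" and "max" values change between experiments, the whole section can be generated from a table of ranges. The Python sketch below is a hypothetical helper (the name autogp_params_section and the example ranges are assumptions, not part of the package); a parameter is kept fixed by giving it the same lower and upper bound, as suggested above.

```python
import json

PARAM_NAMES = ("nParts", "fp", "ratio", "k", "numTrees", "mtry")

def autogp_params_section(ranges):
    """Build the "autogp_params" dictionary from {name: (min, max)} ranges."""
    missing = [name for name in PARAM_NAMES if name not in ranges]
    if missing:
        raise ValueError("every sub-field is mandatory, missing: %s" % missing)
    section = {}
    for name in PARAM_NAMES:
        low, high = ranges[name]
        if low > high:
            raise ValueError("min is greater than max for %s" % name)
        # "name", "type" and "size" are kept exactly as in the template above.
        section[name] = {
            "name": name, "type": "int", "min": low, "max": high, "size": 1,
        }
    return section

# Partial search: only nParts and mtry are tuned, the others are fixed by
# setting min == max.
section = autogp_params_section({
    "nParts": (10, 50),
    "fp": (1, 1),
    "ratio": (1, 1),
    "k": (5, 5),
    "numTrees": (10, 10),
    "mtry": (2, 5),
})
print(json.dumps({"autogp_params": section}, indent=4))
```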


Data format

As previously stated, data is provided to the application in two or three files.

Data file

This file should contain the main data needed for computing the predictions. It consists of an n x m matrix of doubles, where n is the number of examples and m the number of features. The matrix is read row-wise, i.e.:

   | m1   m2   m3   m4 ...
---------------------------
n1 | ------------>
n2 | ------------>
n3 |
n4 |
.  |
.  |
.  |

Most, if not all, data files are already in this format, so just be sure that the number of features for each row is consistent across the samples.
The number of features is detected from the file itself - actually, from the number of items read in the first row.
All input files must be HEADERLESS.

Label file

This file should contain the labelling of the examples. It consists of n space or tab separated values, where n is the number of examples. It can also be a column vector file, i.e. newline separated values.
It is a plain text file where each positive example is marked with "1" and negative examples with "0".

Fold file

This optional file should contain the fold subdivision. If specified, examples will be divided in folds as specified in this file. If not, a random stratified division will be performed. This file consists of n space or tab separated integer values, where n is the number of examples. It can also be a column vector file, i.e. newline separated values.
It is a plain text file where each number represents the fold to which each example is assigned. Fold numbering starts from "0" (zero). Note that specifying the fold file name overrides the "nFolds" option in the configuration file.
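
For users working in Python, label and fold files in the column vector layout can be produced with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the checks only cover the constraints stated in this section (labels are 0 or 1, fold indices are non-negative integers).

```python
import os
import tempfile

def write_column(values, path):
    """Write one integer per line (column vector layout), without a header."""
    with open(path, "w") as handle:
        for value in values:
            handle.write("%d\n" % value)

def write_labels(labels, path):
    if any(y not in (0, 1) for y in labels):
        raise ValueError("labels must be 0 (negative) or 1 (positive)")
    write_column(labels, path)

def write_folds(folds, path):
    if any(not isinstance(f, int) or f < 0 for f in folds):
        raise ValueError("fold indices must be integers starting from 0")
    write_column(folds, path)

# Demonstration in a temporary directory, so that no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    label_path = os.path.join(tmp, "labels.txt")
    fold_path = os.path.join(tmp, "folds.txt")
    write_labels([0, 1, 0], label_path)
    write_folds([0, 1, 2], fold_path)
    with open(label_path) as handle:
        label_text = handle.read()
    with open(fold_path) as handle:
        fold_text = handle.read()

print(repr(label_text))  # '0\n1\n0\n'
```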

The following code snippet converts two R vectors into the corresponding labelling and folding files for proper use with this package:

write(vectorOfLabels, file = "labels.txt", sep = "\n")
write(vectorOfFolds, file = "folds.txt", sep = "\n")

Output file

Predictions will be saved as plain text file.
The output file consists of two columns of tab separated double values. For each sample, the probabilities of belonging to either class are saved: each value in the first column represents the probability of the associated sample being in the minority class, while each value in the second column represents the probability of being in the majority class.
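
The predictions can be post-processed with any tool that reads tab separated text. As an example, the Python sketch below extracts the minority-class probabilities and computes the average precision, a common estimate of the area under the precision-recall curve. The function names and the toy values are hypothetical, and parSMURF's internal AUPRC computation may differ from this estimate.

```python
def parse_predictions(lines):
    """Return the first-column values (probability of the minority class)."""
    scores = []
    for line in lines:
        fields = line.split()
        if fields:  # skip empty lines
            scores.append(float(fields[0]))
    return scores

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    # Rank the examples from the most to the least likely positive.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

# Toy output file content (two tab separated columns) and true labels.
lines = ["0.9\t0.1", "0.8\t0.2", "0.7\t0.3", "0.1\t0.9"]
labels = [1, 0, 1, 0]
scores = parse_predictions(lines)
print(round(average_precision(labels, scores), 4))  # 0.8333
```

Only the first column is read, so the same parser also works when additional columns are present in the output file.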

Note about dimensionality:
When reading data from file, parSMURF1 and parSMURFn automatically detect the number of samples and features, following these rules:

  • at first, the number of samples is detected from the label file.
  • then, the number of features is detected from the data file, by counting the number of items in the first row of the data file. Hence, the sizes of these files should be consistent, otherwise a warning message is printed to the console.
    Also, the number of folds is detected from the fold file if specified. In this case, the option "nFolds" in the configuration file is ignored, and the total number of folds will be equal to the number of the total unique elements of the fold file.
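
These detection rules can be reproduced in a few lines of Python, which is handy for checking a dataset before submitting a long run. This is a sketch under stated assumptions: the function name detect_dimensions is hypothetical, data values are assumed to be whitespace separated, and the consistency check shown here (comparing the number of data rows with the number of labels) is only one plausible reading of the warning mentioned above.

```python
import warnings

def detect_dimensions(label_text, data_text, fold_text=None):
    """Return (n_samples, n_features, n_folds) from the content of the files."""
    # split() handles space, tab and newline separated values alike.
    n_samples = len(label_text.split())
    rows = [row for row in data_text.splitlines() if row.strip()]
    # The number of features is taken from the first row only.
    n_features = len(rows[0].split())
    if len(rows) != n_samples:
        warnings.warn("data file has %d rows but %d labels were read"
                      % (len(rows), n_samples))
    n_folds = None
    if fold_text is not None:
        # The number of folds is the number of unique values in the fold file.
        n_folds = len(set(fold_text.split()))
    return n_samples, n_features, n_folds

label_text = "0\n1\n0\n"
data_text = "0.1\t0.2\n0.4\t0.5\n0.7\t0.8\n"
fold_text = "0 1 0"
dims = detect_dimensions(label_text, data_text, fold_text)
print(dims)  # (3, 2, 2)
```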

Random dataset generation

parSMURF1 and parSMURFn are provided with a random dataset generator for testing purposes.
When enabled, a random dataset is created according to two normal distributions having the same variance but different means, depending on whether an example falls in the positive or negative class.
The user enables this mode by setting the "simulation": true option in the "simulate" section of the configuration file.
The user must also specify the probability that an example belongs to the minority class ("prob": float) and the dimensionality of the dataset with the "n": int and "m": int options.
An additional column will be added to the output file, containing the labelling that has been randomly generated according to the "prob" value.
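
The generation scheme can be sketched in Python as follows. The means and the standard deviation used here are assumptions chosen for illustration (the values used internally by parSMURF are not documented in this file), and the function name generate_dataset is hypothetical.

```python
import random

def generate_dataset(n, m, prob, seed=None,
                     pos_mean=1.0, neg_mean=0.0, sigma=1.0):
    """Generate n examples with m features each, plus their labels.

    Each example is positive with probability prob; its features are drawn
    from a normal distribution whose mean depends on the class, while the
    standard deviation sigma is shared by both classes.
    """
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n):
        label = 1 if rng.random() < prob else 0
        mean = pos_mean if label == 1 else neg_mean
        data.append([rng.gauss(mean, sigma) for _ in range(m)])
        labels.append(label)
    return data, labels

# Same dimensionality as the simulCV.json example: 1200 examples, 25 features.
data, labels = generate_dataset(n=1200, m=25, prob=0.02, seed=1)
print(len(data), len(data[0]), sum(labels))
```

Because labels are drawn independently, the number of positives varies from seed to seed around n * prob.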


Examples

The cfgEx folder of the repository contains several examples of configuration files to be used either with parSMURF1 or parSMURFn.

  • simulCV.json (for parSMURF1): it generates a random dataset of 1200 examples and 25 features; the probability of a positive example is very low (0.02). Executes a 10-fold cross validation with random stratified fold subdivision. Learning parameters are fixed to: nParts = 10, fp = ratio = 1, k = 5, nTrees = 10, mtry = 5. Results are saved into the "predicitons.txt" file. A report of the execution time is also generated in the timeout.txt file. Seed fixed at 1. parSMURF spawns 4 threads for partition processing, and for each one of them it spawns another thread for random forest train and test.
  • simulCVn.json (for parSMURFn): it generates a random dataset of 12000 examples and 75 features; the probability of a positive example is very low (0.025). Executes a 10-fold cross validation with random stratified fold subdivision. Learning parameters are fixed to: nParts = 100, fp = 1, ratio = 2, k = 3, nTrees = 100, mtry = 9. Results are saved into the "predicitons.txt" file. A report of the execution time is also generated in the timeout.txt file. Seed fixed at 1. It must be launched as mpirun -n 5 ./parSMURFn --cfg simulCVn.json, so that 4 worker processes are spawned, each one with 6 threads for partition processing, and for each of them 2 threads for random forest train and test are spawned. This run also logs all the MPI API calls to stdout.
  • dataFromFile.json (for parSMURF1): executes a 10-fold cross validation over the dataset read from file. Fold subdivision is specified in the "folds.txt" file. No hyper-parameter autotuning.
  • gridTune.json (for parSMURF1): data is read from file, as is the labelling. Folds are randomly generated. Executes a partial automatic tuning of the learning parameters over a 5-fold cross validation. Parameters to be tuned are: nParts, fp and mtry. This configuration generates 18 possible hyper-parameter combinations that are tested in the internal cross validation. AUPRC results for each combination are saved in the files "fold0.dat" to "fold4.dat". It also generates a prediction file containing the predictions for each fold obtained by the best hyper-parameter combination for that fold.
  • gridTune2.json (for parSMURF1): as in gridTune.json, but the whole procedure is executed over folds 3 and 4 only.
  • train.json (for parSMURFn): data is read from file, as is the labelling. Parameter "nFolds" is ignored. Treats the whole dataset as training set and generates a trained model. The model is saved in the "/home/user01/models/trainedModel/" folder. It must be launched as mpirun -n 2 ./parSMURFn --cfg train.json. 1 worker process, with 3 threads for partition processing and 4 for random forest train and test. Logs are more verbose than in the previous examples. Multithreading in the master process is disabled.
  • autoGpTune.json (for parSMURFn): full auto-tuning of the learning parameters via Bayesian optimization. Data, labels and fold subdivision are read from file. Parameter "nFolds" is ignored. The "params" section of the config file is ignored as well. The parameter search space is defined as follows: nParts in [10, 50], fp in [1, 3], ratio in [1, 3], k in [2, 6], nTrees in [5, 10], mtry in [2, 5].

License

This package is distributed under the GNU GPLv3 license. Please see the http://github.com/anacletolab/parSMURF/LICENSE file for the complete version of the license.

parSMURF includes several third-party libraries which are distributed with their own license. In particular, source code of the following libraries is included in this package:

ANN: Approximate Nearest Neighbor Searching
David M. Mount and Sunil Arya
Version 1.1.2
(https://www.cs.umd.edu/~mount/ANN/)
Modified and redistributed under the GNU Lesser General Public License v2.1
Copy of the license is available in the src/ann_1.1.2 directory

Ranger: A Fast Implementation of Random Forests
Marvin N. Wright
Version 0.11.1
(https://github.com/imbs-hl/ranger)
Modified and redistributed under the MIT license
Copy of the license is available in the src/ranger folder

Spearmint
Jasper Snoek, Hugo Larochelle and Ryan P. Adams
(https://github.com/JasperSnoek/spearmint/)
Modified and redistributed under the GNU General Public License v3
Copy of the license is available in the src/spearmint/spearmint folder

Also, parSMURF uses several libraries whose source code is not included in the package, but it is automatically downloaded at compile time. These libraries are:

Easylogging++
Zuhd Web Services
(https://github.com/zuhd-org/easyloggingpp)
Distributed under the MIT license
Copy of the license is available at the project homepage

Jsoncons
Daniel Parker
(https://github.com/danielaparker/jsoncons)
Distributed under the Boost license
Copy of the license is available at the project homepage

zlib
Jean-loup Gailly and Mark Adler
(https://github.com/madler/zlib)
Distributed under the zlib license
Copy of the license is available at the project homepage