This project aims at providing a fast and flexible solution for the estimation of covariate effects on individual microbes of a sequenced metagenomic sample.
├── data
│ ├── __init__.py
│ └── simulated <-- simulated datasets
│
├── lib <-- dmbvs binary location
│
├── utils <-- package with internal functions
│ ├── __init__.py
│ ├── cli.py <-- command line interface
│ ├── data.py <-- data processing and internal representation
│ ├── distributions.py <-- implementation of custom distributions (Dirichlet-Multinomial)
│ ├── dmbvs.py <-- wrapping functions to call dmbvs
│ ├── result_analysis.py <-- computation of summary statistics for traces
│ ├── resultfile_processing.py <-- formatting and parsing of result file names
│ └── sampling.py <-- functions to setup hmc sampling
│
├── README.md <-- this file
├── analyse_single_trace.py <-- compute summary statistics and inclusion probabilities
| for a single trace
├── benchmark_oracle_guess.py <-- run sampling on multiple datasets and with(out) oracle guess
├── combine_dmbvs_results.py <-- read results from individual dmbvs runs and combine them into
| a joint dataset contain information about dataset parameters
├── combine_hmc_results.py <-- read results from multiple hmc sampling runs of the devised model and
| combine them into a joint dataset
├── dm_regression_model.py <-- definition of multiple models based on the sparsity inducing horseshoe prior
| and the Dirichlet-Multinomial distribution
├── dmbvs_wrapper.py <-- run dmbvs algorithm on multiple datasets
└── run_sampling.py <-- run hmc sampling on model
To get started after installing the software, it is required to provide per sample species wise read counts and covariates in a tab separated values file (tsv).
Alternatively, the files in the folder data/simulated
can be used for testing and benchmarking. In this folder simulated simulated datasets of different dimensionality and data availability can be found. The folder names follow the scheme <#species/OTUs>O_<#covaraites>C_<#covariates relevant>p0_<#samples>S_<replica>R
and contain the files alphas.tsv
(groundtruth regression intercepts), betas.tsv
(groundtruth regression coefficients), YY.tsv
(sample wise read counts of species abundances) and XX.tsv
(sample wise covariate values).
Using the run_sampling.py
script, the parameters of a statistical model can be inferred using MCMC sampling.
This script additionally generates a file with summary statistics and inclusion probabilities of individual covariates.
The detailed usage of this script is described below.
This project is written in python 3.6, thus a python interpreter with at least this version should be installed for the software to function. The software was tested on UNIX based operating systems (Linux and Mac OS X). Although much effort was put into writing platform independent code, due to insufficient testing problems could arise on Windows operating systems.
Clone the repository and change to the root directory.
Add the root directory to your PYTHONPATH
environment variable (e.g. via the .bashrc file) like this:
export PYTHONPATH=$PYTHONPATH:/path/to/HorseshoeDMRegression
Activate your virtual environment, if you intend to use one (recommended). Install the requirements with
pip install -r requirements.txt
The parameters of statistical models are inferred using the run_sampling.py
script:
usage: run_sampling.py [-h] [--transpose-counts] --estimated_covariates
ESTIMATED_COVARIATES -o OUTPUT [--n_chains N_CHAINS]
[--n_tune N_TUNE] [--n_draws N_DRAWS] [--seed SEED]
[--model_type {DMRegression,DMRegressionMixed,DMRegressionDMixed,MvNormalDMRegression,SoftmaxRegression}]
[--traceplot] [--save_model] [--save_trace]
countdata metadata
Infer covariate effects using a dirichlet multinomial regression model with
horseshoe sparsity induction.
positional arguments:
countdata Read counts associated with individual microbes in tab
separated format with layout: Samples (rows) x Microbes
(columns).
metadata Covariates associated with samples in tab separated
format with layout: Samples (rows) x Covariates
(columns).
optional arguments:
-h, --help show this help message and exit
--transpose-counts Accept read counts in tab separated format with
Microbes (rows) x Samples (columns) layout instead.
--estimated_covariates ESTIMATED_COVARIATES
Guess on number of relevant covariates, provide -1 if
no guess should be used (scale of cauchy on tau = 1).
-o OUTPUT, --output OUTPUT
Directory to store results.
Sampling options:
--n_chains N_CHAINS Number of chains to runn in parallel.
--n_tune N_TUNE Number of tuning steps that should be descarded as
"burn-in".
--n_draws N_DRAWS Number of samples to generate after tuning
--seed SEED Random seed
--model_type {DMRegression,DMRegressionMixed,DMRegressionDMixed,MvNormalDMRegression,SoftmaxRegression}
Model type to use, further details see readme.
Output options:
--traceplot Save traveplot of sampling for diagnostics.
--save_model Save model.
--save_trace Store trace for later analysis.
Example running sampling on the 40O_10C_20p0_100S_1R
dataset:
python run_sampling.py data/simulated/40O_10C_20p0_100S_1R/YY.tsv data/simulated/40O_10C_20p0_100S_1R/XX.tsv -o sampling_testrun --save_trace --traceplot
Running this command will generate three files, trace.pck
containing sampler statistics and generated samples, traceplot.pdf
containing histograms of the sampled values for diagnostic purposes and variable_statistics.xlsx
containing summary statistics of inferred parameters and inclusion probabilities.
- DMRegression: Dirichlet multinomial regression model with horseshoe sparsity induction. Regression coefficients as well as intercepts are estimated jointly for all samples.
- DMRegressionMixed: Dirichlet multinomial regression model with horseshoe sparsity induction, hierarchical mixed model. Regression coefficients as well as intercepts are estimated jointly. For each patient, additional mixed effects are estimated. Requires additional patient column in covariate file to determine repeated measurements form same patient
- DMRegressionDMixed: Dirichlet multinomial regression model with horseshoe sparsity induction, hierarchical mixed model. Regression coefficients as well as intercepts are estimated jointly. Additional dirichlet clustered random effects are included to in the model, no patient column required. Assumes that certain groups can have distinct microbiome compositions (enterotypes).
- MvNormalDMRegression: Dirichlet multinomial regression model with horseshoe sparsity induction, intercepts sampled from Multivariate Normal. Regression coefficients are estimated jointly, intercepts are modeled to be independently drawn from a Multivariate Gaussian distribution of which the Covariance matrix and the Mean are estimated jointly for all samples.
The analyse_single_trace.py script, allows to compute summary statistics using a provided model, trace and input data.
The benchmark_oracle_guess.py script can be used to run sampling on multiple simulated datasets for performance benchmarking.
The combine_dmbvs_results.py and combine_hmc_results.py scripts were written to crawl all runs in a path that were either executed using benchmark_oracle_guess.py for hmc sampling or dmbvs_wrapper.py for the dmbvs algorithm, compute summary statistics of all runs and variables and store these computations in a table format to allow easy comparison of performance benchmarks.
The dmbvs_wrapper.py script can be used to run a competitor algorithm (dmbvs) for performance comparison. The script was implemented to remove the requirement of installing R
.
All scripts give further information on their usage using the command script.py --help
.