robustness: An R repository from borenstein-lab

This file contains instructions for obtaining and running this microbial community taxa-function robustness estimation pipeline. This pipeline takes as input a table of microbial read counts or abundances and produces an estimation of the community's taxa-function robustness. Additionally, various gene distribution feature values are calculated for the original community.

DOWNLOAD:

The source code for this microbial community taxa-function robustness estimation pipeline can be obtained by cloning this git repository:

git clone git@github.com:borenstein-lab/robustness.git

Dependencies

This pipeline relies on the following R packages that may not be installed by default: argparse, phyloseq, and RcppEigen.

Downloading required data files

PICRUSt and Greengenes files used by default by this pipeline should be downloaded from their original sources. To run the default pipeline, users should download the PICRUSt 16S rRNA gene copy number table and genomic content annotation table, saving them to the data directory. Additionally, the Greengenes tree (specifically gg_13_5_otus_99_annotated.tree.gz) can be downloaded from the Greengenes site (here) and unzipped and saved to the data directory.

USE:

The pipeline follows three main steps:

Generate perturbed taxonomic and functional compositions for each initial community composition
Measure taxonomic and functional differences between initial and perturbed compositions
Calculate robustness metrics

1. Perturbing community compositions

The first step is performed using the generate_perturbations.R script as follows:

src/generate_perturbations.R initial_community_taxonomic_composition perturbed_taxonomic_compositions perturbed_function_compositions perturbation_weighted_unifrac_distances

where the initial_community_taxonomic_composition file should have two columns, each with column headers. The header of the second column should be the name or ID for the initial community. The first column should contain the ID for a taxon while the second column should contain the corresponding taxon's relative abundance in the initial community.

There are several parameters that can be adjusted for this step, and all can be found by running

src/generate_perturbations.R

but the main parameters that may be of interest to modify are the set of maximum perturbation magnitudes to use (the --perturbation_list option), the number of replicates for each maximum perturbation magnitude (the --num_replicates option), the minimum and maximum range of Weighted UniFrac distances to generate perturbations for (the --range_min and --range_max options respectively), and the number of windows and perturbations per window to use when filtering the generated perturbations (the --num_windows and --num_perturbations options respectively).

2. Measuring taxonomic and functional differences

The second step is performed using the summarize_perturbations.R script as follows:

src/summarize_perturbations.R perturbed_taxonomic_compositions perturbed_function_compositions perturbation_weighted_unifrac_distances function_distribution_features perturbation_distances function_specific_differences

where perturbed_taxonomic_compositions, perturbed_function_compositions, and perturbation_weighted_unifrac_distances should be the output from the previous step in the pipeline.

3. Calculating robustness metrics

The third and final step is performed using the calculate_robustness_factors.R and calcualte_function_specific_robustness_factors.R scripts as follows:

src/calculate_robustness_factors.R perturbation_distances robustness_factors
src/calculate_function_specific_robustness_factors.R function_specific_differences function_specific_robustness_factors

where perturbation_distances and function_specific_differences should be the output from the previous step in the pipeline.