/luminate

Denoising longitudinal microbiome data.

Primary LanguagePython

LUMINATE

LUMINATE (longitudinal microbiome inference and zero detection) includes four programs for inference in longitudinal microbiome datasets:

  • estimate: estimate relative abundances and posterior probabilities of biological zeros.
  • train: estimate model parameters for compositional Lotka-Volterra (cLV).
  • predict (experimental): predict longitudinal trajectories one time point at a time.
  • plot: generate a stacked bar plot of relative abundances and external effects

Dependencies

Program dependencies are listed in requirements.txt. Install dependencies with pip install -r requirements.txt.

Quick Start

See ./run-examples.sh to run each program on an example dataset.

Input

Each program takes as input an OTU Table (taxa by samples) with the following structure:

  • The first row gives IDs for each longitudinal sequence.
  • The second row gives sample times.
  • The first column gives taxon names.

See datasets/bucci2016mdsine/cdiff-counts.csv for an example.

A second optional table providing external perturbations can be specified. This is a .csv file where the columns are sequenceID,eventID,startDay,endDay,magnitude(optional)

The first column links an event to a longitudinal sample through sequenceID. This should correspond to an ID in the OTU Table. The second column gives a name for each event: parameters are inferred for each unique event name.

The start day and stop day give a range of time over which an event occurs (end point included). For example, these could be a range of time when a patient receives antibiotics. For one time events, set startDay=stopDay.

The magnitude column is used to estimate the parameters of cLV. If unspecified, events are treated like indicator variables: set to 1 during the time of an event and 0 otherwise.

Worked Example

Here we run through a typical workflow for LUMINATE using the data in bucci2016mdsine. First, let's take a look at the data by running

python main.py plot \
               datasets/bucci2016mdsine/cdiff-counts.csv \
               -e datasets/bucci2016mdsine/cdiff-events.csv \
               -i datasets/bucci2016mdsine \
               -o datasets/bucci2016mdsine

This will produce one stacked bar plot per sequence of relative abundances over time, along with indicators for external events. The top 19 taxa each receive a unique color, while the remaining taxa are aggregated in a single component.

Here is an example, using the first sequence in the C. diff dataset:

Example plot

Estimate

Next, we want to estimate relative abundances from sequencing counts, and save the results:

python main.py estimate \
               datasets/bucci2016mdsine/cdiff-counts.csv \
               -e datasets/bucci2016mdsine/cdiff-events.csv \
               -o datasets/bucci2016mdsine

The second position argument specifies the path to the OTU table. The optional argument -e (--effects) specifies a filepath to the .csv of external events. Finally -o (--outdir) gives a directory to store results for downstream analysis.

Running estimate produces 3 files:

  • cdiff-counts-est.csv : the estimated relative abundances
  • cdiff-counts-nonzero-posterior-prob.csv: posterior probabilities of sampling zeros (i.e. a taxon is below the detection threshold as opposed to absent from the community)
  • P.pkl : a temporary result used for downstream analysis

Let's plot the estimated relative abundances to ensure nothing went wrong:

python main.py plot \
               datasets/bucci2016mdsine/cdiff-counts-est.csv \
               -e datasets/bucci2016mdsine/cdiff-events.csv \
               -i datasets/bucci2016mdsine \
               -o datasets/bucci2016mdsine

Estimated relative abundances

Note on data sparsity

The information available to reconstruct a trajectory is proportional to the amount of nonmissing (nonzero) data. For ultra-sparse taxa (>90% missing entries), there is little information its trajectory over time. We recommend filtering out such taxa as a pre-processing step.

Train

Note: Estimates of cLV parameters with LUMINATE is considered experimental. To train model parameters using pseudo-counts instead of denoised estimates, pass the --use-pseudo-counts flag instead.

Next, we want to estimate the parameters of cLV using the estimated relative abundances:

python main.py train \
               datasets/bucci2016mdsine/cdiff-counts-est.csv \
               -e datasets/bucci2016mdsine/cdiff-events.csv \
               -i datasets/bucci2016mdsine \
               -o datasets/bucci2016mdsine \
               -b 200

The -i (--indir) flag tells LUMINATE to look in this directory for estimated relative abundances (P.pkl). If this is not found, then estimate will run by default. The -o (--outdir) flag tells LUMINATE to save model parameters to this folder. Finally, the -b (--bootstrap) flag estimates one-sided p-values for model parameters.

This outputs several files

	A : the matrix of taxa interactions
	g : relative growth rate vector
	B : effects of external perturbation

The rows of each matrix, A, g, B, are coordinates under the additive log ratio transformation. Bootstrap one-sided p-values are in A_pval, g_pval, B_pval. These are the estimated probabilities that the nonzero coefficients estimated on the training data have the same sign. For example, if A_{ij} = 1 the computed p-value is Prob(A_{ij} > 0). Note, the minimum possible p-value is 1/b so p-values below this threshold will appear as zeros.

Predict (Experimental)

Finally, if we want to predict trajectories one step at a time we can call

python main.py predict \
               datasets/bucci2016mdsine/cdiff-counts-est.csv \
               -e datasets/bucci2016mdsine/cdiff-events.csv \
               -i datasets/bucci2016mdsine \
               -o datasets/bucci2016mdsine \
               --one-step

This produces an OTU table of predicted trajectories. Each trajectory is predicted one time point in the future.

Plotting the results:

python main.py plot \
               datasets/bucci2016mdsine/cdiff-counts-est-pred.csv \
               -e datasets/bucci2016mdsine/cdiff-events.csv \
               -i datasets/bucci2016mdsine \
               -o datasets/bucci2016mdsine

Estimated relative abundances

To make a prediction from initial conditions, omit the --one-step flag:

python main.py predict \
               datasets/bucci2016mdsine/cdiff-counts-est.csv \
               -e datasets/bucci2016mdsine/cdiff-events.csv \
               -i datasets/bucci2016mdsine \
               -o datasets/bucci2016mdsine

Program Options

usage: main.py [-h] [-e EVENTS] [-o OUTDIR] [-i INDIR] [-b BOOTSTRAP] [-s]
               command otu-table

Time-series modeling for the microbiome

positional arguments:
  command               Specify analysis to run. One of:
                        train,predict,estimate.
  otu-table             Filepath to OTU table csv.

optional arguments:
  -h, --help            show this help message and exit
  -e EVENTS, --events EVENTS
                        Filepath to table of external events.
  -o OUTDIR, --outdir OUTDIR
                        Specify output directory to store results. Default is
                        current directory.
  -i INDIR, --indir INDIR
                        Specify input directory to load previously computed
                        parameters: typically the OUTDIR from a previous run.
  -b BOOTSTRAP, --bootstrap BOOTSTRAP
                        Perform bootstrap resampling to estimate one-sided
                        p-values of cLV coefficients. The argument specifies
                        the number of bootstrap replicates to perform. Will
                        produce a warning if the sample size is too small
                        (N<30).
  -s, --one-step        Perform one-step prediction instead of prediction from
                        initial conditions.
  -p, --use-pseudo-count
                        Estimate relative abundances using pseudo-counts instead of denoising step.

Code to Replicate Simulation Experiments

The directory experiments contains simulation code to regenerate simulated datasets used for evaluation in the manuscript. The script simulate_glv_from_data.py generates the simulations used the main model comparison. The script simulate_glv_with_zeros.py generates the simulations used in the biological zero evaluation. Simulated datasets are output to experiments/datasets.