/fastspar

Rapid and scalable correlation estimation for compositional data

Primary LanguageC++GNU General Public License v3.0GPL-3.0

FastSpar

Build Status License

Rapid and scalable correlation estimation for compositional data

Table of contents

Introduction

FastSpar is a C++ implementation of the SparCC algorithm which is up to several thousand times faster than the original Python2 release and uses much less memory. The FastSpar implementation provides threading support and a p-value estimator which accounts for the possibility of repetitious data permutations (see this paper for further details).

An important step of correlation analysis is removal of noise and dimension reduction. A common method to perform this is distribution-based clustering of OTUs. The aim is to reunite OTUs derived from sequencing error with the parent OTU by clustering raw OTUs based on nucleotide edit distance and count distribution. FastSpar is paired with an efficient implementation of the popular distribution-based clustering method dbOTU3.

Citation

If you use this tool, please cite the FastSpar paper and original SparCC paper:

Requirements

There are no requirements for using the pre-compiled static binaries on 64-bit linux distributions. Otherwise, there are several libraries which are required for building and running dynamically linked binaries. For further information, see Compiling from source.

Installing

FastSpar can be installed via pre-compiled binaries, Bioconda, or from source.

GNU/Linux

For most 64-bit linux distributions (e.g. Ubuntu, Debian, RedHat, etc) the easiest way to obtain FastSpar is via statically compiled binaries on the releases page. These binaries can be downloaded and run immediately without any setup as they have no dependencies.

Bioconda

FastSpar is available through Bioconda (thanks to @epruesse):

conda install -c bioconda -c conda-forge fastspar

Compiling from source

Compiling from source requires these libraries and software:

C++11 (gcc-4.9.0+, clang-4.9.0+, etc)
OpenMP 4.0+
Gfortran
Armadillo 6.7+
LAPACK
BLAS (OpenBLAS is recommended)
GNU Scientific Library 2.1+
GNU getopt
GNU make
GNU autoconf
GNU autoconf-archive
GNU m4

After meeting the above requirements, compiling and installing FastSpar from source can be done by:

git clone https://github.com/scwatts/fastspar.git
cd fastspar
./autogen.sh
./configure --prefix=/usr/
make
make install

Once completed, the FastSpar executables can be run from the command line.

Usage

Correlation inference

To run FastSpar, you must have absolute OTU counts in BIOM tsv format file (with no metadata). The fake_data.tsv (from the original SparCC implementation) will be used as an example:

fastspar --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv

The number of iterations (rounds of SparCC correlation estimation) and exclusion iterations (the number of times highly correlation OTU pairs are discovered and excluded) can also be tweaked:

fastspar --iterations 50 --exclude_iterations 20 --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv

Further, the minimum threshold to exclude correlated OTU pairs can be increased:

fastspar --threshold 0.2 --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv

Calculation of exact p-values

There are several methods to calculate p-values for inferred correlations. Here we have elected to use a robust permutation based approach. This process involves inferring correlation from random permutations of the original OTU count data. The magnitude of each p-value is related to how often a more extreme correlation is observed for randomly permutated data. In the below example, we calculate p-values from 1000 bootstrap correlations.

First we generate the 1000 bootstrap counts:

mkdir bootstrap_counts
fastspar_bootstrap --otu_table tests/data/fake_data.tsv --number 1000 --prefix bootstrap_counts/fake_data

And then infer correlations for each bootstrap count (running in parallel with all processes available):

mkdir bootstrap_correlation
parallel fastspar --otu_table {} --correlation bootstrap_correlation/cor_{/} --covariance bootstrap_correlation/cov_{/} -i 5 ::: bootstrap_counts/*

From these correlations, the p-values are then calculated:

fastspar_pvalues --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --prefix bootstrap_correlation/cor_fake_data_ --permutations 1000 --outfile pvalues.tsv

Threading

If FastSpar is compiled with OpenMP, threading can be used by invoking --threads <thread_number> at the command line for several tools:

fastspar --otu_table tests/data/fake_data.txt --correlation median_correlation.tsv --covariance median_covariance.tsv --iterations 50 --threads 10

Contributors

  • Scott Ritchie
    • Advised on use of permutation based statistical testing
    • Provided an example use of statmod::permp

License

GNU General Public License v3.0