mSWEEP

Fast and accurate bacterial community composition estimation on within-species level by using pseudoalignments and variational inference.

More about mSWEEP in the article High-resolution sweep metagenomics using fast probabilistic inference in Wellcome Open Research.

Installation

In addition to mSWEEP, you will need to install Themisto for pseudoalignment.

Conda

Install mSWEEP from bioconda with

conda install -y -c bioconda -c conda-forge msweep

check that the installation succeeded by running

mSWEEP --help

Precompiled binaries

Precompiled binaries are available for

Linux x86_64
macOS arm64
macOS x86_64

from the Releases page.

Building from source

Requirements

C++17 compliant compiler.
cmake (v3.11 or newer)
git

Optional

Compiler with OpenMP support.

If your compiler does not support OpenMP, mSWEEP can only be run in single-threaded mode. The prebuilt binaries are compiled with OpenMP support.

Compiling

Clone the mSWEEP repository

git clone https://github.com/PROBIC/mSWEEP.git

enter the directory and run

> mkdir build
> cd build
> cmake ..
> make

This will compile the mSWEEP executable in build/bin/mSWEEP.

For more info on compiling mSWEEP from source, please see the documentation on compiling mSWEEP.

Enabling GPU acceleration

Compiling mSWEEP with GPU support requires installing

LibTorch
... and CUDA Toolkit (LibTorch w/ CUDA support)
... or ROCm (LibTorch w/ ROCm)

then, build mSWEEP with

> mkdir build
> cd build
> cmake -DCMAKE_LIBTORCH_PATH=/absolute/path/to/libtorch ..
> make

where /absolute/path/to/libtorch should be the absolute (!) path to the root of the LibTorch distribution.

Compiling mSWEEP with LibTorch support enables the rcggpu and emgpu options for --algorithm which directs the abundance estimation to run on the GPU if one is available.

Both algorithms can also be run on the CPU. Compared to the default algorithm, on the CPU

rcggpu is faster but uses more memory.
emgpu is slower but uses less memory.

See docs/gpubenchmarks.md for more details.

Usage

More information about using mSWEEP is available in the usage documentation.

Abundance estimation

Estimate relative abundances of Themisto pseudoalignments fwd.txt and rev.txt using the lineages in clustering.txt with two threads by running

mSWEEP --themisto-1 fwd.txt --themisto-2 rev.txt -i clustering.txt -t 2

mSWEEP --themisto fwd.txt,rev.txt -i clustering.txt -t 2

Both commands above will print the results. To write the result into a file called result_abundances.txt instead, run

mSWEEP --themisto-1 fwd.txt --themisto-2 rev.txt -i clustering.txt -t 2 -o result

Binning reads

Estimate relative abundances and bin reads for all lineages with a relative abundance higher than 0.01 by running

mSWEEP --themisto-1 fwd.txt --themisto-2 rev.txt -i clustering.txt -t 2 --min-abundance 0.01

Extract reads only for reference lineages lineage_1 and lineage_2 by running

mSWEEP --themisto-1 fwd.txt --themisto-2 rev.txt -i clustering.txt -t 2 --target-groups lineage_1,lineage_2

Extract reads only for reference lineages lineage_1 and lineage_2 if their relative abundance is higher than 0.01 by running

mSWEEP --themisto-1 fwd.txt --themisto-2 rev.txt -i clustering.txt -t 2 --target-groups lineage_1,lineage_2 --min-abundance 0.01

i.e. the file format is automatically detected (alignment-writer v0.4.0 and newer).

QC'ing binned reads

We recommend running demix_check on the binned reads and/or checkm on the bin-assembled genomes (BAGs) to evaluate the accuracy of the results.

Working with large alignment files

Compressing Themisto output files

For complex input data with many organisms, the pseudoalignment files from Themisto can get infeasibly large. In these cases, alignment-writer can be used to compress the alignment files to <10% of the original size.

mSWEEP >=v2.0.0 can read the compressed alignments in directly by running

mSWEEP --themisto-1 fwd_compressed.aln --themisto-2 rev_compressed.aln -i clustering.txt -t 2

Running estimation on large sparse alignments

If the target alignment is sparse, meaning that there are target groups which have few/no reads aligning against them in the whole sample, mSWEEP can be instructed to ignore these in the estimation by adding the --min-hits 1 flag:

mSWEEP --themisto sparse.aln -i clustering.txt -t 2 --min-hits 1

This will reduce the runtime and memory use of the estimation proportional to how many target groups are removed. Using --min-hits 1 does not affect the results beyond differences in computational accuracy.

The --min-hits flag also accepts values higher than 1 for pruning target groups with a small number of aligned reads. Using a value higher than 1 will change the resulting values.

(experimental) Reliability of abundance estimates

Add the --run-rate flag to calculate a relative reliability value for each abundance estimate using a variation of the RATE method

mSWEEP --themisto compressed.aln -i clustering.txt -t 2 --run-rate

This will append the RATE and KLD columns to the output. RATE values that exceed 1/(number of lineages in clustering.txt) are considered reliably estimated.

If the reference contains many sequences that have zero or very few pseudoalignments, the denominator should be set to the number of lineages that have a nonzero abundance estimate instead of the total lineage count.

A reliably estimated value means that an abundance estimate from mSWEEP has a large effect on how well the statistical model in mSWEEP fits to the input alignment data. This translates to a high value in the KLD column and the RATE columns, which is derived from the KLD values by dividing each value by the sum of all KLDs.

RATE as implemented in mSWEEP has not been tested thoroughly and is considered experimental. Consider using additional methods to verify the correctness of your results after filtering by RATE.

More options

mSWEEP additionally accepts the following flags:

Usage: mSWEEP --themisto-1 <forwardPseudoalignments> --themisto-2 <reversePseudoalignments> -i <groupIndicatorsFile>
--verbose	Print status messages to cerr.
--version	Print mSWEEP version.
--cite	Print citation information.
--help	Print the help message.

Pseudoalignment files (required: -1 and -2, or only -x; will try to read from cin if none are given):
--themisto-1	Pseudoalignment results from Themisto for the 1st strand of paired-end reads.
--themisto-2	Pseudoalignment results from Themisto for the 2nd strand of paired-end reads.
--themisto	Single themisto alignment file or a comma separated list of several files.

Group indicators (required):
-i	Group indicators for the pseudoalignment reference.

Output prefix:
-o	Prefix for output files written from mSWEEP (default: print to cout).

Binning options:
--bin-reads	Run the mGEMS binning algorithm and write bins to the directory `-o` points to (default: false).
--target-groups	Only extract these groups, supply as comma separated list (default: extract all groups).
--min-abundance	Only extract groups that have a relative abundance higher than this value (default: 0).

Output options:
--write-probs	If specified, write the estimated read-to-group probabilities to a file with "_probs.tsv" suffix (default:false).
--print-probs	Print the read equivalence class probabilities to cout even if `-o` is given (default: false).
--write-likelihood	Write the internal likelihood matrix to a file with "_likelihoods.txt" suffix (default: false).
--write-likelihood-bitseq	Write the likelihoods in a format can be parsed by BitSeq's (https://github.com/bitseq/bitseq) functions (default: false).
--compress	Compress all output files using the given algorithm (one of z, bz2, lzma; default: don't compress).
--compression-level	Compression level (0-9; default: 6).

Input options:
--themisto-mode	How to merge pseudoalignments for paired-end reads (intersection, union, or unpaired; default: intersection).
--read-likelihood	Path to a precomputed likelihood file written with the --write-likelihood toggle. Can't be used with --bin-reads.

Estimation options:
-t	How many threads to use in abundance estimation (default: 1).
--no-fit-model	Do not estimate the abundances. Useful if only the likelihood matrix is required (default: false).
--max-iters	Maximum number of iterations to run the abundance estimation optimizer for (default: 5000).
--tol	Optimization terminates when the bound changes by less than the given tolerance (default: 0.000001).
--algorithm Which algorithm to use for abundance estimation (one of rcggpu, emgpu, rcgcpu (original mSWEEP); default: rcgcpu).
--emprecision   Precision to use for the emgpu algorithm (one of float, double; default: double).

Bootstrapping options:
--iters	Number of times to rerun estimation with bootstrapped alignments (default: 0).
--seed	Seed for the random generator used in bootstrapping (default: random).
--bootstrap-count	How many pseudoalignments to resample when bootstrapping (default: number of reads).

Likelihood options:
-q	Mean for the beta-binomial component (default: 0.65).
-e	Dispersion term for the beta-binomial component (default: 0.01).
--alphas	Prior counts for the relative abundances, supply as comma-separated nonzero values (default: all 1.0).
--zero-inflation	Likelihood of an observation that contains 0 pseudoalignments against a reference group (default: 0.01).

Experimental options:
--run-rate	Calculate relative reliability for each abundance estimate using RATE (default: false).
--min-hits	Only consider target groups that have at least this many reads align to any sequence in them (default: 0).

References

Abundance estimation

If you use mSWEEP for abundance estimation, please cite us as

Mäklin T, Kallonen T, David S, Boinett CJ, Pascoe B, Méric G, Aanensen DM, Feil EJ, Baker S, Parkhill J, Sheppard SK, Corander J, and Honkela A
High-resolution sweep metagenomics using fast probabilistic inference [version 2; peer review: 2 approved]
Wellcome Open Resesearch 5:14 (2021)
https://doi.org/10.12688/wellcomeopenres.15639.2

Binning

The binning algorithm enabled by the --bin-reads toggle is described in

Mäklin T, Kallonen T, Alanko J, Samuelsen Ø, Hegstad K, Mäkinen V, Corander J, Heinz E, and Honkela A
Bacterial genomic epidemiology with mixed samples
Microbial Genomics 7:11 (2021)
https://doi.org/10.1099/mgen.0.000691

Thee binning algorithm is also provided as the standalone software mGEMS.

Specific versions

If you wish to cite a specific version of mSWEEP, visit the releases page and find the doi for the version of the program that you used. Then, cite the version as

Tommi Mäklin, and Antti Honkela. (2021).
PROBIC/mSWEEP: mSWEEP v1.5.0 (15 October 2021)
(v1.5.0). Zenodo. (https://doi.org/10.5281/zenodo.5571944)

License

The source code from this project is subject to the terms of the MIT license. A copy of the MIT license is supplied with the project, or can be obtained at https://opensource.org/licenses/MIT.