/prosolo

ProSolo, variant calling from single cell DNA-seq data, or: bulk backing vocals for single cell solos.

Primary LanguageRustGNU General Public License v3.0GPL-3.0

ProSolo: bulk backing vocals for single cell solos

install with bioconda Anaconda-Server Badge Zenodo DOI

ProSolo is a variant caller for multiple displacement amplified DNA sequencing data from diploid single cells. It relies on a pair of samples, where one is from an MDA single cell and the other from a bulk sample of the same cell population, sequenced with any next-generation sequencing technology.

It uses an extension of the novel latent variable model of Varlociraptor, that already integrates various levels of uncertainty. It adds a layer that accounts for amplification biases and errors of MDA, and thereby allows to properly asses the probability of having a variant or any other single cell event in the MDA single cell.

In the future, ProSolo will also implement indel calling, but currently, only the the single-cell-bulk subcommand with the --omit-indels flag is recommended.

Installation

ProSolo is available via Bioconda, a distribution of bioinformatics software for the conda package manager. Bioconda can be set up in any Linux environment, even without admin rights. With Bioconda set up, ProSolo can be installed into a dedicated conda environment via

$ conda create -n prosolo prosolo

To then use it, you only need to activate its environment with

$ conda activate prosolo

Usage

For general usage, please issue prosolo --help on the command line. Help is also available for all subcommands listed there, e.g. prosolo single-cell-bulk --help.

Variant Calling

To try out calling command syntax, please use the test data in the repo folder tests/ as follows:

prosolo single-cell-bulk \
    --omit-indels \
    --candidates tests/candidates.bcf \
    --output test-out_omit-indels.bcf \
    tests/single-cell.bam \
    tests/bulk.bam \
    tests/ref.fa

As input, you always need:

  1. Aligned reads of a single cell sample, in position sorted BAM or SAM format.
  2. Aligned reads from a background bulk sample, in position sorted BAM or SAM format.
  3. The reference FASTA file used in the alignment.
  4. A file specifying which genomic sites to call and what possible alternative alleles to look at. This can be any VCF or BCF file, e.g. produced by freebayes or samtools mpileup.

If you want to parallelize variant calling, you can parallelize over genomic regions. Simply split your --candidates files into non-overlapping regions or even already parallelize the candidate file generation over genomic regions.

Controlling the false discovery rate

To control the false discovery rate (FDR) for the minimal output of the above example, use its expected output data in the repo folder tests/ as follows:

prosolo control-fdr \
    tests/expected-out_omit-indels.bcf \
    --events ADO_TO_REF,ADO_TO_ALT \
    --var SNV \
    --fdr 0.04
    --output ADO_fdr_0-04.bcf

In this case, we are jointly controlling the FDR for all Events that are allele dropouts in the single cell sample. For this set of Events, the above command will print a BCF file with all the entries with an estimated false discovery rate below 0.04.

However, you can control the false discovery rate for any combination of events. The output of the prosolo single-cell-bulk command will always contain a PROB_ field for every possible single cell event, and you can simply name the Events you jointly care about by omitting this prefix. E.g. to control the false discovery rate of the presence of an alternative allele, you would use --events HET,HOM_ALT,ADO_TO_ALT,ADO_TO_REF,ERR_REF to control for all the single cell events that imply the presence of an alternative allele.

Using ProSolo in a pipeline

For systematic analyses of multiple single cell samples, we recommend using the Snakemake workflow management system. To make setting up your own ProSolo workflow as easy as possible, we have created Snakemake wrappers for ProSolo's two subcommands, prosolo single-cell-bulk and prosolo control-fdr. For general usage of Snakemake wrappers, see the Snakemake wrappers documentation. And also have a look around for other wrappers: for most tools that you might need for preprocessing ProSolo input or for working with ProSolo results, there should already be Snakemake wrappers that are easy to integrate into a Snakemake pipeline.

Authors

Model

Implementation

Supervision

Affiliations

Affiliations during work on the project: