Toolkit to download datasets, generate virtual polyploids, phase them, and assess phasing quality.

Phasing Toolkit

The Phasing Toolkit is an open source collection of scripts which automate the steps of alignment-based phasing operations.

Currently, the focus is on making it easy to benchmark phasing algorithms with a focus on polyploid tools. Ultimately the goal is to develop a general purpose phasing toolkit. Everyone is welcome and encouraged to contribute to this project.

Installation

Install bioconda and set the correct channels by following the first two steps here: https://bioconda.github.io/user/install.html

Then you can create a new environment and install polyploidBenchmarking with the following commands

Create new conda environment

conda create -n polyploidBenchmarking python=3.8

Download prerequisite packages

conda activate polyploidBenchmarking
conda install -c oakheart nphase
conda install -c bioconda -c bioinf-mcb sra-downloader
conda install whatshap
conda install -c bioconda ncbi-genome-download
conda install psutil

If you wish to test flopp you must download it manually (any help automating this step would be appreciated), instructions are available here: https://github.com/bluenote-1577/flopp

Make sure "flopp" is in your $PATH (such that entering "flopp" into the command line launches it)

Download the phasing toolkit scripts

git clone https://github.com/OmarOakheart/Phasing-Toolkit.git

Quickstart

There are currently five scripts which all require a parameter file (benchmarkingParameters.tsv) as input. In its current state, the Phasing Toolkit is optimized to easily benchmark polyploid phasing algorithms on virtual polyploid datasets. A virtual polyploid is created by taking the sequencing reads of genetically distinct haploid or homozygous diploid organisms of the same species and merging them, thus using real reads to generate simulated polyploids. We can then run phasing algorithms on these simulated or virtual polyploids and compare the genotypes of the haploids to the predictions to calculate phasing accuracy metrics. This toolkit provides a way to perform all of these steps, from data download to accuracy metric calculation, with minimal effort.

Note: Virtual polyploids don't perfectly approximate real polyploid data and we welcome any contribution to this project for alternate ways of generating polyploid phasing benchmarking datasets.

Parameter file (benchmarkingParameters.tsv)

benchmarkingParameters.tsv

mainPath /path/to/outputFolder/
longReads ACA:ERR4352153 BMB:ERR4352154 CCN:ERR4352155 CRL:ERR4352156
longReadMethod ont
shortReads ACA:ERR1309429 BMB:ERR1308675 CCN:ERR1308732 CRL:ERR1308952
coverages 20 10
heterozygosityRates 0.05 0.1 0.5 1
reference taxID:559292 group:fungi speciesName:sCerevisiae
strainLists BMB,CCN ACA,BMB,CCN ACA,BMB,CCN,CRL
genomeSize 12500000
threads 8

The parameter file contains all of the information required by the various scripts of the phasing toolkit. It is tab-separated and contains multiple distinct lines. Line order is not important. We provide an example file, benchmarkingParameters.tsv, which can be used as a template. A detailed description of this file is available at the end of the page.

downloadData.py

Usage: python3.8 downloadData.py benchmarkingParameters.tsv

Downloads reads using sra-downloader (https://github.com/s-andrews/sradownloader). Input: SRA Accession numbers associated with strain names
Downloads and indexes a reference sequence using ncbi-genome-download (https://github.com/kblin/ncbi-genome-download). Input: taxID of the species, group according to the ncbi (fungi, viral, etc.), and species name.

processReads.py [groundTruth]

Usage: python3.8 processReads.py benchmarkingParameters.tsv groundTruth

Generates the ground truth dataset. Maps short reads and long reads, variants calls short reads

hybridGenerator.py

Usage: python3.8 hybridGenerator.py benchmarkingParameters.tsv

Generates virtual hybrids by merging the ground truth dataset reads at specified coverage levels.

processReads.py [virtualHybrids]

Usage: python3.8 processReads.py benchmarkingParameters.tsv virtualHybrids

Runs on the virtual polyploids generated by hybridGenerator.py. Maps short reads and long reads, variants calls and subsets short read VCFs, with and without indels

phaseToolRunner.py

Usage: python3.8 phaseToolRunner.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase [...]

Phases the virtual polyploids generated by hybridGenerator.py and processed by processReads.py [virtualHybrids]. Will run every phasing algorithm specified in the command line. This script also outputs performance metrics (runtime and memory usage). Currently only flopp, nphase and whatshap polyphase are implemented. Please feel free to contribute to the project by adding an implementation of another phasing algorithm you would like to see included.

accuracyCalculator.py

Usage: python3.8 accuracyCalculator.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase [...]

Calculates accuracy metrics of phasing results obtained by running phaseToolRunner.py and outputs them to a file and to stdout.

Example script

With the example benchmarkingParameters.tsv file, you can run a full benchmarking pipeline using 4 strains to generate a virtual 2n, 3n and 4n, at coverage levels 10X/haplotype and 20X/haplotype, and randomly subset their variants to obtain VCF files at 1%, 0.5%, 0.1% and 0.05% heterozygosity, with and without indels. In total it generates 3 ploidies at 2 coverage levels and 4 heterozygosity levels each, with and without indels, for a total of 3*2*4*2=48 virtual polyploids which are all phased by flopp, nphase and whatshap polyphase and their results analyzed in comparison with the known ground truth.

Please note that this can take a few days and over 100GB of hard drive space to run fully. You can modify the benchmarkingParameters.tsv file to reduce the number of tests to perform to test run the phasing toolkit faster and more efficiently. We provide a very complete benchmarking test here as an example to show the range of available options.

python3.8 downloadData.py benchmarkingParameters.tsv
python3.8 processReads.py benchmarkingParameters.tsv groundTruth
python3.8 hybridGenerator.py benchmarkingParameters.tsv
python3.8 processReads.py benchmarkingParameters.tsv virtualHybrids
python3.8 phaseToolRunner.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase
python3.8 accuracyCalculator.py benchmarkingParameters.tsv flopp nphase whatshap-polyphase

Parameter file in detail

benchmarkingParameters.tsv

This file contains all of the information needed to share a full benchmark with anyone else. Here we describe it line by line:

Output folder which will be used to store all of the data we're going to download and generate

mainPath /path/to/outputFolder/

Names of the strains we will phase and the accessions of their long reads, format is strainName:SRA_Accession

longReads ACA:ERR4352153 BMB:ERR4352154 CCN:ERR4352155 CRL:ERR4352156

Define the type of long read (ont or pacbio)

longReadMethod ont

Names of the strains we will phase and the accessions of their long reads, format is strainName:SRA_Accession. The names must be the same as for the long reads.

shortReads ACA:ERR1309429 BMB:ERR1308675 CCN:ERR1308732 CRL:ERR1308952

Coverage levels to test (10 means 10X per haplotype, for a diploid that means a total of 20X, for a tetraploid it means a total of 40X)

coverages 20 10

Heterozygosity levels to test (0.5 means 0.5%, i.e. on average 1 heterosygous SNP every 2000 bp)

heterozygosityRates 0.05 0.1 0.5 1

Reference information, must include the taxID of the species according to the ncbi, the ground name according to the ncbi, and a name for the reference sequence (speciesName). More information can be obtained from https://github.com/kblin/ncbi-genome-download.

reference taxID:559292 group:fungi speciesName:sCerevisiae

Define the virtual polyploids to generate. ACA,BMB,CCN means it should generate a virtual triploid with reads from the strains ACA, BMB and CCN.

strainLists BMB,CCN ACA,BMB,CCN ACA,BMB,CCN,CRL

Reference genome size, needed to calculate coverage

genomeSize 12500000

Number of threads to use when applicable

threads 8

How to contribute

If you're interested in helping with the development of this toolkit, please feel free to create issues on this page or email me at omaroakheart@gmail.com to discuss ways in which we can improve this project and make it more flexible and useful for the community. I'm very interested in extending this to (in no particular order):

Improve the flexibility of tools provided to make the phasing toolkit more versatile (for example reusing the VCF subsetting method to make it easy for anyone to subset their own VCF)
Include more default benchmarking datasets which push the limits of phasing methods in different ways (highly similar haplotypes, aneuploidy, LOH tracts, SVs...)
Develop tools which analyze SVs and phasing results in order to generate accurate haplotig sequences from a reference sequence
Develop tools to take phasing results and generate high-quality de novo phased sequences
Host a simple website which displays the results of different tools on different datasets
Test how different parameters influence accuracy results
Provide more insight into which variants are well-phased
Investigate the effect of read and mapping quality on results
Provide tools to analyze and visualize the results of phased data
Discuss and develop accuracy metrics
Discuss and develop the ideal input and output formats of polyploid phasing algorithms