/princess

Primary LanguagePythonMIT LicenseMIT

Princess

GitHub GitHub last commit

Princess is a fast and scalable framework to detect and report haplotype resolved Single Nucleotide Variants (SNV) and Structural Variations (SVs) at scale. It can leverage your cluster environment to speed up the detection which starts with one or many fasta or fastq files. Publication: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02486-w

princess

Princess

  • Mapping: NGMLR or Minimap2
  • SNVs: Clair (successor of Clairvoyante)
  • SVs: Sniffles
  • Phasing SNVs: WhatsHap
  • Phasing SVs: PRINCESS-subtool
  • Extend Phasing: PRINCESS-subtool
  • Phased Methylation: Nanopolish + PRINCESS-subtool
  • QC Statistics for each step

Installation

Princess was tested on CentOS release 6.7, Conda version 4.7.12 is installed: for more information about Installing Conda press here To download same Conda version here*

  1. After conda is installed. Snakemake should be installed and yaml
conda install snakemake=5.7.1
conda install pyyaml
  1. Downloading PRINCESS
git clone https://github.com/MeHelmy/princess.git
  1. Install Clair, Training models, pypy, and intervaltree
cd princess
chmod +x install.sh
./install.sh

Tutorial

To have an overview about princess write command princess -h. You will have the following list of commands that we can use in princess.

usage: princess [-h] {all,align,sv,snv,variant,phase,overview} ...

Princess A framework for long-reads analysis.

optional arguments:
  -h, --help            show this help message and exit

Sub-commands:
  Valid sub-commands

  {all,align,sv,snv,variant,phase,overview}
    all                 This command will run the following: Align the reads.
                        Identify SVs Identify SNVs Phase both SNVs and SVs
    align               This command will use the input sequence files and
                        align them against the reference using either Minimap2
                        or NGMLR use -a to choose aligner otherwise Minimap2
                        will be used by default.
    sv                  This command will use bam file to identify SV using
                        Sniffles.
    snv                 This command will use bam file to identify SNVs usin
                        Clair.
    variant             This command will use bam file to identify SVs and
                        SNVs.
    phase               This command will use use reads to identify SNVs by
                        Clair and Phase them.
    overview            This command will show what steps will run.

princess version 0.01. use command -h for info.

Assume that we want only to run snv command, to know more about its option:

princess snv -h

usage: princess snv [-h] [-v] -d Working directory -r {ont,clr,ccs} [-l] [-u]
                    [-e] [-a {minimap,ngmlr}]
                    [-s sampleFiles [sampleFiles ...]] -f REF [-j JOBS]
                    [-g LOG_FILE] [-c CHRS [CHRS ...]] [-t]

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d Working directory, --directory Working directory
                        Working directory.
  -r {ont,clr,ccs}, --ReadType {ont,clr,ccs}
                        Read techonlogy
  -l, --removeFiles     remove princess source script after running default:
                        False)
  -u, --UseConda        Use conda for running default: True)
  -e, --Cluster         Use cluster while runing default: True)
  -a {minimap,ngmlr}, --Aligner {minimap,ngmlr}
                        In case if you want to choose specific aligner
                        otherwise default will be used default: minimap)
  -s sampleFiles [sampleFiles ...], --sampleFiles sampleFiles [sampleFiles ...]
                        list of fatsa, fastq, or gz files.
  -f REF, --ref REF     The reference file will be used to align reads to.
  -j JOBS, --jobs JOBS  Number of running jobs default: 200 )
  -g LOG_FILE, --log LOG_FILE
                        Log file: PrincessLog.txt )
  -c CHRS [CHRS ...], --chr CHRS [CHRS ...]
                        Chromosomes list, if not specified we will use all
                        Chromosomes.
  -t, --filter          Filter identified SNVs using Princess algorithm
                        default: True)
princess all  -d ./princess_all -r ont -s reads.split00.fastq.gz reads.split01.fastq.gz  -f hs37d5_mainchr.fa

-r defines the reads type.
-s samples that we would like to analyze.
-f full path to the reference.

Note
I am assuming that the reference file is indexed, if not please use the following command.
samtools faidx hs37d5_mainchr.fa as a result you will have hs37d5_mainchr.fa.fai.

Done!!

For methylation calling.

Methylation calling is a part from the all option.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d Working directory, --directory Working directory
                        Working directory.
  -r {ont,clr,ccs}, --ReadType {ont,clr,ccs}
                        Read techonlogy
  -l, --removeFiles     remove princess source script after running default: False)
  -u, --UseConda        Use conda for running default: True)
  -e, --Cluster         Use cluster while runing default: True)
  -a {minimap,ngmlr}, --Aligner {minimap,ngmlr}
                        In case if you want to choose specific aligner otherwise default will be used default: minimap)
  -s sampleFiles [sampleFiles ...], --sampleFiles sampleFiles [sampleFiles ...]
                        list of fatsa, fastq, or gz files.
  -f REF, --ref REF     The reference file will be used to align reads to.
  -j JOBS, --jobs JOBS  Number of running jobs default: 200 )
  -g LOG_FILE, --log LOG_FILE
                        Log file: PrincessLog.txt )
  -c CHRS [CHRS ...], --chr CHRS [CHRS ...]
                        Chromosomes list, if not specified we will use all Chromosomes.
  -t, --filter          Filter identified SNVs using Princess algorithm default: True)
  -m, --methylation     Identify methylation, mutually inclusive with -md default: False)
  -md Fast5 Directory, --methylationDirectory Fast5 Directory
                        Fast5 directory will be used to identify methylation mutually inclusive with option -m default: False)

By choosing the flag --methylation, Princess will call the methylation on the input data (ONT data), this option is inclusive with the option --methylationDirectory which requires the fasta5 directory.

Test case

We uploaded a HiFi compressed data file from the publically available HG002 data set. The complete data set (High-fidelity 15kb long-read dataset of HG002, Ashkenazim Son.) is available Here

To download the test data run the following command:

wget https://bcm.box.com/shared/static/sdml5d7csxprgu3cl5cve0lgv5jnrrlv --output-document  HiFi.fastq.gz

After download is finished you shall have a HiFi fastq file called HiFi.fastq.gz, to run the analysis test run the following command:

Full/Path/To/princess all  --directory $PWD/analysis --ReadType ccs --ref Path/To/Reference/genome.fa --jobs 7 --sampleFiles $PWD/HiFi.fastq.gz  --latency-wait 200 -p

all: The command to run full analysis for other options please run princess -h
---directory: The out put directory it could be any name, use the full path, in my case the output is same place.
--ReadType: Read type, the supported read types are clr, ccs, and ont.
--ref: Path to the reference please use samtools faidx with refernce before running Princess.
--jobs: Number of running jobs on cluster.
--sampleFiles: Sample fastq file we downloaded, it could be more than one either compressed or not.
--latency-wait 200 -p: These are additional Snakemake option to wait 200 seconds before collecting output.

Output

Princess will create these directories:

  • align contains directory [minimap or ngmlr] based on the aligner that was specified.
  • sv contains the structural variant file sv/minimap/sniffles.vcf
  • snp contains single nucleotide variant calls per chromosomes
  • phased contains phased variant
  • stat contains Statistics
  • meth contains methylation info (if user choose to run methylation)

Collect benchmark Statistics

cd benchmark  # There is a directory benchmark contains all the analyses that were done by PRINCESS
find "$PWD" -type f | grep -v "myBenchMark.txt" > myBenchMark.txt
while read -r line; do n=$(echo  $line | awk  -v FS=/ '{print $(NF-1)"-"$(NF)}');  awk -v f=$line -v o=$n 'NR!=1 {print o"\t"$(NF)}' $line  ;done < myBenchMark.txt