SMN1, the gene that causes spinal muscular atrophy, is considered a 'dark' region of the genome due to high sequence similarity with its paralog SMN2. Paraphase is a Python tool that takes HiFi BAMs as input (whole-genome or enrichment), phases complete SMN1 and SMN2 haplotypes, determines copy numbers and makes phased variant calls for both genes. It also categorizes the haplotypes, enabling future haplotype-based screening of silent carriers (2+0). Please check out our paper for more details about the method and our population-wide haplotype analysis.
Chen X, Harting J, Farrow E, et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. The American Journal of Human Genetics. 2023;0(0). doi:10.1016/j.ajhg.2023.01.001
For whole-genome sequencing (WGS) data, we recommend >20X, ideally 30X, genome coverage. Low coverage or short read length could result in less accurate phasing, especially when haplotypes are highly similar to each other in Exons 1-6. For hybrid capture-based enrichment data, a higher read depth (>50X) is recommended because reads are generally shorter than in WGS data.
Currently Paraphase only works on GRCh38. Support for GRCh37 will be added in the future.
If you need assistance or have suggestions, please don't hesitate to reach out by email or open a GitHub issue.
Xiao Chen: xchen@pacificbiosciences.com
Paraphase can be installed through pip or conda:
pip install paraphase
# or
conda install -c conda-forge -c bioconda paraphase
Alternatively, Paraphase can be installed from GitHub.
git clone https://github.com/PacificBiosciences/paraphase
cd paraphase
python setup.py install
paraphase -b input.bam -o output_directory
Alternatively, when you have a list of BAM files:
paraphase -l list.txt -o output_directory
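For batch runs, the list file can be generated rather than written by hand. The following is a minimal sketch, not part of Paraphase itself: `write_bam_list` and `build_command` are hypothetical helper names, and the sketch assumes that all BAMs sit in one directory with a `.bam` extension. It only uses the `-l`, `-o`, `-t` and `-v` flags documented in this README.

```python
from pathlib import Path


def write_bam_list(bam_dir, list_path):
    """Write one BAM path per line, the layout described for the -l option."""
    bams = sorted(str(p) for p in Path(bam_dir).glob("*.bam"))
    Path(list_path).write_text("".join(f"{b}\n" for b in bams))
    return bams


def build_command(list_path, out_dir, threads=None, vcf=False):
    """Assemble a paraphase argument list from flags documented in this README."""
    cmd = ["paraphase", "-l", str(list_path), "-o", str(out_dir)]
    if threads is not None:
        cmd += ["-t", str(threads)]  # threads are used when -l is specified
    if vcf:
        cmd.append("-v")  # request per-haplotype VCFs
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once Paraphase is installed and on the `PATH`.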
Required parameters:
- `-b`: Input BAM file, or `-l`: List of BAM files (one per line)
- `-o`: Output directory

Optional parameters:
- `-v`: If specified, Paraphase will produce VCFs for each haplotype.
- `-c`: Config file. The default config file is `paraphase/data/smn1/config.yaml`.
- `-t`: Number of threads, used when `-l` is specified.
- `-d`: File listing average genome depth per sample, with two columns, sample ID and depth values, separated by tab or space. This saves run time by skipping the step to calculate genome depth.
- `--samtools`
- `--minimap2`

The paths to samtools and minimap2 can be provided through the `--samtools` and `--minimap2` parameters or by modifying the `tools` section of the config file.
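If average genome depths are already known from an upstream pipeline, the two-column file for `-d` can be written with a few lines of Python. This is a sketch under stated assumptions: `write_depth_file` is a hypothetical helper, the file is written tab-separated without a header line (the README does not mention one), and the default threshold of 20 mirrors the WGS coverage recommendation above (use 50 for enrichment data).

```python
def write_depth_file(depths, path, min_depth=20.0):
    """Write 'sample_id<TAB>depth' lines and return samples at or below min_depth.

    depths: dict mapping sample ID to average genome depth.
    min_depth: coverage is recommended to be above this value.
    """
    low_coverage = []
    with open(path, "w") as fh:
        for sample_id, depth in sorted(depths.items()):
            fh.write(f"{sample_id}\t{depth}\n")
            if depth <= min_depth:
                low_coverage.append(sample_id)
    return low_coverage
```

Samples returned in the list fall short of the recommended coverage, so their phasing results may deserve a closer look.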
Note that currently only GRCh38 is supported. We will support GRCh37 in the future if there is demand.
Paraphase produces a few output files in the directory specified by `-o`, with the sample ID as the prefix.
- `_realigned_tagged.bam`: This BAM file can be loaded into IGV for visualization of haplotypes; see haplotype visualization.
- If `-v` is specified, Paraphase will generate VCF files. A VCF file is written for each haplotype, and there is also a `_variants.vcf` file containing merged variants from all haplotypes.
- `.json`: Main output file, which summarizes haplotypes and variant calls for each sample. Details of the fields are explained below:
  - `smn1_cn`: copy number of SMN1. A `null` call indicates that Paraphase finds only one haplotype but depth does not unambiguously support a copy number of one or two.
  - `smn2_cn`: copy number of SMN2. A `null` call indicates that Paraphase finds only one haplotype but depth does not unambiguously support a copy number of one or two.
  - `smn2_del78_cn`: copy number of SMN2Δ7–8 (SMN2 with a deletion of Exon7-8)
  - `smn1_read_number`: number of reads containing c.840C
  - `smn2_read_number`: number of reads containing c.840T
  - `smn2_del78_read_number`: number of reads containing the known deletion of Exon7-8 on SMN2
  - `smn1_haplotypes`: phased SMN1 haplotypes
  - `smn2_haplotypes`: phased SMN2 haplotypes
  - `smn2_del78_haplotypes`: phased SMN2Δ7–8 haplotypes
  - `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. This happens in a small number of cases when two haplotypes are identical and read depth indicates that there are two copies instead of one.
  - `haplotype_details`: lists information about each haplotype
    - `variants`: The variants contained in the haplotype, excluding those in homopolymer regions. For a complete set of variant calls, please use the `-v` option.
    - `boundary`: The boundary of the region that is resolved on the haplotype. This is useful when a haplotype is only partially phased.
    - `haplogroup`: The haplogroup that the haplotype is assigned to
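The JSON fields above can be read with the Python standard library. The sketch below is illustrative only: `load_result` and `summarize` are hypothetical helpers, and it assumes that the fields sit at the top level of the JSON and that `haplotype_details` maps each haplotype name to an object holding `haplogroup` and `boundary`. Check these assumptions against an actual output file before relying on them.

```python
import json


def load_result(path):
    """Load one Paraphase JSON output file into a dict."""
    with open(path) as fh:
        return json.load(fh)


def summarize(result):
    """Return summary lines for copy numbers and per-haplotype details."""
    def show(value):
        # A JSON null (None in Python) means no unambiguous copy number call.
        return "no call" if value is None else str(value)

    lines = [
        f"{key}: {show(result.get(key))}"
        for key in ("smn1_cn", "smn2_cn", "smn2_del78_cn")
    ]
    # Assumed layout: {haplotype_name: {"haplogroup": ..., "boundary": ...}}
    details = result.get("haplotype_details") or {}
    for name, info in sorted(details.items()):
        lines.append(
            f"{name}: haplogroup={info.get('haplogroup')}, "
            f"boundary={info.get('boundary')}"
        )
    return lines
```

For example, `print("\n".join(summarize(load_result(path))))` prints one line per copy number field and one line per haplotype, where `path` points to a sample's `.json` output.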