Parascopy is designed for robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing.
Created by Timofey Prodanov timofey.prodanov[at]hhu.de
and Vikas Bansal vibansal[at]ucsd.edu
at the University of California San Diego.
- Citing Parascopy
- Installation
- General usage
- Visualizing output
- Output files
- Precomputed data
- Test dataset
- Frequently asked questions
- Issues
- See also
If you use Parascopy, please cite:
- Prodanov, T. & Bansal, V. Robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing. Nature Communications 13, 3221 (2022). https://doi.org/10.1038/s41467-022-30930-3
- Prodanov, T. & Bansal, V. A multi-locus approach for accurate variant calling in low-copy repeats using whole-genome sequencing. Bioinformatics 39, i279-i287 (2023). https://doi.org/10.1093/bioinformatics/btad268
Parascopy is written in Python (≥ v3.6) and is available on Linux and macOS.
You can install Parascopy using conda
:
conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c bioconda parascopy
In most cases, it takes 2-6 minutes to install Parascopy using conda
.
If you have problems with installing Parascopy using conda
, it is possible to create a new conda
environment
(see more details here):
conda create --name paras_env parascopy
conda activate paras_env
Alternatively, you can install it manually using the following commands:
git clone https://github.com/tprodanov/parascopy.git
cd parascopy
python3 setup.py install
Parascopy depends on several Python modules (see here). To install Parascopy without dependencies, you can run
python3 setup.py develop --no-deps
Additionally, you can specify installation path using --prefix <path>
.
In case of manual installation, Parascopy requires
In order to run variant calling (parascopy call
), Parascopy requires a modified Freebayes executable.
It is installed automatically via conda, but for manual installation you need to run
# Within the main parascopy directory.
git clone --recursive https://github.com/tprodanov/freebayes
cd freebayes
./compile.sh
cd ..
See parascopy help
or parascopy <command> --help
for more information.
Additionally, you can find a test dataset and pipeline here.
Parascopy uses a database of duplications - homology table. To construct a homology table you would need to run:
parascopy pretable -f genome.fa -o pretable.bed.gz
parascopy table -i pretable.bed.gz -f genome.fa -o table.bed.gz
Note, that the reference genome should be indexed with both samtools faidx
and bwa index
.
Alternatively, you can download a precomputed homology table.
In order to find aggregate and paralog-specific copy number profiles (agCN and psCN), you can run the following commands:
# Calculate background read depth.
parascopy depth -I input.list -g hg38 -f genome.fa -o depth
# Estimate agCN and psCN for multiple samples.
parascopy cn -I input.list -t table.bed.gz -f genome.fa -R regions.bed -d depth -o analysis1
# Estimate agCN and psCN for single/multiple samples using model parameters from a previous run.
parascopy depth -I input2.list -g hg38 -f genome.fa -o depth2
parascopy cn-using analysis1/model -I input2.list -t table.bed.gz -f genome.fa -d depth2 -o analysis2
Parascopy variant calling in duplicated regions is available since version v1.9
.
Currently, variant calling is performed on top of an existing copy number analysis.
To run variant calling you can execute the following command:
parascopy call -p analysis1 -f genome.fa -t table.bed.gz
Input alignment files should be sorted and indexed, both .bam
and .cram
formats are supported.
Input filenames can be passed into Parascopy as -i in1.bam in2.bam ...
or as -I input-list.txt
,
where input-list.txt
is a text file with a single filename on each line.
Additionally, you can provide/override sample names with
-i in1.bam::sampleA in2.bam::sampleB
or, in case of -I input-list.txt
, as a second entry on each line:
in1.bam sampleA
in2.bam sampleB
After agCN estimation, parascopy searches for reliable PSVs.
However, only samples that are consistent with the reference copy number are used (in a two-copy duplication,
only samples with agCN = 4 will be used).
Additionally, by default, sex chromosomes X and Y are treated as regular chromosomes.
To solve both issues, one can use --modify-ref
argument and provide updated reference copy numbers for some samples
and some regions.
The BED file can contain the following lines:
#CHROM START END SAMPLES CN # This line is not necessary.
chr1 10000 20000 * 0 # It is known that the region is missing from all samples.
chrX 0 inf SAMPLE1,SAMPLE2,SAMPLE3 1 # Male samples have one chrX and one chrY.
chrY 0 inf SAMPLE1,SAMPLE2,SAMPLE3 1
chrY 0 inf SAMPLE4,SAMPLE5,SAMPLE6 0 # Female samples are missing chrY.
It is possible to visualize agCN and psCN detection process.
To do that you need to clone this repository and run scripts in draw
directory.
The scripts are written in R
language and require a number of R
packages:
install.packages(c('argparse', 'tidyverse', 'ggplot2', 'ComplexHeatmap', 'viridis', 'circlize', 'ggthemes', 'RColorBrewer'))
Copy number variation analysis (parascopy cn|cn-using
) produces a range of files, described here.
Variant calling produces regular VCF
files, additional information can be found here.
You can use the following precomputed data:
- Precomputed homology tables: hg19 (v1.2, 25 Mb) and hg38 (v1.7, 41 Mb).
- Precomputed model parameters for five continental populations: hg38 (v1.7, 11 Mb). Model parameters were calculated using 2504 samples from the 1000 genomes project (661 AFR, 503 EUR, 504 EAS, 489 SAS, 347 AMR samples).
- Legacy files (hg38) can be downloaded here: homology table v1.2 (40 Mb) and model parameters v1.2 (11 Mb).
Compatible reference genomes (without the ALT contigs) can be dowloaded from
Note that if the BAM/CRAM files contain ALT contigs, you should provide modified reference copy number values
using --modify-ref
argument, where all ALT contigs have the reference copy number 0.
Additionally, you should use homology table constructed on a reference file with ALT contigs.
You can find the full test pipeline here. Alternatively, you can follow step-by-step instructions below:
First, place reference human genome (hg38), precomputed homology tables and model parameters in data
directory.
For single-sample analysis, we subsampled HG00113 human genome, which can downloaded
here (360 Mb).
Please, index the cram file using samtools index HG00113.cram
.
We start by calculating background read depth, which should take 2-5 minutes:
parascopy depth -i HG00113.cram -f data/hg38.fa -g hg38 -o depth_HG00113
Next, we calculate copy number profiles for 167 duplicated loci using precomputed model parameters
obtained using 503 European samples.
This step takes 10-40 minutes depending on the number of threads (controlled by -@ N
).
parascopy cn-using data/models_v1.2.5/EUR \
-i HG00113.cram -f data/hg38.fa -t data/homology_table/hg38.bed.gz \
-d depth_HG00113 -o parascopy_HG00113
You can analyze a subset of loci by specifying data/models_v1.2.5/EUR/<locus>.gz
,
for example analysis of the SMN1 locus should take less than a minute.
For multi-sample analysis, we extracted reads aligned to the GBA/GBAP1 locus for 503 European samples,
can be downloaded here (195 Mb)
(extract using tar xf GBA.tar.gz
).
Background read depth is already calculated and is located in GBA/1kgp_depth.csv.gz
.
Calculating copy number profiles for the GBA/GBAP1 locus should take around 25-30 minutes using a single core.
parascopy cn -I GBA/input.list -f data/hg38.fa \
-t data/homology_table/hg38.bed.gz -d GBA/1kgp_depth.csv.gz \
-r chr1:155231479-155244699::GBA -o parascopy_GBA
Here, the region is supplied using the -r chr:start-end[::name]
format,
alternatively you can supply regions in a BED file using the -R
argument (optional: fourth column with region names).
Sample output can be found here (251 Mb), output files description can be found here.
You can find FAQ here.
Please submit issues here or send them to timofey.prodanov[at]gmail.com
.
Additionally, you may be interested in these tools: