CuddlySV is a fork of cuteSV with main added feature of somatic structural variant calling.
Long-read sequencing enables the comprehensive discovery of structural variations (SVs). However, it is still non-trivial to achieve high sensitivity and performance simultaneously due to the complex SV characteristics implied by noisy long reads. cuteSV, a sensitive, fast and scalable long-read-based SV detection approach, uses tailored methods to collect the signatures of various types of SVs and employs a clustering-and-refinement method to analyze the signatures to implement sensitive SV detection. Benchmarks on real Pacific Biosciences (PacBio) and Oxford Nanopore Technology (ONT) datasets demonstrate that cuteSV has better yields and scalability than state-of-the-art tools.
pip install git+http://github.com/kpalin/cuddlySV.git@v3.0.0
Note the '.' at the end.
For whole genome ONT data one can call germline variants with command like:
cuddlySV align/My_6217T1_19_1456.phased.raw.cram chm13v2.0_maskedY_rCRS.fa \
sv/My_6217T1_19_1456.phased.raw.cuddlysv.vcf \
sv/cuddlysv_wrk_My_6217T1_19_1456.phased.raw \
--sample My_6217T1_19_1456 \
--threads 10 --min_support 3 \
--genotype --max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 \
--min_size 10 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3 \
--report_readid --max_size -1 --retain_work_dir
For tumor samples user should make sure using --max_size -1
to get also large variants, e.g. whole chromosome arm events. For somatic calling, one needs to --retain_work_dir
.
The somatic calling with cuddlySV proceeds in multiple steps
- Single sample calling for each tumor and normal, retaining the work directory
- Creating panel of normals by merging the normal sample SV signatures
- Merging the normal and tumor signatures
- SV calling proper
- Filtering for somatic variants
For somatic calling we need work directories from prior single sample runs for either matched normals or pool of normals if the matching sample is not available. The single sample runs are performed as described above in "Germline/single sample variant calling".
For somatic calling, the input cram/bam files must have readgroup information attached to the reads and must contain exactly one readgroup!
For making a pool of (non matching) normals one needs a (tab separated) list of retained cuddlySV workdirs and output vcf files from the normal samples. When you have the file list_of_normals.tsv
, containing the information, run command
make_panel_of_normals_cleaner.sh -o panel_of_normals/ list_of_normals.tsv
This will include the signatures of all PASS:ing variants in any of the files to the panel_of_normals/
directory.
The tumor and at least one normal sample work directory must be merged with command
merge_work_dirs.sh -o merged_wrk_dir/ -t sv/cuddlysv_wrk_My_6217T1_19_1456.phased.raw -n panel_of_normals/
This will merge all normal samples listed after -n
option with the tumor given in -t
. Multiple normals (e.g. panel of normals and a matching normal) can be included also at this point with difference to make_panel_of_normals_cleaner.sh
being lack of filtering by PASS
variants.
We can use the merged work directory like the directory generated by single sample calling, e.g. with
cuddlySV align/My_6217T1_19_1456.phased.raw.cram chm13v2.0_maskedY_rCRS.fa \
sv/joint.My_6217T1_19_1456.phased.raw.cuddlysv.vcf \
merged_wrk_dir/ \
--sample My_6217T1_19_1456 \
--threads 10 --min_support 3 \
--genotype --max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 \
--min_size 10 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3 \
--report_readid --max_size -1 --retain_work_dir --report_readgroup
This command produces a temporary file sv/joint.My_6217T1_19_1456.phased.raw.cuddlysv.vcf
which includes mixture of SV calls from normals and tumor alike and which needs to be further filtered. Notable from above command is --report_readgroup
command needed to track the source samples for each called SV.
Finally we can distinguish the somatic and germline variants.
python add_mapping_tags.py -v sv/joint.My_6217T1_19_1456.phased.raw.cuddlysv.vcf \
-a align/My_6217T1_19_1456.phased.raw.cram \
-o sv/joint.My_6217T1_19_1456.phased.raw.cuddlysv.somatic.vcf
This command will filter out variants that are not present in the tumor, add SOMATIC
tag for variants present only in the tumor, NORMAL_RE
for number of reads (signatures) in the normals supporting this variant (only for germline), DP
, MQ
and MQ0
variants characterizing the region of the variant. It will also recalculated the variant quality add MapQ0
filter for variants with more than 10% of overlapping reads having Mapping Quality zero. The variant quality is recalculated assuming poisson sampling of 5% of variant reads.
There is a Snakefile for pipelineing somatic variant calling in https://github.com/kpalin/cuddlySV/tree/somatic
. The configuration file is in [share/cuddlySV/cuddly_somatic.json](src/somatic/cuddly_somatic.json)
and the Snakefile in [share/cuddlySV/somaticSV.smk](src/somatic/somaticSV.smk)
To run the pipeline, you should first create a configuration file for each tumor like
{
"sample_name": "My_5006T1_19_1460",
"tumor_alignment": "/mnt/cgnano/projects/promethion/ont_pipe_remora/My_5006T1_19_1460/align/My_5006T1_19_1460.phased.raw.cram",
"reference_fasta": "/mnt/cg8/reference-genomes/chm13v2.0_maskedY_rCRS/chm13v2.0_maskedY_rCRS.fa",
"tumor_work_dir": "/mnt/cgnano/projects/promethion/ont_pipe_remora/My_5006T1_19_1460/sv/cutesv_wrk_My_5006T1_19_1460.phased.raw/",
"normal_work_dir": [
"/home/kpalin/software/ONT_hg_somatic/cuteSV/src/somatic/panel_of_normals_105_most/",
"/mnt/cgnano/projects/promethion/ont_pipe_remora/My_5006N_19_1461/sv/cutesv_wrk_My_5006N_19_1461.phased.raw/"
]
}
Now the pipeline producing somatic calls should be run with command below (having correct path for snakefile and the configfile)
snakemake --snakefile share/cuddlySV/somaticSV.smk --configfile cuddly_somatic.json -c 20
For running cuddlySV, you need
- python3
- pysam
- cigar
- numpy
- pyfastx
For somatic calling, in addition
- snakemake
- scipy
- bcftools
- htslib
cuddlySV <sorted.bam> <reference.fa> <output.vcf> <work_dir>
Suggestions
For PacBio CLR data: --max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 200 --diff_ratio_merging_DEL 0.5
For PacBio CCS(HIFI) data: --max_cluster_bias_INS 1000 --diff_ratio_merging_INS 0.9 --max_cluster_bias_DEL 1000 --diff_ratio_merging_DEL 0.5
For ONT data: --max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3 For force calling: --min_mapq 10
Parameter | Description | Default |
---|---|---|
--threads | Number of threads to use. | 16 |
--batches | Batch of genome segmentation interval. | 10,000,000 |
--sample | Sample name/id | NULL |
--retain_work_dir | Enable to retain temporary folder and files. | False |
--report_readid | Enable to report supporting read ids for each SV. | False |
--max_split_parts | Maximum number of split segments a read may be aligned before it is ignored. All split segments are considered when using -1. (Recommand -1 when applying assembly-based alignment.) | 7 |
--min_mapq | Minimum mapping quality value of alignment to be taken into account. | 20 |
--min_read_len | Ignores reads that only report alignments with not longer than bp. | 500 |
--merge_del_threshold | Maximum distance of deletion signals to be merged. | 0 |
--merge_ins_threshold | Maximum distance of insertion signals to be merged. | 100 |
--min_support | Minimum number of reads that support a SV to be reported. | 10 |
--min_size | Minimum length of SV to be reported. | 30 |
--max_size | Maximum size of SV to be reported. Full length SVs are reported when using -1. | 100000 |
--genotype | Enable to generate genotypes. | False |
--gt_round | Maximum round of iteration for alignments searching if perform genotyping. | 500 |
-Ivcf | Optional given vcf file. Enable to perform force calling. | NULL |
--max_cluster_bias_INS | Maximum distance to cluster read together for insertion. | 100 |
--diff_ratio_merging_INS | Do not merge breakpoints with basepair identity more than the ratio of default for insertion. | 0.3 |
--max_cluster_bias_DEL | Maximum distance to cluster read together for deletion. | 200 |
--diff_ratio_merging_DEL | Do not merge breakpoints with basepair identity more than the ratio of default for deletion. | 0.5 |
--max_cluster_bias_INV | Maximum distance to cluster read together for inversion. | 500 |
--max_cluster_bias_DUP | Maximum distance to cluster read together for duplication. | 500 |
--max_cluster_bias_TRA | Maximum distance to cluster read together for translocation. | 50 |
--diff_ratio_filtering_TRA | Filter breakpoints with basepair identity less than the ratio of default for translocation. | 0.6 |
--remain_reads_ratio | The ratio of reads remained in cluster to generate the breakpoint. Set lower to get more precise breakpoint when the alignment data have high quality but recommand over 0.5. | 1 |
-include_bed | Optional given bed file. Only detect SVs in regions in the BED file. | NULL |
--report_readgroup | Append readgroup id to reported read names. Necessary for downstream somatic calling. | False |
cuddlySV (v3.0.0)
This is first version of cuddlySV, a significant update of cuteSV. There are several changes resulting in output different from cuteSV:
- Skip supplementary alignments mostly overlapping primary in opposite strand. In ONT data these are likely two strands of the same molecule sequenced in same read.
- Continue work from previously retained workdir.
- Option to add readgroup from alignment file to the read names.
- Reported position is median (not mean) of the read signal positions
- CIPOS and CILEN report the minimum-maximum span around POS and SVLEN (not 95% Gaussian confidence interval.)
- Post processing scripts for somatic calling using either matched normal or panel of normals control.
cuteSV (v2.0.3):
- Fix the error of missing min_size parameter.
- Fix the missing signatures in duplication clustering.
cuteSV (v2.0.2):
- Fix several errors in signature extraction.
- Filter low quality reads in the statistics of reference reads.
- Modify the rule of merging signatures on the same read.
- Modify the cluster rule of insertions and deletions in force calling.
cuteSV (v2.0.1):
- Fix an error in handling strand in force calling.
- Speed up the genotype module of discovery calling. The comparison results on various datasets are as follows. | | cuteSV | cuteSV2 | | |(previous)| (latest) | | CCS | 900.37s | 261.77s | | CLR | 3620.00s | 2644.94s | | ONT | 2893.08s | 1264.26s |
cuteSV (v2.0.0):
- Upgrate force calling module.
- Add --remain_reads_ratio parameter in order to generate highly accurate record by discarding a few signatures.
- Fix several bugs in inversion and translocation calling.
- Remove the redundant operations in the signature extraction and accelerate the whole analysis.
- Streamline the translocation output when performing force-calling.
- Modify the signature matching rule.
- Modify the sequence of the inserted allele.
cuteSV (v1.0.13):
- Modify the breakpoints of alternative allele and reference allele.
- Fix an initialization error that will reproduce wrong diploid-assembly-based SV call.
cuteSV (v1.0.12):
- Add Allele frequency (AF) info in the outputs.
- Fix an index error when force calling BND variants.
- Modify the parameter of --max_size and enable to report full length of SVs.
cuteSV (v1.0.11):
- Add a script for post-processing typically cuteSV callsets from assembly-based alignments to generate the diploid-assembly-based SV calls.
- Give a wiki page for helping uses to achieve assembly-based SV calling.
- Improve acquirement of inserted sequence in a read whose primary alignment contains hardclips.
- Improve the performance of force calling.
- Enable cuteSV to output allele sequences when performing force calling with the VCF generated from other callers.
- Fix bugs to avoid the error raised by abnormal SV type.
- Update the sort commands used in cuteSV.
- Update the parameter of --max_split_parts.
cuteSV (v1.0.10):
- Fix a bug leading to calculate wrong TRA positions.
- Add a file format conversion script that enable to transfer the vcf file to bedpe file.
- Involve several clustering-and-refinement strategies in force calling function.
- Assessed the performance of force calling with Giab HG002 sample datasets (including CLR, CCS, and ONT platforms).
cuteSV (v1.0.9):
- Change 0-based pos into 1-based pos in DUP in order to support bcftools conversion.
- Correct REF and ALT fields. Adjust END value of INS to make it equal to the value of POS.
- Improve the description of errors.
- Add usegalaxy.eu badge.
- Remove CHR2 and the corresponding END position on the BND call.
- Skip generating empty signature file and rewrite the job schedule.
- Add force calling function and enable cuteSV to perform population-based SV calling.
- Fix several minor bugs.
cuteSV (v1.0.8):
- Rewirte the function of ins/del signatures clustering.
- Update the recommandation parameters for different sequencing datasets.
- Replace
/ with its variant allele sequence, which needs the reference genome sequence as input. - Fix several bugs.
cuteSV (v1.0.7):
- Add read name list for each SV call.
- Fix several descriptions in VCF header field.
cuteSV (v1.0.6): 1.Improvement of genotyping by calculation of likelihood. 2.Add variant quality value, phred-scaled genotype likelihood and genotype quality in order to filter false positive SV or quality control. 3.Add --gt_round parameter to control the number of read scans. 4.Add variant strand of DEL/DUP/INV. 5.Fix several bugs.
cuteSV (v1.0.5): 1.Add new options for specificly setting the threshold of deletion/insertion signals merging in the same read. The default parameters are 0 bp for deletion and 100 bp for insertion. 2.Remove parameter --merge_threshold. 3.Fix bugs in inversion and translocation calling. 4.Add new option for specificly setting the maximum size of SV to be discovered. The default value is 100,000 bp.
cuteSV (v1.0.4): 1.Add a new option for specificly setting the threshold of SV signals merging in the same read. The default parameter is 500 bp. You can reduce it for high-quality sequencing datasets like PacBio HiFi (CCS). 2.Make the genotyping function optional. 3.Enable users to set the threshold of SV allele frequency of homozygous/heterozygous. 4.Update the description of recommendation parameters in processing ONT data.
cuteSV (v1.0.3): 1.Refine the genotyping model. 2.Adjust the threshold value of heterozygosis alleles.
cuteSV (v1.0.2): 1.Improve the genotyping performance and enable it to be default option. 2.Make the description of parameters better. 3.Modify the header description of vcf file. 4.Add two new indicators, i.e., BREAKPOINT_STD and SVLEN_STD, to further characterise deletion and insertion. 5.Remove a few redundant functions which will reduce code readability.
Jiang T et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol 21, 189 (2020). https://doi.org/10.1186/s13059-020-02107-y
Cao S et al. Re-genotyping structural variants through an accurate force-calling method. bioRxiv 2022.08.29.505534; doi: https://doi.org/10.1101/2022.08.29.505534
For advising, bug reporting and requiring help, please post on Github Issue.