Structure variants (SVs), such as inversions, duplications and large deletion, have complex genetic profiles as they vary in size, architecture and genomic context and read alignment accuracy is compromised near SVs. Multiple computational tools have been developed for calling SVs, but without effective and generalizable filtering tools. We developed a bioinformatics tool kit consisting of two major modules: the first annotation component to annotate given SV calls with various genomics features and polymorphism CNVs with the assumption that the SVs overlapping with polymorphism CNVs are likely true calls, the second module used boosted regression to construct a predictive model of SV filtering/prioritization based on these features.
Please make sure that SVs generated by SV callers, such as LUMPY, follows standard VCF4.2 specification. After SV calling procedure, we firstly examine whether a certian SV is known or novel by overlapping with published polymorphic CNV database and 1000 genome database. At the same time, each SV is annotated with the following attributes:
- PE: number of paired end reads support
- SR: number of split reads support
- GERMLINE_ARTIFACT: integer indicating potential germline artifact
- POLYMORPHIC_CNVMAP: 0/1 indicating whether the same SV is a known SV based on polymophic CNV database
- POLYMORPHIC_CNVR: 0/1 indicating whether the same SV is a known SV based on 1k genome database
- L_START_STRICTMASK & L_END_STRICTMASK: average of strict mask L 100bp around SV start/end position (L indicates depth of coverage is much lower than average)
- H_START_STRICTMASK & H_END_STRICTMASK: average of strict mask H 100bp around SV start/end position (H indicates depth of coverage is much higher than average)
- Z_START_STRICTMASK & Z_END_STRICTMASK: average of strict mask Z 100bp around SV start/end position (Z indicates too many reads with zero mapping quality overlap this position)
- Q_START_STRICTMASK & Q_END_STRICTMASK: average of strict mask Q 100bp around SV start/end position (Q indicates average mapping quality is too low)
You will need to have perl and bedtools installed on your system for annotation purpose. To annotate SVs, firstly download all required annotation files under annotation_files directory to your local directory. Extract germline_artifacts.7z and StrictMask.hg38.fa.7z to your local directory. The annotation files will roughly consumes 3.9G of your storage space. Then, use the perl script under inst/script folder to annotate the vcf file:
perl annotate_vcf.pl -i path/to/input.vcf -o path/to/output_table.txt -g path/to/germline_artifacts.txt -c path/to/CNVMAP.bed -v path/to/CNVR.bed -d path/to/segmental_dups.bed -f path/to/Homo_sapiens.refFlat -m path/to/strictmask.fa -e path/to/bedtools
For trio samples, Please see the help page for annotate_vcf.pl for more details. Following the annotation procedure, you then need to run DVboost algorithm to perform filtering based on the annotation.
You can download DVboost from github, you need to install devtools if you haven't done so:
install.packages('devtools')
devtools::install_github('Liuy12/DVboost')
library(DVboost)
Firstly, you need to load the annotated SVs file. loadVariants function will load and reformat the annotated SVs file. Average strict masks based on SVs start and end position will be calculated seperatedly for for L,H,Z,Q:
dataMat <- loadVariants('path/to/input/SVs')
For demonstration purpose, we will load the example annotated SVs file:
data(ExampleData, package='DVboost')
str(ExampleData)
We can take a look at the composition of SV types and whether they have been previously identified with polymophic CNV database
table(ExampleData$SVType, ExampleData$CNVMAP)
As you can see, there are limited number of known SVs for BND, DUP & INV SV types. We will use DEL SV type to train the model and then test on BND, DUP & INV SV types. Annotated polymophic SVs will be treated as true positives and novel SVs will be treated as true negatives.
DEL_svtype <- ExampleData[ExampleData$SVType=='DEL',]
trueSVs <- as.numeric(DEL_svtype$CNVMAP ==1 | DEL_svtype$CNVR ==1)
table(trueSVs)
Then we will train the model using DEL subset of annotated SVs.
outdir <- getwd()
DVb.res <- runDVboostwrapper( var.atr.mtx = DEL_svtype, var.ID.vec = rownames(DEL_svtype),
is.known.var.vec = trueSVs,
output.DIR.name = outdir, input.sample.ID='NA12878', bySVlength=F)
Finally, we will apply the trained model to evaluate the remaining set of SVs including BND, DUP & INV SV types.
outmat <- DVboostQscore(DVb.res, ExampleData)
head(outmat)