/SAPA

SNP annotation programm for AML

Primary LanguageHTML

SAPA - SNP annotation programm for AML

75% of a Acute myeloid leukemia (AML) patients SNPs are unique to them and their impact is still unresolved.

SAPA is designed to add additional information to an Illumina truseq amplicon variants csv file. Such as Scores for nonsynonymous scoring matrices, divided into 3 different combined Scores, function prediction scores, conservation scores and one ensemble score. So that the viewer gets more Information whether a mutation is deleterious or tolerated. Due to performance and data reasons, SAPA uses annovar1 and their precomputed SNP database, to quickly gather the deleterious prediction methods scores.

Update:

Now with HTML heatmap output! HTML output

First run

If you don't have PiP and Dominate you have to install them, before you can run SAPA, so type:

sudo apt-get install python-pip sudo pip install dominate

To get started simply type in:

./main.py -i example_data/truseq_example_data.csv

this will run the program with the example dataset. Your output will be saved in the output.txt and output.html files.

Please Note: During the first run of the program, the required database will be downloaded. This may take some time, depending on your internet connection. You may want to use the -fast argument to minimize the download size.

Parameter

Markdown spport in GIT...

-h, --help : show the help message and exit

-q, --quiet : prevent output in command line, run the program and don't bother

-m, --manual : display the manual for this program

-i, --input_file : truseq amplicon table containing SNPs > e.g. -i data/truseq_example_data.csv

-d, --detail : write detailed output file, with all the single scores of the deleteriousness prediction methods for nonsynonymous SNVs and amino acid substitution (if any)

-s, --separator : set the input file separator (default: ",") > e.g. -s ";" this will set the column separator to a semicolon

-t, --text_delimiter : set the input text delimiter (default: ") > e.g. -t "'" this will set the text delimiter to an apostrophe, if a column contains multiple entries

-b, --buildver : buildversion e.g hg19 (GRCh37) or hg38 (GRCh38) database (default hg19) > e.g. -b hg38 set the reference genome to hg38

-o, --output_file : output file name (default output.txt) > e.g. -o annotated_snps_detailed.txt -d running the detailed operation mode, and saving it in the file annotated_snps_detailed.txt

-f, --fast : run annotation just with a region based approach, for faster computing and less download file demand

-fi, --filter : filter all entries and only use nonsynonymus and clinically significant (>95%) SNPs in the output file

Features

Added data to a SNP csv file

No worries your input file will not be overwritten, a new file with the data from the input file will be created with the following scores added.:

  • function prediction scores
  • conservation scores
  • ensemble score
  • final prediction

Scoring explained (dbnsfp30a)

Databases

All required databases will be downloaded in the hg19/ directory. If some error during downloading occurs, you can read furhter details in the annovar_downdb.log in the hg19/ folder. Several commonly used databases are integrated:

  • cytoBand

    • for the chromosome coordinate of each cytogenetic band
  • exac03

    • for the variants reported in the Exome Aggregation Consortium (version 0.3)
  • dbnsfp30a

    • for various functional deleteriousness prediction scores from the dbNSFP database
  • clinvar_20140929

    • for the variants reported in the ClinVar database (version 20140929).

For functional prediction of variants in whole-genome data:

  • GERP++

    • functional prediction scores for 9 billion mutations based on selective constraints across human genome. You can optionally use gerp++gt2 instead since it includes only RS score greater than 2, which provides high sensitivity while still strongly enriching for truly constrained sites
  • CADD

    • Combined Annotation Dependent Depletion score for 9 billion mutations. It is basically constructed by a support vector machine trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants, using ~70 different features.
  • DANN

    • functional prediction score generated by deep learning, using the identical set of training data as cadd but with much improved performance than CADD.
  • FATHMM

    • a hidden markov model to predict the functional importance of both coding and non-coding variants (that is, two separate scores are provided) on 9 billion mutations.
  • EIGEN

    • a spectral approach integrating functional genomic annotations for coding and noncoding variants on 9 billion mutations, without labelled training data (that is, unsupervised approach)
  • GWAVA

    • genome-wide annotation of variants that supports prioritization of noncoding variants by integrating various genomic and epigenomic annotations on 9 billion mutations.

For functional prediction of variants in whole-exome data:

  • dbnsfp30a
    • this dataset already includes SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, CADD, GERP++, DANN, fitCons, PhyloP and SiPhy scores, but ONLY on coding variants
    • for more information check: http://varianttools.sourceforge.net/Annotation/DbNSFP

Input File

As input file is a Illumina truseq amplicon variants file needed, which contains the patients SNPs in a character separated format. A example file is included in the /data directory and listed below.

Pos Ref Alt Type Context Consequence dbSNP COSMIC ClinVar Qual Alt Freq Total Depth Ref Depth Alt Depth Strand Bias
25463566 C A SNV Coding missense_variant 100 0.334 6568 4358 2194 -100.0
128204951 C T SNV Coding,Intergenic missense_variant,upstream_gene_variant rs2335052 COSM445531 100 0.549 1312 587 720 -100.0
128205860 G C SNV Coding,Intergenic synonymous_variant,upstream_gene_variant rs1573858 100 0.997 1925 0 1920 -100.0
55141055 A G SNV Coding synonymous_variant rs1873778 COSM1430082 100 0.997 12713 35 12670 -100.0
106196829 T G SNV Coding missense_variant rs34402524 COSM87176 100 0.482 38535 19801 18585 -100.0
106196937 AT A Deletion Coding frameshift_variant,feature_truncation 100 0.385 31380 19297 12083 -100.0
101917521 G A SNV rs803064 100 0.558 9023 3973 5033 -100.0
101921289 A G SNV rs2230103,rs79334660 100 0.37 2903 1830 1073 -100.0
21971127 A G SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,synonymous_variant 100 0.172 157 120 27 -100.0
21971127 A T SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,synonymous_variant 44 0.057 157 120 9 -100.0
21971129 T G SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,missense_variant 44 0.057 158 143 9 -100.0
21971130 G A SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,synonymous_variant 68 0.074 163 144 12 -100.0
21971131 G A SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,missense_variant COSM13766 44 0.057 157 144 9 -100.0
7579472 G C SNV Coding missense_variant rs1042522 COSM45985 non-pathogenic 100 0.591 3567 1452 2108 -100.0
74733099 G A SNV Intergenic,Coding,Intron,5P_UTR downstream_gene_variant,upstream_gene_variant,synonymous_variant,intron_variant,5_prime_UTR_variant rs237057 100 0.995 7558 34 7518 -100.0
31022959 T C SNV Coding missense_variant rs6058694 100 0.997 17877 39 17831 -100.0
36252877 C T SNV Coding missense_variant COSM96546 100 0.465 8562 4573 3978 -100.0
39933339 A G SNV Coding synonymous_variant rs5917933 100 0.998 14616 18 14592 -100.0
123195650 A T SNV Coding missense_variant 47 0.073 109 100 8 -100.0

Output File (default)

The generated outputfile output.html, will be in the current working directory, pelase note that if you move the output.html to a different folder, you need to copy the resources folder too. Also the output.txt, which contains the patients annotated SNPs in a character separated format, will be in the current working directory. Using the included example dataset in the /data directory, the following output will be generated.

ID Chr Pos Ref Alt Type Context Consequence dbSNP COSMIC ClinVar Qual Alt Freq [%] Total Depth Ref Depth Alt Depth Strand Bias Gene function prediction scores [0-1] conservation scores[-12.3-6.17] ensemble scores[0-60] final prediction
0 chr2 25463566 C A SNV Coding missense_variant 100 0,334 6568 4358 2194 -100 DNMT3A 0,863 5,76 34 Deleterious
1 chr3 128204951 C T SNV Coding,Intergenic missense_variant,upstream_gene_variant rs2335052 COSM445531 100 0,549 1312 587 720 -100 GATA2 0 1,83 13,22 Tolerated
2 chr3 128205860 G C SNV Coding,Intergenic synonymous_variant,upstream_gene_variant rs1573858 100 0,997 1925 0 1920 -100 GATA2 0 0 0 Tolerated (synonymous)
3 chr4 55141055 A G SNV Coding synonymous_variant rs1873778 COSM1430082 100 0,997 12713 35 12670 -100 PDGFRA 0 0 0 Tolerated (synonymous)
4 chr4 106196829 T G SNV Coding missense_variant rs34402524 COSM87176 100 0,482 38535 19801 18585 -100 TET2 0 5,16 21,5 Tolerated
5 chr4 106196937 T -0 Deletion Coding frameshift_variant,feature_truncation 100 0,385 31380 19297 12083 -100 TET2 0 0 0 0
6 chr7 101917521 G A SNV rs803064 100 0,558 9023 3973 5033 -100 CUX1 0 -0,106 5,82 Tolerated
7 chr7 101921289 A G SNV rs2230103,rs79334660 100 0,37 2903 1830 1073 -100 CUX1 0,004 3,4 7,655 Tolerated
8 chr9 21971127 A G SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,synonymous_variant 100 0,172 157 120 27 -100 CDKN2A 0,192 0,609 1,37 Tolerated
9 chr9 21971127 A T SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,synonymous_variant 44 0,057 157 120 9 -100 CDKN2A 0,178 0,609 0,119 Tolerated
10 chr9 21971129 T G SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,missense_variant 44 0,057 158 143 9 -100 CDKN2A 0,379 5,79 23,7 Tolerated
11 chr9 21971130 G A SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,synonymous_variant 68 0,074 163 144 12 -100 CDKN2A 0,224 3,74 10,53 Tolerated
13 chr17 7579472 G C SNV Coding missense_variant rs1042522 COSM45985 non-pathogenic 100 0,591 3567 1452 2108 -100 TP53 0 1,87 0,355 Tolerated
12 chr9 21971131 G A SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,missense_variant COSM13766 44 0,057 157 144 9 -100 CDKN2A 0,109 -0,271 10,15 Tolerated
14 chr17 74733099 G A SNV Intergenic,Intergenic,Coding,Intron,5P_UTR downstream_gene_variant,upstream_gene_variant,synonymous_variant,intron_variant,5_prime_UTR_variant rs237057 100 0,995 7558 34 7518 -100 SRSF2 0 0 0 Tolerated (synonymous)
15 chr20 31022959 T C SNV Coding missense_variant rs6058694 100 0,997 17877 39 17831 -100 ASXL1 0 0 0 0
16 chr21 36252877 C T SNV Coding missense_variant COSM96546 100 0,465 8562 4573 3978 -100 RUNX1 0,991 5,31 33 Deleterious
17 chrX 39933339 A G SNV Coding synonymous_variant rs5917933 100 0,998 14616 18 14592 -100 BCOR 0 0 0 Tolerated (synonymous)
18 chrX 123195650 A T SNV Coding missense_variant 47 0,073 109 100 8 -100 STAG2 0,303 4,77 28,6 Tolerated

Output File (detailed)

ID Chr Pos Ref Alt Type Context Consequence dbSNP COSMIC ClinVar Qual Alt Freq [%] Total Depth Ref Depth Alt Depth Strand Bias Chr Start End Ref Alt Func_refGene Gene_refGene GeneDetail_refGene ExonicFunc_refGene AAChange_refGene cytoBand esp6500siv2_all avsnp147 SIFT_score SIFT_pred Polyphen2_HDIV_score Polyphen2_HDIV_pred Polyphen2_HVAR_score Polyphen2_HVAR_pred LRT_score LRT_pred MutationTaster_score MutationTaster_pred MutationAssessor_score MutationAssessor_pred FATHMM_score FATHMM_pred PROVEAN_score PROVEAN_pred VEST3_score CADD_raw CADD_phred DANN_score fathmm_MKL_coding_score fathmm_MKL_coding_pred MetaSVM_score MetaSVM_pred MetaLR_score MetaLR_pred integrated_fitCons_score integrated_confidence_value GERP_RS phyloP7way_vertebrate phyloP20way_mammalian phastCons7way_vertebrate phastCons20way_mammalian SiPhy_29way_logOdds final prediction
0 chr2 25463566 C A SNV Coding missense_variant 100 33,4 6568 4358 2194 -100 chr2 25463566 25463566 C A exonic DNMT3A 0 nonsynonymous SNV DNMT3A:NM_153759:exon14:c,G1549T:p,G517W,DNMT3A:NM_022552:exon18:c,G2116T:p,G706W,DNMT3A:NM_175629:exon18:c,G2116T:p,G706W 2p23,3 0 rs749365376 0 D 1 D 1 D 0 D 1 D 3,675 H -2,3 D -7,86 D 0,979 7,514 34 0,997 0,985 D 0,946 D 0,863 D 0,707 0 5,76 0,791 0,935 0,998 1 18,523 Deleterious
1 chr3 128204951 C T SNV Coding,Intergenic missense_variant,upstream_gene_variant rs2335052 COSM445531 100 54,9 1312 587 720 -100 chr3 128204951 128204951 C T exonic GATA2 0 nonsynonymous SNV GATA2:NM_001145662:exon3:c,G490A:p,A164T,GATA2:NM_032638:exon3:c,G490A:p,A164T,GATA2:NM_001145661:exon4:c,G490A:p,A164T 3q21,3 0,1638 rs2335052 0,429 T 0,013 B 0,025 B 0,004 N 0,729 P -0,175 N -4,32 D -0,43 N 0,117 1,482 13,22 0,99 0,91 D -1,065 T 0 T 0,543 0 1,83 0,598 0,739 0,999 0,999 9,555 Tolerated
2 chr3 128205860 G C SNV Coding,Intergenic synonymous_variant,upstream_gene_variant rs1573858 100 99,7 1925 0 1920 -100 chr3 128205860 128205860 G C exonic GATA2 0 synonymous SNV GATA2:NM_001145662:exon2:c,C15G:p,P5P,GATA2:NM_032638:exon2:c,C15G:p,P5P,GATA2:NM_001145661:exon3:c,C15G:p,P5P 3q21,3 0,6934 rs1573858 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Tolerated (synonymous)
3 chr4 55141055 A G SNV Coding synonymous_variant rs1873778 COSM1430082 100 99,7 12713 35 12670 -100 chr4 55141055 55141055 A G exonic PDGFRA 0 synonymous SNV PDGFRA:NM_006206:exon12:c,A1701G:p,P567P 4q12 0,9589 rs1873778 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Tolerated (synonymous)
4 chr4 106196829 T G SNV Coding missense_variant rs34402524 COSM87176 100 48,2 38535 19801 18585 -100 chr4 106196829 106196829 T G exonic TET2 0 nonsynonymous SNV TET2:NM_001127208:exon11:c,T5162G:p,L1721W 4q24 0,1233 rs34402524 0,003 D 0,794 P 0,754 P 0 0 1 P 0 N 4,35 T 0,03 N 0,232 2,834 21,5 0,932 0,78 D -0,626 T 0 T 0,707 0 5,16 0,991 1,011 0,775 0,124 13,59 Tolerated
5 chr4 106196937 T -0 Deletion Coding frameshift_variant,feature_truncation 100 38,5 31380 19297 12083 -100 chr4 106196937 106196937 T -0 exonic TET2 0 frameshift deletion TET2:NM_001127208:exon11:c,5270delA:p,H1757fs 4q24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 chr7 101917521 G A SNV rs803064 100 55,8 9023 3973 5033 -100 chr7 101917521 101917521 G A exonic CUX1 0 nonsynonymous SNV CUX1:NM_001202544:exon15:c,G1342A:p,A448T,CUX1:NM_001202545:exon15:c,G1252A:p,A418T,CUX1:NM_001202546:exon15:c,G1273A:p,A425T,CUX1:NM_001913:exon16:c,G1390A:p,A464T,CUX1:NM_181500:exon16:c,G1384A:p,A462T 7q22,1 0,575 rs803064 0,902 T 0,229 B 0,052 B 0 0 1 P 1,24 L 1,51 T -1,17 N 0,273 0,313 5,82 0,736 0,84 D -0,953 T 0 T 0,706 0 -0,106 -0,683 -0,839 0,437 0 7,499 Tolerated
7 chr7 101921289 A G SNV rs2230103,rs79334660 100 37 2903 1830 1073 -100 chr7 101921289 101921289 A G exonic CUX1 0 nonsynonymous SNV CUX1:NM_001202544:exon17:c,A1585G:p,I529V,CUX1:NM_001202545:exon17:c,A1495G:p,I499V,CUX1:NM_001202546:exon17:c,A1516G:p,I506V,CUX1:NM_001913:exon18:c,A1633G:p,I545V,CUX1:NM_181500:exon18:c,A1627G:p,I543V 7q22,1 0,0382 rs2230103 1 T 0,139 B 0,07 B 0 0 0,995 N 0,5 N 1,63 T 0,03 N 0,326 0,53 7,655 0,604 0,967 D -1,033 T 0,004 T 0,706 0 3,4 0,697 1,186 0,993 1 10,046 Tolerated
8 chr9 21971127 A G SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,synonymous_variant 100 17,2 157 120 27 -100 chr9 21971127 21971127 A G exonic CDKN2A 0 nonsynonymous SNV CDKN2A:NM_058195:exon2:c,T274C:p,S92P 9p21,3 0 0 1 T 0 B 0 B 0,471 N 1 D 0 0 -0,88 T 3,69 N 0 -0,153 1,37 0,714 0,169 N -0,976 T 0,192 T 0,677 0 0,609 -0,056 0,149 0,536 0,933 4,699 Tolerated
9 chr9 21971127 A T SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,synonymous_variant 44 5,7 157 120 9 -100 chr9 21971127 21971127 A T exonic CDKN2A 0 nonsynonymous SNV CDKN2A:NM_058195:exon2:c,T274A:p,S92T 9p21,3 0 0 0,236 T 0,167 B 0,045 B 0,471 N 1 D 0 0 -0,95 T -0,45 N 0 -0,609 0,119 0,96 0,799 D -0,99 T 0,178 T 0,677 0 0,609 -0,056 0,149 0,536 0,933 4,699 Tolerated
10 chr9 21971129 T G SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,missense_variant 44 5,7 158 143 9 -100 chr9 21971129 21971129 T G exonic CDKN2A 0 nonsynonymous SNV CDKN2A:NM_000077:exon2:c,A229C:p,T77P,CDKN2A:NM_001195132:exon2:c,A229C:p,T77P,CDKN2A:NM_058195:exon2:c,A272C:p,H91P 9p21,3 0 0 0,008 D 0,998 D 0,915 D 0,029 N 0,989 N 0,975 L -0,99 T -6,87 D 0 4,105 23,7 0,982 0,938 D -0,261 T 0,379 T 0,677 0 5,79 0,991 1,061 0,464 0,913 10,099 Tolerated
11 chr9 21971130 G A SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,synonymous_variant 68 7,4 163 144 12 -100 chr9 21971130 21971130 G A exonic CDKN2A 0 nonsynonymous SNV CDKN2A:NM_058195:exon2:c,C271T:p,H91Y 9p21,3 0 0 0,197 T 0,417 B 0,158 B 0,029 N 1 N 0 0 -0,89 T -3,72 D 0 0,976 10,53 0,987 0,797 D -0,845 T 0,224 T 0,677 0 3,74 -0,711 -0,197 0,328 0,866 5,67 Tolerated
13 chr17 7579472 G C SNV Coding missense_variant rs1042522 COSM45985 non-pathogenic 100 59,1 3567 1452 2108 -100 chr17 7579472 7579472 G C exonic TP53 0 nonsynonymous SNV TP53:NM_001126118:exon3:c,C98G:p,P33R,TP53:NM_000546:exon4:c,C215G:p,P72R,TP53:NM_001126112:exon4:c,C215G:p,P72R,TP53:NM_001126113:exon4:c,C215G:p,P72R,TP53:NM_001126114:exon4:c,C215G:p,P72R,TP53:NM_001276695:exon4:c,C98G:p,P33R,TP53:NM_001276696:exon4:c,C98G:p,P33R,TP53:NM_001276760:exon4:c,C98G:p,P33R,TP53:NM_001276761:exon4:c,C98G:p,P33R 17p13,1 0,63 rs1042522 0,642 T 0,745 P 0,372 B 0,371 U 1 P 0 N -2,05 D -0,19 N 0,267 -0,415 0,355 0,57 0,361 N -0,929 T 0 T 0,722 0 1,87 0,518 1,045 0,001 0,001 9,773 Tolerated
12 chr9 21971131 G A SNV Intron,Intergenic,Coding intron_variant,NMD_transcript_variant,downstream_gene_variant,missense_variant COSM13766 44 5,7 157 144 9 -100 chr9 21971131 21971131 G A exonic CDKN2A 0 nonsynonymous SNV CDKN2A:NM_000077:exon2:c,C227T:p,A76V,CDKN2A:NM_001195132:exon2:c,C227T:p,A76V 9p21,3 0 0 0,276 T 0,004 B 0,001 B 0 0 1 N 0 0 -0,16 T -1,74 N 0,46 0,913 10,15 0,974 0,098 N -1,03 T 0,109 T 0,677 0 -0,271 -0,011 -0,239 0,309 0,856 6,3 Tolerated
14 chr17 74733099 G A SNV Intergenic,Intergenic,Coding,Intron,5P_UTR downstream_gene_variant,upstream_gene_variant,synonymous_variant,intron_variant,5_prime_UTR_variant rs237057 100 99,5 7558 34 7518 -100 chr17 74733099 74733099 G A exonic SRSF2 0 synonymous SNV SRSF2:NM_001195427:exon1:c,C144T:p,D48D,SRSF2:NM_003016:exon1:c,C144T:p,D48D 17q25,1 0,797 rs237057 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Tolerated (synonymous)
15 chr20 31022959 T C SNV Coding missense_variant rs6058694 100 99,7 17877 39 17831 -100 chr20 31022959 31022959 T C exonic ASXL1 0 nonsynonymous SNV ASXL1:NM_015338:exon12:c,T2444C:p,L815P 20q11,21 0,9999 rs6058694 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 chr21 36252877 C T SNV Coding missense_variant COSM96546 100 46,5 8562 4573 3978 -100 chr21 36252877 36252877 C T exonic RUNX1 0 nonsynonymous SNV RUNX1:NM_001001890:exon2:c,G404A:p,R135K,RUNX1:NM_001122607:exon2:c,G404A:p,R135K,RUNX1:NM_001754:exon5:c,G485A:p,R162K 21q22,12 0 0 0,001 D 0,999 D 0,997 D 0 D 1 D 3,195 M -6,37 D -2,5 D 0,954 7,021 33 0,998 0,989 D 0,994 D 0,991 D 0,722 0 5,31 0,871 0,932 1 1 19,335 Deleterious
17 chrX 39933339 A G SNV Coding synonymous_variant rs5917933 100 99,8 14616 18 14592 -100 chrX 39933339 39933339 A G exonic BCOR 0 synonymous SNV BCOR:NM_001123383:exon4:c,T1260C:p,D420D,BCOR:NM_001123384:exon4:c,T1260C:p,D420D,BCOR:NM_001123385:exon4:c,T1260C:p,D420D,BCOR:NM_017745:exon4:c,T1260C:p,D420D Xp11,4 0,8957 rs5917933 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Tolerated (synonymous)
18 chrX 123195650 A T SNV Coding missense_variant 47 7,3 109 100 8 -100 chrX 123195650 123195650 A T exonic STAG2 0 nonsynonymous SNV STAG2:NM_006603:exon16:c,A1564T:p,I522F,STAG2:NM_001042749:exon17:c,A1564T:p,I522F,STAG2:NM_001042750:exon17:c,A1564T:p,I522F,STAG2:NM_001042751:exon17:c,A1564T:p,I522F,STAG2:NM_001282418:exon17:c,A1564T:p,I522F Xq25 0 0 0,001 D 0,955 P 0,774 P 0 D 1 D 2,66 M 1,51 T -3,81 D 0,906 6,183 28,6 0,989 0,973 D -0,343 T 0,303 T 0 0 4,77 1,062 1,088 0,999 1 13,668 Tolerated

Validation

For validation i used 2023 Pathogenic SNPs from Multiple submitters out of Clinvar and 2000 benign SNPs also from Multiple submitters. Pathogenic file: 1419 deleterious (true positive), 564 tolerated (false negative), (40 miss) Benign file: 153 deleterious (false positives), 1736 tolerated (true negative), (111 miss) This yielded in a MCC of 0.64 (0.78) If you want to validate it yourself, you can use the clinvar datasets in the data/ folder or download a set yourself from clinvar and convert it to the SAPA format with the ClinVar_to_SAPA_parser.py

Known bugs and missing features

longer deletions are not handled correctly and often there are no scores for longer indels available. Also there is no scoring system for non coding variants, this may change later.

References

1 http://annovar.openbioinformatics.org/en/latest/ https://www.gesundheit.gv.at/labor/laborwerte/blutbild/blasten http://annovar.openbioinformatics.org/ http://cadd.gs.washington.edu/download http://www.ensembl.org/info/genome/variation/predicted_data.html http://varianttools.sourceforge.net/Annotation/DbNSFP https://brb.nci.nih.gov/seqtools/colexpanno.html#dbnsfp http://www.illumina.com/products/by-type/clinical-research-products/trusight-myeloid.html http://www.enlis.com/blog/2015/03/17/the-best-variant-prediction-method-that-no-one-is-using/

Literatur

  • Papaemmanuil, Elli; Gerstung, Moritz; Bullinger, Lars; Gaidzik, Verena I.; Paschka, Peter; Roberts, Nicola D. et al. (2016): Genomic Classification and Prognosis in Acute Myeloid Leukemia. In: The New England journal of medicine 374 (23), S. 2209–2221. DOI: 10.1056/NEJMoa1516192
  • Yang, Hui; Wang, Kai (2015): Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. In: Nature protocols 10 (10), S. 1556–1566. DOI: 10.1038/nprot.2015.105
  • Liu, Xiaoming; Jian, Xueqiu; Boerwinkle, Eric (2011): dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. In: Human mutation 32 (8), S. 894–899. DOI: 10.1002/humu.21517
  • Dong, Chengliang; Wei, Peng; Jian, Xueqiu; Gibbs, Richard; Boerwinkle, Eric; Wang, Kai; Liu, Xiaoming (2015): Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. In: Human molecular genetics 24 (8), S. 2125–2137. DOI: 10.1093/hmg/ddu733
  • Wakita, S.; Yamaguchi, H.; Ueki, T.; Usuki, K.; Kurosawa, S.; Kobayashi, Y. et al. (2016): Complex molecular genetic abnormalities involving three or more genetic mutations are important prognostic factors for acute myeloid leukemia. In: Leukemia 30 (3), S. 545–554. DOI: 10.1038/leu.2015.288.

Written with StackEdit.