imgag/ngs-bits

Replace `vcfbreakmulti` with ngs-bits VcfBreakMulti


vcflib's vcfbreakmulti does not handle annotations correctly when splitting multi-allelic variants.

  • create megSAP test for this error (e.g. calling on the extracted region of DNA2106177 shown below)
  • test if VcfBreakMulti from ngs-bits works properly (also test on whole WGS sample, e.g. NA12878_45)
  • benchmark time (on whole WGS sample, e.g. NA12878_45)
  • implement handling of phased genotypes
  • add tests for different combinations of phased genotypes
  • replace vcflib tool by ngs-bits tool in megSAP
  • merge with master
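
For the phased-genotype items in the list above, here is a minimal sketch of one possible way to rewrite a GT value when a multi-allelic record is split. The function name `split_genotype` and the convention are assumptions for illustration, not the confirmed behaviour of either tool: the reference allele stays `0`, the kept ALT allele becomes `1`, any other ALT allele becomes `.` (as in the `./1` genotype in the vcflib output further down), and the `/` or `|` separators are kept in place so phasing is not lost.

```python
import re


def split_genotype(gt: str, alt_index: int) -> str:
    """Rewrite GT for the split record that keeps ALT allele `alt_index` (1-based).

    Assumed convention (hypothetical, not taken from ngs-bits or vcflib source):
    REF stays 0, the kept ALT becomes 1, other ALT alleles become '.',
    and every separator ('/' unphased, '|' phased) is preserved.
    """
    # A capturing group makes re.split keep the separators in the token list.
    tokens = re.split(r"([/|])", gt)
    out = []
    for tok in tokens:
        if tok in ("/", "|"):
            out.append(tok)          # keep phasing information as-is
        elif tok == ".":
            out.append(".")          # missing allele stays missing
        elif int(tok) == 0:
            out.append("0")          # reference allele
        elif int(tok) == alt_index:
            out.append("1")          # the ALT allele kept in this record
        else:
            out.append(".")          # an ALT allele that moved to another record
    return "".join(out)


if __name__ == "__main__":
    for gt in ("1/2", "1|2", "2|1", "0|2"):
        print(gt, "->", split_genotype(gt, 1), "and", split_genotype(gt, 2))
```

Test cases for the different phased constellations could then simply enumerate GT strings such as `1|2`, `2|1` and `0|2` and compare both split records against the expected values.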

Example:
tabix -h /mnt/storage2/projects/diagnostic/Genome_Diagnostik/Sample_DNA2106177A1_02/dragen_variant_calls/DNA2106177A1_02_dragen.vcf.gz chr1:4772080-4772085 | /mnt/storage2/megSAP/tools/vcflib-1.0.3/build/vcfbreakmulti | VcfCheck

vcfbreakmulti (vcflib): 3min 28sec on NA12878_45_var_annotated.vcf
VcfBreakMulti (ngs-bits): 20sec on NA12878_45_var_annotated.vcf

VcfCheck output (vcflib):

WARNING: First base of insertion/deletion not matching - ref: 'T' alt: 'GC'! - in line 3696:
chr1	2331965	.	T	GC	258	low_conf_region	ABP=44;CSQ=GC|intron_variant|ENST00000378531.8|Transcript|||protein_coding|,GC|regulatory_region_variant|ENSR00001164926|RegulatoryFeature|||TF_binding_site|;CSQ2=GC|intron_variant|MODIFIER|MORN1|HGNC:25852|ENST00000378531.8|Transcript||12/13|c.1250+4504delinsGC|;MES_SWA=0.0&0.3&-1.1&0.0&0.0&-17.6&ENST00000378531;MQM=60;NGSD_COUNTS=0,715,0;NGSD_GENE_INFO=MORN1%20(inh%3Dn/a%20oe_syn%3D1.00%20oe_mis%3D1.05%20oe_lof%3D0.77);NGSD_GROUP=0,121;SAF=7;SAP=6;SAR=12;SpliceAI=GC|MORN1|0.00|0.00|0.00|0.00|-46|-14|8|42	GT:DP:AO:GQ	0/1:76:19:142

VcfCheck output (ngs-bits):

WARNING: First base of insertion/deletion not matching - ref: 'T' alt: 'GC'! - in line 3696:
chr1	2331965	.	T	GC	258	low_conf_region	MQM=60;SAP=6;SAR=12;SAF=7;ABP=44;CSQ=GC|intron_variant|ENST00000378531.8|Transcript|||protein_coding|,GC|regulatory_region_variant|ENSR00001164926|RegulatoryFeature|||TF_binding_site|;CSQ2=GC|intron_variant|MODIFIER|MORN1|HGNC:25852|ENST00000378531.8|Transcript||12/13|c.1250+4504delinsGC|;MES_SWA=0.0&0.3&-1.1&0.0&0.0&-17.6&ENST00000378531;SpliceAI=GC|MORN1|0.00|0.00|0.00|0.00|-46|-14|8|42;NGSD_COUNTS=0,715,0;NGSD_GROUP=0,121;NGSD_GENE_INFO=MORN1%20(inh%3Dn/a%20oe_syn%3D1.00%20oe_mis%3D1.05%20oe_lof%3D0.77)	GT:DP:AO:GQ	0/1:76:19:142

EDIT: After re-analysis of NA12878_45, VcfCheck reports no warnings for either the ngs-bits VcfBreakMulti result or the vcflib vcfbreakmulti result.

When calling on the extracted region of DNA2106177:
- ngs-bits VcfBreakMulti passes VcfCheck without warnings
- vcflib's vcfbreakmulti produces 6 VcfCheck warnings (VcfCheck_DNA2106177_region_vcflib_out.txt)

For example:
WARNING: For sample 'DNA2106177A1_02 / annotation 'AD' (number=R), the number of values is 3, but should be 2! - in line 2619:
chr1	4772083	.	ATTTTT	A	388.8	PASS	AC=1;AF=0.500;AN=2;DP=46;FS=0.000;FractionInformativeReads=0.911;MQ=250.00;MQRankSum=0.000;QD=8.45;ReadPosRankSum=0.000;SOR=1.075	GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB	./1:0,24,17:0.5854:41:0,9,10:0,15,7:44:411,350,54,1048,0,58:3.8880e+02,3.3880e+02,4.5857e+01,1.0373e+03,1.5617e-04,5.0000e+01:0.00,11.00,14.01,11.00,22.00,14.01:0,0,24,17:0,0,28,13
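
The warning above comes down to a per-allele field (`AD`, Number=R) still carrying one value per original allele after the split. The following is a rough sketch of the bookkeeping involved; the helper names `subset_values` and `has_expected_count` are hypothetical, and this is not the implementation of VcfBreakMulti or VcfCheck. Number=G fields such as `PL` need genotype-index arithmetic and are deliberately left unchanged here.

```python
def subset_values(values: str, number: str, alt_index: int) -> str:
    """Keep only the entries that belong to ALT allele `alt_index` (1-based).

    `number` is the 'Number' entry from the VCF header of the field:
      'R' -> one value per allele including REF: keep REF and the chosen ALT
      'A' -> one value per ALT allele: keep the chosen ALT only
    Any other value is returned unchanged in this sketch.
    """
    parts = values.split(",")
    if number == "R":
        return ",".join([parts[0], parts[alt_index]])
    if number == "A":
        return parts[alt_index - 1]
    return values


def has_expected_count(values: str, number: str, n_alt: int) -> bool:
    """Check the value count of a Number=R or Number=A field for `n_alt` ALT alleles."""
    count = len(values.split(","))
    if number == "R":
        return count == n_alt + 1
    if number == "A":
        return count == n_alt
    return True  # other Number types are not checked in this sketch


if __name__ == "__main__":
    ad = "0,24,17"  # AD value from the record above, still holding three alleles
    print(has_expected_count(ad, "R", n_alt=1))                       # False
    print(subset_values(ad, "R", 1), subset_values(ad, "R", 2))       # 0,24 0,17
    print(has_expected_count(subset_values(ad, "R", 2), "R", n_alt=1))  # True
```

Which of the two subsets is correct for the record shown depends on the position of the `A` allele in the original multi-allelic record, which is not visible in the split output.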