pstawinski/pygenebe

genebe commandline tool fails with input vcf


I cannot get the genebe command-line tool to run successfully.

input command:
genebe annotate --input 24072-01-01_split.vcf --output 24072-01-01_split_genebe.vcf --progress

error message:

Traceback (most recent call last):
  File "/Users/armindeffur/my-envs/genebe/bin/genebe", line 8, in <module>
    sys.exit(main())
  File "/Users/armindeffur/my-envs/genebe/lib/python3.9/site-packages/genebe/entrypoint.py", line 147, in main
    annotate_vcf(
  File "/Users/armindeffur/my-envs/genebe/lib/python3.9/site-packages/genebe/vcf_simple_annotator.py", line 122, in annotate_vcf
    variants_batch = [
  File "/Users/armindeffur/my-envs/genebe/lib/python3.9/site-packages/genebe/vcf_simple_annotator.py", line 123, in <listcomp>
    f"{variant.CHROM}-{variant.POS}-{variant.REF}-{variant.ALT[0]}"
IndexError: list index out of range

I suspect that the VCF is the issue, as it seems that genebe can't extract the correct chrom-pos-ref-alt information.
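
One way to narrow this down: the traceback fails on variant.ALT[0], which would happen for any record whose ALT list is empty. The COVFR header mentions "wildtype positions", so the file may contain reference-only records with ALT set to ".". This is a hypothesis, not something confirmed in this thread. The sketch below uses only the Python standard library and scans VCF text for such records. The function name records_without_alt and the second example record are made up for illustration and are not part of genebe.

```python
def records_without_alt(lines):
    """Return (line_number, reason) for data records that have no ALT allele."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        # Skip blank lines and header lines; only data records are checked.
        if not text.strip() or text.startswith("#"):
            continue
        fields = text.split("\t")
        if len(fields) < 5:
            problems.append((line_number, "fewer than 5 columns"))
        elif fields[4] in ("", "."):
            problems.append(
                (line_number, f"{fields[0]}:{fields[1]} has no ALT allele")
            )
    return problems


# Illustrative input: the last record is hypothetical, with ALT set to ".".
example = [
    "##fileformat=VCFv4.3\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1930142\trs141833643\tC\tA\t.\tPASS\t.\n",
    "1\t1930200\t.\tG\t.\t.\tPASS\t.\n",
]
print(records_without_alt(example))
```

To run it on the real file, pass an open file handle instead of the example list, for instance records_without_alt(open("24072-01-01_split.vcf")). Any line it reports is a candidate for the record that triggers the IndexError.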

VCF file first few lines:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=05/27/24
##reference=hg38_2024
##source=SEQUENCE Pilot_5.4.1
##InputFileList=../Import/20240521_Twist-PCDv2_hg38-2_illumina/24072-01-01_S2_L001_R1_001.fastq.gz;../Import/20240521_Twist-PCDv2_hg38-2_illumina/24072-01-01_S2_L001_R2_001.fastq.gz
##INFO=<ID=Illumina-50x,Number=0,Type=Flag,Description="Analysis Settings Flag; [Settings];Profile: Illumina-50x;Reads include PCR primers: auto;Randomly sheared reads: no;Genome mapping: yes;Single / double direction analysis /;Both dir. min abs. cov.: 19;Both dir. min % cov.: off;Required Coverage /;Min abs. cov.: 10 per dir.;Ratio read dir.: off;Mutations /;Min abs. cov.: off;Ratio read dir.: off;Min % cov.: 10% per dir.;Force combined % cov.:off;Min & cov. homozygosity:85%;Mutation sorting /;Distinct/other cov.: 15% per dir.;Distinct/homop. cov. (deletion) : 40% per dir.;Distinct/homop. cov. (insertion): off per dir.;Homop. region size: 7;Expected Coverage Warning /;Min abs. cov.: 50;[Quality Score];Score thresh.: 15;Ignore reads thres.: 40%;Score cov. warning: off;Score read grouping: yes;[Trimming];Adaptor /;5': ; Error rate: 10; Overlap: 3abs;3': ; Error rate: 10; Overlap: 3abs;Remove bases: 0 (5') / 0 (3') ;[BAM/SAM];Genome from BAM / SAM: ;Mapping: yes;Alignment: yes;Unfiltered: no;[Tags];active: no;R1 tag length: 0;R2 tag length: 0;Min abs. cov. cons.: off;Min per. cov. cons.: off;Ignore cons. read thresh.: off;Ignore N tags: no;Ignore low Qs tags: no;[Fusions];active: no;Mode: open (exome only);Min abs. cov.: 10;Breakpoint spread.: 3;[Expert Settings];Base calling /;Genome Set: Diploid;Unique reads only: no;Read processing /;Compl. reads only: no;Barcode at 5' and 3': no;Ignore paired end info: no;Allow unique paired end reads: no;Require identical paired end overlap: no;Trim amplicons only: no;R1 / R2 read coloring: no;Gene specific primers: no;Alignment evaluation /;Skip evaluation: no;Max mismatches: 15%;Min matching bases: 50%;Keep strong consensus: 50%;Mutation table /;Warning: 50%;InDel gap SNP to SNP: 3;InDel gap SNP to InDel: 3;">
##INFO=<ID=ModifiedSettings,Number=.,Type=String,Description="Individual modifications of ROI settings wrt. to the settings indicated by the 'Analysis Settings Flag'.">
##INFO=<ID=GI,Number=.,Type=String,Description="Gene ID">
##INFO=<ID=TI,Number=.,Type=String,Description="Transcript ID">
##INFO=<ID=WEIGHTING,Number=.,Type=String,Description="Variation and Mutation sorting (distinct, other, homopolymer, filter, temp. filter)">
##INFO=<ID=ClinVitae:Classification,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=gnomAD:AC,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=gnomAD:AF,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=gnomAD:AN,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=gnomAD:Hom,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=ClinVar:Clinical Significance,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=1000Genomes:AF,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=dbSNP:MAF,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=ExAC:AC,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=ExAC:AF,Number=.,Type=String,Description="Mutation Info from Public DBs">
##INFO=<ID=COVFR,Number=2,Type=Integer,Description="# alt-forward reads, alt-reverse reads; for wildtype positions ref-forward reads and ref-reverse reads">
##INFO=<ID=CHGVS,Number=.,Type=String,Description="Codon change based on selected TI in HGVS nomenclature format">
##INFO=<ID=PHGVS,Number=.,Type=String,Description="Protein change based on selected TI in HGVS nomenclature format">
##FILTER=<ID=Illumina-50x,Description="Profile selected in Run Window of SeqNext module">
##FILTER=<ID=q15,Description="Quality below or equal15">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth at this position for this sample">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele frequency for each ALT allele in the same order as listed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Total read depth for each allele">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Read depth for each allele on the forward strand">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Read depth for each allele on the reverse strand">
##contig=<ID=1>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=2>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=X>
##bcftools_normVersion=1.20+htslib-1.20
##bcftools_normCommand=norm -m- 24072-01-01.vcf.gz; Date=Fri Jun 21 12:41:04 2024
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1
1 1922176 rs3039777 T TCTGA . PASS GI=CFAP74;TI=NM_001304360.2;Illumina-50x;WEIGHTING=distinct;dbSNP:MAF=0.975000;COVFR=1044,1037;CHGVS=c.*110_*111insTCAG GT:DP:AF:AD:ADF:ADR 1/1:2226:0.93:145,2081:79,1044:66,1037
1 1930142 rs141833643 C A . PASS GI=CFAP74;TI=NM_001304360.2;Illumina-50x;WEIGHTING=distinct;gnomAD:AC=1037;gnomAD:AF=0.007712;gnomAD:AN=134462;dbSNP:MAF=0.004553;COVFR=950,979;CHGVS=c.3206G>T;PHGVS=p.(Gly1069Val) GT:DP:AF:AD:ADF:ADR 0/1:3885:0.5:1956,1929:969,950:987,979
1 1968747 rs35269416 T C . PASS GI=CFAP74;TI=NM_001304360.2;Illumina-50x;WEIGHTING=distinct;gnomAD:AC=46962;gnomAD:AF=0.188203;gnomAD:AN=249528;dbSNP:MAF=0.111100;COVFR=1890,1655;CHGVS=c.1133A>G;PHGVS=p.(Lys378Arg) GT:DP:AF:AD:ADF:ADR 1/1:3556:1:11,3545:6,1890:5,1655
1 1968793 rs16824588 T C . PASS GI=CFAP74;TI=NM_001304360.2;Illumina-50x;WEIGHTING=distinct;gnomAD:AC=105280;gnomAD:AF=0.422052;gnomAD:AN=249448;dbSNP:MAF=0.228900;COVFR=1985,1183;CHGVS=c.1087A>G;PHGVS=p.(Ile363Val) GT:DP:AF:AD:ADF:ADR 1/1:3170:1:2,3168:1,1985:1,1183
1 1987049 rs4350140 A G . PASS GI=CFAP74;TI=NM_001304360.2;Illumina-50x;WEIGHTING=distinct;gnomAD:AC=130882;gnomAD:AF=0.552189;gnomAD:AN=237024;dbSNP:MAF=0.380100;COVFR=584,430;CHGVS=c.297-14T>C GT:DP:AF:AD:ADF:ADR 0/1:1995:0.51:981,1014:572,584:409,430
1 3682336 rs2273953 G A . PASS GI=TP73;TI=NM_005427.4;Illumina-50x;WEIGHTING=distinct;gnomAD:AC=27759;gnomAD:AF=0.203238;gnomAD:AN=136584;dbSNP:MAF=0.075000;COVFR=647,524;CHGVS=c.-30G>A GT:DP:AF:AD:ADF:ADR 0/1:2803:0.42:1632,1171:876,647:756,524
1 3682346 rs1801173 C T . PASS GI=TP73;TI=NM_005427.4;Illumina-50x;WEIGHTING=distinct;gnomAD:AC=30400;gnomAD:AF=0.201790;gnomAD:AN=150652;dbSNP:MAF=0.075000;COVFR=657,586;CHGVS=c.-20C>T GT:DP:AF:AD:ADF:ADR 0/1:2943:0.42:1700,1243:877,657:823,586
1 3690956 rs3765730 G A . PASS GI=TP73;TI=NM_001126242.3;Illumina-50x;WEIGHTING=distinct;gnomAD:AC=59098;gnomAD:AF=0.312130;gnomAD:AN=189338;dbSNP:MAF=0.226700;COVFR=276,497;CHGVS=c.39+12G>A GT:DP:AF:AD:ADF:ADR 0/1:1596:0.48:823,773:318,276:505,497

Hi,
I am not able to reproduce the error.

I've tried with the input you've provided:
input.vcf.gz

using the docker:

docker run -v ./input.vcf.gz:/tmp/input.vcf.gz -it --rm genebe/pygenebe:0.0.18 genebe annotate --input /tmp/input.vcf.gz --output /dev/stdout

Could you please check whether the VCF you are using contains empty lines at the end?
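
A quick way to check for empty lines, sketched in Python with the standard library only. The helper names blank_line_numbers and drop_blank_lines are illustrative and not part of genebe, and whether blank lines actually cause this error is an assumption to be tested.

```python
def blank_line_numbers(lines):
    """Return 1-based numbers of lines that are empty or whitespace-only."""
    return [number for number, line in enumerate(lines, start=1) if not line.strip()]


def drop_blank_lines(lines):
    """Return the lines with empty and whitespace-only lines removed."""
    return [line for line in lines if line.strip()]


# Illustrative input with two empty lines at the end.
example = [
    "##fileformat=VCFv4.3\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1930142\trs141833643\tC\tA\t.\tPASS\t.\n",
    "\n",
    "\n",
]
print(blank_line_numbers(example))
print(len(drop_blank_lines(example)))
```

If blank_line_numbers reports anything for the real file, writing the output of drop_blank_lines to a new file and annotating that copy would show whether the blank lines are the cause.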

For annotating VCF files, it is recommended to use genebe-cli: https://github.com/pstawinski/genebe-cli/releases