Bcftools norm: [E::vcf_format] Invalid BCF
Closed this issue · 4 comments
quentin67100 commented
Hello,
I am trying to run bcftools norm with the following command:
bcftools view --threads 6 \
-f .,PASS \
${TMP_DIR}/${VCF} \
--regions-file ${interval} \
| \
bcftools norm \
-m - -w 10000 -f ${FASTA} --output-type z --output ${VCF_DIR}/${VCF/vcf.gz/select.vcf.gz} --threads 4 -
With version 1.10.2 it works well, but with version 1.15.1 I get this error:
[E::vcf_format] Invalid BCF, the INFO tag id=54 is too large at chr1:930939
[flush_buffer] Error: cannot write to /path/to/my/output.vcf.gz
pd3 commented
There were some sanity checks added into htslib so the program is now less willing to accept malformed files. In order to debug this, we need a small reproducible test case. Can you extract the offending record from the file and show what it looks like, together with the VCF header?
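For anyone needing to build such a test case, here is a minimal sketch of pulling the header and a single record out of a VCF. The function name `extract_record` is hypothetical (it is not part of bcftools), and the sketch assumes an uncompressed, tab-delimited VCF passed as a file or on stdin.

```shell
# Print every header line plus the record(s) at one CHROM/POS.
# Hypothetical helper; expects an uncompressed, tab-delimited VCF.
extract_record() {
    chrom=$1
    pos=$2
    shift 2
    awk -F'\t' -v chrom="$chrom" -v pos="$pos" '
        /^#/ { print; next }                  # keep all header lines
        $1 == chrom && $2 == pos { print }    # keep only the matching record
    ' "$@"
}
```

For a compressed file, something like `gzip -dc input.vcf.gz | extract_record chr1 930939 > test.vcf` should work, where `input.vcf.gz` and `test.vcf` are placeholder names.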
quentin67100 commented
##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=DN,Number=1,Type=String,Description="Possible values are 'Inherited', 'DeNovo' or 'LowDQ'. Threshold for passing de novo call: SNPs: 0.05, INDELs: 0.02">
##FORMAT=<ID=DPL,Number=.,Type=Integer,Description="Normalized, Phred-scaled likelihoods used for DQ calculation">
##FORMAT=<ID=DQ,Number=1,Type=Float,Description="De novo quality">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=FT,Number=1,Type=String,Description="Sample filter, 'PASS' indicates that all filters have passed for this sample">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Phred-scaled posterior probabilities for genotypes as defined in the VCF specification">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PP,Number=G,Type=Integer,Description="Phred-scaled posterior genotype probabilities using pedigree prior probabilities">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=SQ,Number=A,Type=Float,Description="Somatic quality">
##DRAGENCommandLine=<ID=HashTableBuild,Version="SW: 01.003.044.3.9.0-340-ge806ab71, HashTableVersion: 8",CommandLineOptions="/opt/edico/bin/dragen --output-directory=/staging/tmp/suite_def/226334/illumina-isi07/scratch/dragen_team_share3/users/skannan/HT_alt_masked-build/build_hash_table_hg38_alt_masked_graph_reference.py/EId1/application_output --output-file-prefix=dragen.RunDragenNoDefaultsStep-NA.INV226334-EId1 --events-log-file=/illumina/scratch/test_runner_output/suite_logs/226334/illumina-isi07/scratch/dragen_team_share3/users/skannan/HT_alt_masked-build/build_hash_table_hg38_alt_masked_graph_reference.py/EId1/application_output/dragen_events.csv --ht-reference=/staging/tmp/suite_def/illumina-isi07/scratch/dragen_datasets/data/vault/reference_genomes/Hsapiens/hg38/seq/hg38.fa --ht-num-threads=40 --build-hash-table=true --ht-build-rna-hashtable=true --enable-cnv=true --ht-pop-alt-contigs=/illumina-isi07/scratch/dragen_datasets/data/vault/reference_genomes/Hsapiens/hg38_alt_masked_graph/liftover_pop_snps/V1/EUR16.phasedAlts.uniq.fasta.gz --ht-pop-alt-liftover=/illumina-isi07/scratch/dragen_datasets/data/vault/reference_genomes/Hsapiens/hg38_alt_masked_graph/liftover_pop_snps/V1/EUR16.phasedAlts.uniq.liftover.sam.gz --ht-pop-snps=/illumina-isi07/scratch/dragen_datasets/data/vault/reference_genomes/Hsapiens/hg38_alt_masked_graph/liftover_pop_snps/V1/EUR16_pop_snps.merged.vcf.gz --ht-mask-bed=/illumina-isi07/scratch/dragen_datasets/data/vault/reference_genomes/Hsapiens/hg38_alt_masked_graph/liftover_pop_snps/V1/hg38_alt_masked.N4.bed">
##DRAGENCommandLine=<ID=dragen,Version="SW: 05.021.609.3.9.5, HW: 05.021.609",Date="Wed Feb 02 10:00:32 UTC 2022",CommandLineOptions="--lic-server https://XXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY@license.edicogenome.com --lic-instance-id-location /root/.edico --output_status_file /data/scratch/progress.log --enable-vcf-compression true --enable-joint-genotyping true --vc-pedigree /data/input/appresults/279867612/Fam001.WES.dante.ped --output-directory /data/output/appresults/326684360/joint --output-file-prefix WES_dante --variant-list /data/scratch/gvcf_sheet.txt --ref-dir /data/scratch/hg38-altmasked-cnv-graph-anchor.v8">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (informative and non-informative); some reads may have been filtered based on mapq etc.">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (informative and non-informative); some reads may have been filtered based on mapq etc.">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=FractionInformativeReads,Number=1,Type=Float,Description="The fraction of informative reads out of the total reads">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="At least one variant at this position is somatic">
##FILTER=<ID=DRAGENSnpHardQUAL,Description="Set if true:QUAL < 10.41">
##FILTER=<ID=DRAGENIndelHardQUAL,Description="Set if true:QUAL < 7.83">
##FILTER=<ID=LowDepth,Description="Set if true:DP <= 1">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
##FILTER=<ID=DRAGENHardQUAL,Description="Set if true:QUAL < 10.4139">
##FILTER=<ID=LowGQ,Description="Set if true:GQ = 0">
##FILTER=<ID=lod_fstar,Description="Variant does not meet likelihood threshold (default threshold is 6.3)">
##FILTER=<ID=base_quality,Description="Site filtered because median base quality of alt reads at this locus does not meet threshold">
##FILTER=<ID=filtered_reads,Description="Site filtered because too large a fraction of reads have been filtered out">
##FILTER=<ID=fragment_length,Description="Site filtered because absolute difference between the median fragment length of alt reads and median fragment length of ref reads at this locus exceeds threshold">
##FILTER=<ID=low_depth,Description="Site filtered because the read depth is too low">
##FILTER=<ID=low_frac_info_reads,Description="Site filtered because the fraction of informative reads is below threshold">
##FILTER=<ID=low_normal_depth,Description="Site filtered because the normal sample read depth is too low">
##FILTER=<ID=long_indel,Description="Site filtered because the indel length is too long">
##FILTER=<ID=mapping_quality,Description="Site filtered because median mapping quality of alt reads at this locus does not meet threshold">
##FILTER=<ID=multiallelic,Description="Site filtered because more than two alt alleles pass tumor LOD">
##FILTER=<ID=non_homref_normal,Description="Site filtered because the normal sample genotype is not homozygous reference">
##FILTER=<ID=no_reliable_supporting_read,Description="Site filtered because no reliable supporting somatic read exists">
##FILTER=<ID=panel_of_normals,Description="Seen in at least one sample in the panel of normals vcf">
##FILTER=<ID=read_position,Description="Site filtered because median of distances between start/end of read and this locus is below threshold">
##FILTER=<ID=RMxNRepeatRegion,Description="Site filtered because all or part of the variant allele is a repeat of the reference">
##FILTER=<ID=strand_artifact,Description="Site filtered because of severe strand bias">
##FILTER=<ID=str_contraction,Description="Site filtered due to suspected PCR error where the alt allele is one repeat unit less than the reference">
##FILTER=<ID=too_few_supporting_reads,Description="Site filtered because there are too few supporting reads in the tumor sample">
##FILTER=<ID=weak_evidence,Description="Somatic variant score does not meet threshold">
##FILTER=<ID=DRAGENHardQUAL,Description="Set if true:QUAL < 10.4139">
##FILTER=<ID=LowGQ,Description="Set if true:GQ = 0">
##PEDIGREE=<Child=ATHZ11012BISWES,Mother=ATHZ21913MAMWES,Father=ATHZ10512DUDWES>
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
##contig=<ID=chrM,length=16569>
##contig=<ID=chr1_KI270706v1_random,length=175055>
##contig=<ID=chr1_KI270707v1_random,length=32032>
##contig=<ID=chr1_KI270708v1_random,length=127682>
##contig=<ID=chr1_KI270709v1_random,length=66860>
##contig=<ID=chr1_KI270710v1_random,length=40176>
##contig=<ID=chr1_KI270711v1_random,length=42210>
(followed by all the alternative contigs)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ATHZ10512DUDWES ATHZ11012BISWES ATHZ21913MAMWES
chr1 930939 rs9988021 G A 807.23 PASS AC=6;AF=1.000;AN=6;DP=197;FS=0.000;MQ=250.00;QD=4.10;SOR=2.061;DB GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DN 1/1:0,68:1.000:68:202:PASS:0,34:0,34:289,205,0:2.5168e+02,2.0168e+02,0.0000e+00:340,210,0:. 1/1:0,66:1.000:66:196:PASS:0,39:0,27:283,199,0:2.4565e+02,1.9565e+02,0.0000e+00:415,279,0:Inherited 1/1:0,63:1.000:63:186:PASS:0,41:0,22:273,189,0:2.3571e+02,1.8571e+02,0.0000e+00:340,194,0:.
pd3 commented
The program gives a suggestion about what's wrong:
bcftools view rmme.vcf -o /dev/null -Ob
[W::vcf_parse_info] INFO 'DB' is not defined in the header, assuming Type=String
[E::bcf_write] Unchecked error (2) at chr1:930939
[main_vcfview] Error: cannot write to /dev/null
When I add the missing definition, it runs fine:
##INFO=<ID=DB,Number=0,Type=Flag,Description="xxx">
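To check a whole file for the same kind of problem, a rough sketch like the following lists INFO keys that appear in records but have no `##INFO` line in the header. The function name `undefined_info_tags` is hypothetical, and the sketch assumes an uncompressed, tab-delimited VCF; it does not replace the checks done by htslib.

```shell
# List INFO keys used in records but never declared in a ##INFO header line.
# Hypothetical helper; reads an uncompressed, tab-delimited VCF from a file or stdin.
undefined_info_tags() {
    awk -F'\t' '
        /^##INFO=<ID=/ {
            id = $0
            sub(/^##INFO=<ID=/, "", id)   # drop the prefix
            sub(/[,>].*$/, "", id)        # keep only the ID value
            declared[id] = 1
            next
        }
        /^#/ { next }                     # skip all other header lines
        $8 != "." && $8 != "" {
            n = split($8, fields, ";")
            for (i = 1; i <= n; i++) {
                key = fields[i]
                sub(/=.*$/, "", key)      # flags such as DB have no "=value" part
                if (key != "" && !(key in declared) && !(key in reported)) {
                    reported[key] = 1     # report each missing key only once
                    print key
                }
            }
        }
    ' "$@"
}
```

Run on the record above, this should print `DB`. Once the missing definition has been written into a corrected header file, one option for applying it is `bcftools reheader -h`.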
pd3 commented
I believe this is resolved, please reopen if not.