samtools/bcftools

bcftools 1.10 fails to index files with wrong INFO/END values but bcftools 1.9 worked fine

freeseek opened this issue · 21 comments

The following code reproduces the error:

bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' -Oz -o output.vcf.gz \
  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
bcftools index -ft output.vcf.gz

The problem here is that the 1000 Genomes project team made a mess and forgot to liftOver the END coordinates of the VCF. I am not sure whether this counts as a bcftools bug. But the VCF specification does not really call for the END field to always be bigger than the POS field, which is what is causing the problem here. I do find odd that tabix and previous versions of bcftools worked fine with these odd cases though.

If this is considered appropriate bcftools behavior, then I think that to a minimum bcftools norm should also check than END is always larger than POS.

Also, more importantly, if I output the VCF to a binary format and then I try to index as follows:

bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' -Ob -o output.bcf \
  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
bcftools index -f output.bcf

I get this very weird error:

[E::hts_idx_check_range] Region 4461912..4299347302 cannot be stored in a csi index with min_shift = 14, n_lvls = 5. Try using min_shift = 14, n_lvls >= 7
index: failed to create index for "output.bcf"

It seems like there was a failure to check for END being greater than POS in the encoding to binary format that caused some sort of overflow.

pd3 commented

Can you please try with the latest htslib version? This is a duplicate of samtools/htslib#1006 and was addressed by samtools/htslib@ea879e1.

I have just re-cloned and re-compiled bcftools and I get the same exact errors. I have noticed though that the download of the VCF snippet with bcftools view often fails, sometimes yielding a VCF without variants, but if I retry it many many times it eventually works. I am not sure why this is. Maybe it is a temporary issue at the time of this writing with the EBI webiste.

Actually I am noticing this HTSlib problem that is unrelated but it seems also like a serious HTSlib bug, so it might be worth it reporting it. At the very moment the EBI website seems to return 404 errors for files that are actually present. When I run the following command:

bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' \
  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz

I get three possible outputs:

Failed to read from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz: could not parse header

Or I get a VCF without variants or I get a VCF with both rs6640136 and rs57675010 variants.

The first output is okay, as it means the EBI website provided a 404 answer. The third outcome is the desired outcome. The second outcome seems to be due to the fact that the EBI website provided the header fine but when bcftools tried to get the X:4380006-4461913 region it provided a 404 answer. This is troublesome because bcftools did not throw any error. This does not seem like a desired behavior.

pd3 commented

I am not able to reproduce, I only get a correct warning with the latest version:

$ bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
[W::vcf_parse] INFO/END=4380006 is smaller than POS at X:4461913

$ bcftools index rmme.bcf 

$ bcftools --version
bcftools 1.10.2-18-g5774f32
Using htslib 1.10.2-22-gbfc9f0d

Make sure the latest version is used and a truncated version of the remote index is not present in the local directory.

I have the same version as you do.

$ bcftools index -ft output.vcf.gz
[E::hts_idx_push] Invalid record on sequence #1: end 4380006 < begin 4461913
index: failed to create index for "output.vcf.gz"
$ bcftools index -f output.bcf   
[E::hts_idx_check_range] Region 4461912..4299347302 cannot be stored in a csi index with min_shift = 14, n_lvls = 5. Try using min_shift = 14, n_lvls >= 7
index: failed to create index for "output.bcf"
$ bcftools --version          
bcftools 1.10.2-18-g5774f32
Using htslib 1.10.2-22-gbfc9f0d
Copyright (C) 2019 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Also, I am sure it is not just me as this bug was reported to me from someone in Finland that just installed HTSlib and BCFtools from github a few days ago.

pd3 commented

You are only showing the indexing step. Can you make sure the BCF was created by the new version as well? Either remove and recreate or look in the header for the stamp.

Otherwise, after some experimentation I see that several problems can occur:

  1. existing but empty index file
$ rm -f ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi  && touch ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi

$ bcftools view -G -r X:4380006-4461913 -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
[E::hts_idx_load3] Could not load local index file 'ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi'
Failed to read from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz: could not load index
  1. existing but truncated index file
$ wget -O xxx.tbi http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi
$ dd if=xxx.tbi of=ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi bs=1 count=100
$ bcftools view -G -r X:4380006-4461913 -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
[E::hts_idx_load3] Could not load local index file 'ALL.chrX_GRCh38.genotypes.20170504.vcf.gz.tbi'
Failed to read from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz: could not load index
  1. failed download because of remote 404 errors
$ bcftools view -G -r X:4380006-4461913 -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz
[E::hts_open_format] Failed to open file "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz" : No such file or directory
Failed to read from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz: No such file or directory

The error message in the last case could be made more informative and give a hint that the problem can be temporary, otherwise it seems to be working OK.

I am sorry to mix two different bugs, but it is not a problem of the indexing file being truncated. That was downloaded once and for all and it copied fine. What happens to me occasionally is that I run:

$ bcftools view --no-version -G -r X:4380006-4461913 -i 'ID="rs6640136" || ID="rs57675010"' -Ob -o rmme.bcf http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz

which runs without any errors or warnings. And then I get:

$ bcftools view rmme.bcf | grep -v ^##
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO

Which means the header loaded okay, but then I get no bgzf_read or hts_idx_load3 error. Somehow EBI just did not send anything. I am not sure whether HTSlib can tell that it was a 404 error message.

Sorry, you are right, the latest version of bcftools does not reproduce the malformed bcf file. May this got mixed with a malformed index file or something. However, the following recreates the problem with vcf.gz files:

$ bcftools --version
bcftools 1.10.2-18-g5774f32
Using htslib 1.10.2-22-gbfc9f0d
Copyright (C) 2019 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ (echo "##fileformat=VCFv4.1"
echo "##contig=<ID=X,length=155270560>"
echo "##INFO=<ID=END,Number=1,Type=Integer,Description=\"End coordinate of this variant\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "X\t4461533\trs6640136\tA\tG\t.\t.\t."
echo -e "X\t4461913\trs57675010\tGAAGTCATTGTACAAGGAAAACATTCTGTATGCACTAAGATTTCCCAAAAGAT\tG\t.\t.\tEND=4380006") | \
  bcftools view --no-version -Oz -o output.vcf.gz
[W::vcf_parse] INFO/END=4380006 is smaller than POS at X:4461913
$ bcftools index -ft output.vcf.gz
[E::hts_idx_push] Invalid record on sequence #1: end 4380006 < begin 4461913
index: failed to create index for "output.vcf.gz"

While bcftools 1.9 was capable of indexing just fine. You can close the issue.

But the VCF specification does not really call for the END field to always be bigger than the POS field, which is what is causing the problem here.

In fact it does — see samtools/hts-specs#266 (comment) and samtools/hts-specs#436, and also #1153.

I do find odd that tabix and previous versions of bcftools worked fine with these odd cases though.

In fact they did not. Prior to 1.10, records with INFO/END < POS could lead to bcftools index producing a broken index that would fail to return any variants for certain queries. See this biostars posting, which led to the addition of the error message you're seeing in samtools/htslib#917:

$ bcftools index -ft output.vcf.gz
[E::hts_idx_push] Invalid record on sequence #1: end 4380006 < begin 4461913

It is unclear whether your query is returning no results due to this problem with a previously built broken index. Do your query coordinates vary, or are you saying you are intermittently getting different results for X:4380006-4461913?

Okay, I have made a mess of a bug report, but here is the bug, reproducible with the latest version of bcftools:

bcftools view -i 'ID="rs59780433"' -Ou -G -r X:85725113-86470037 \
  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz | \
  bcftools norm -Ob -o output.bcf -m -any && \
  bcftools index -f output.bcf

This is quicker way to reproduce it

(echo "##fileformat=VCFv4.1"
echo "##contig=<ID=X,length=155270560>"
echo "##INFO=<ID=END,Number=1,Type=Integer,Description=\"End coordinate of this variant\">"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
echo -e "X\t86470037\trs59780433\tTTTCATTGATATTCTGCTGTGGCACTGTTAATCTTATTTTGATTGAAACAGAGTTCTGGACAGTAAAATTGATG\tTGGTTCATTGATATTCTGCTGTGGCACTGTTAATCTTATTTTGATTGAAACAGAGTTCTGGACAGTAAAATTGATG,T\t.\t.\tEND=85725113") | \
  bcftools norm --no-version -m -any -Ob -o output.bcf && \
  bcftools index -f output.bcf

In both cases the error I get is the following:

[E::hts_idx_check_range] Region 86470036..4380692409 cannot be stored in a csi index with min_shift = 14, n_lvls = 5. Try using min_shift = 14, n_lvls >= 7

The problem seems to arise from bcftools norm splitting a variant with INFO/END smaller than POS. I previously dissected the bug incorrectly.

pd3 commented

I see it now. The norm command creates a new VCF record and htslib does not check the sanity of the END tag when writing out BCF.

@jmarshall @pd3 This bug recently hit pysam. It wasn't in v0.15.4, it appeared in v0.16.0.

This is a feature, not a bug. The previous behaviour of not diagnosing the invalid INFO/END value and producing an unusable index was buggy. See #1154 (comment), in particular the biostars posting linked from there.

What program has produced the INFO/END values that you are seeing diagnostic messages about?

@jmarshall I tried to index the same chrX archive from 1000 Genomes as @freeseek.

Just in case, I quote an error message:

[E::hts_idx_push] Invalid record on sequence #1: end 4380006 < begin 4461913
X.vcf.gz.tbi... Traceback (most recent call last):
  File "ld_triangle.py", line 444, in <module>
    prep_single_proc = PrepSingleProc(args)
  File "ld_triangle.py", line 95, in __init__
    self.intgen_convdb_path = collect_intgen_data(self.intgen_dir_path)
  File "/home/platon/_0_Диссертация/00_Программы/ld-tools-master/backend/collect_intgen_data.py", line 133, in collect_intgen_data
    tabix_index(intgen_vcf_path, preset='vcf')
  File "pysam/libctabix.pyx", line 1035, in pysam.libctabix.tabix_index
OSError: building of index for /home/platon/_0_Диссертация/Exp/LD_heatmap_test/1000G/X.vcf.gz failed

@jmarshall, for background, the file being discussed is a liftover of 1000 Genomes phase three from GRCh37 to GRCh38. The END coordinate issue was introduced by the liftover process (which in this case was fairly complicated, going via dbSNP). Brief documentation (including this issue) is here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/README_GRCh38_liftover_20170504.txt. Ensembl share a version of this data where (I believe) these cases will have been filtered out - I notice that that isn't mentioned in the existing README (we will aim to update it). Also, probably worth highlighting that there are increasing numbers of call sets for the 1000 Genomes samples called directly on GRCh38 (where liftover artefacts are not an issue and all of GRCh38 was included in the calling) . We are happy to take questions on the data at info@1000genomes.org.

As far as I understand, liftover was applied for all chromosomes. Then why does this problem occur only for chrX?

I think chromosome X received special treatment. Notice for example that they forgot to liftOver the PAR1 and PAR2 regions ...

@freeseek - regarding chromosome X, as noted, the liftover process for this data set was somewhat atypical/complicated going via dbSNP (rather than liftOver directly GRCh37 to GRCh38). Given that chromosome X frequently suffers from anomalies associated with the PARs, etc., I find it plausible that there are some problems specific to chromosome X but I'm not aware that the processing of chromosome X was any different from the others (based on the information available to me).

@PlatonB - I think the safest approach is to treat these case by case. This could be down to circumstances that only occur on chromosome X but I'm afraid I can't guarantee that. You are correct that the liftover was applied to all chromosomes.

This doesn't remove the fact that the data is erroneous otherwise the newer code wouldn't be complaining. If 1000 genomes have erroneous data then they need to be informed. Previously it used this invalid fields without checking and could silently give the wrong results. There's nothing we can do to make it give the correct results as the primary data needs fixing.

@jkbonfield, agreed that this is a data issue and not an issue with the software - the current software behaviour described (not allowing the error to go undetected) sounds correct to me. For the data, 1000 Genomes are aware - this has been listed as a known issue for two years. We will, however, point people to the filtered version of the files hosted by Ensembl.