BioJulia/GeneticVariation.jl

BCF.chrom always returns 1

Opened this issue · 3 comments

Hello,

I am unable to get BCF.chrom to output anything other than 1 (Julia 1.1.0; RHEL 7.6).

I tested this with multiple BCF files. When inspected with bcftools view, these same files show the correct chromosome name, in the expected field:

> bcftools view bcftest.bcf.gz
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype calls">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##contig=<ID=22>
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype call probabilities">
##INFO=<ID=QUAL,Number=1,Type=Float,Description="Imputation Quality">
##INFO=<ID=TAG,Number=1,Type=String,Description="rs_id from annotation files">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view bcftest.bcf.gz; Date=Wed Jun 12 02:41:11 2019
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  XXXX
22      16050075        22:16050075:A:G A       G       0.123   PASS    TAG=22:16050075:A:G;AC=0;AN=0   GT:DS:GP        ./.:0:1,0,0
22      16050115        22:16050115:G:A G       A       0.404   PASS    TAG=22:16050115:G:A;AC=0;AN=0   GT:DS:GP        ./.:0:1,0,0

However, BCF.chrom tells a different story:

julia> using GeneticVariation

julia> reader = BCF.Reader(open("bcftest.bcf.gz", "r"))
GeneticVariation.BCF.Reader{IOStream}((0x02, 0x02), GeneticVariation.VCF.Header:
  metainfo tags: fileformat FILTER FORMAT contig INFO
     sample IDs: XXXX, BGZFStreams.BGZFStream{IOStream}(<mode=read>))

julia> record = read(reader)
GeneticVariation.BCF.Record:
   chromosome: 1
     position: 16050075
   identifier: 22:16050075:A:G
    reference: A
    alternate: G
      quality: 0.123
       filter: 1
  information: Tuple{Int64,Any}[(5, "22:16050075:A:G"), (6, 0), (7, 0)]

julia> BCF.chrom(record)
1

julia> BCF.pos(record)
16050075

The issue doesn't seem connected to the actual implementation of BCF.chrom(). Reading straight from the BGZFStream causes the same issue:

julia> reader = BCF.Reader(open("bcftest.bcf.gz", "r"))
GeneticVariation.BCF.Reader{IOStream}((0x02, 0x02), GeneticVariation.VCF.Header:
  metainfo tags: fileformat FILTER FORMAT contig INFO
     sample IDs: XXXX, BGZFStreams.BGZFStream{IOStream}(<mode=read>))

julia> misc = read(reader.stream, Int64) #skip first 8 bytes
115964117068

julia> chrom = read(reader.stream, Int32) %Int + 1 #next 4 bytes are CHR
1

julia> pos = read(reader.stream, Int32) %Int + 1 #next 4 bytes are POS
16050075


Reproducible example attached: bcftest.bcf.gz

An update:

This does seem to be a bug in GeneticVariation's handling of BCF files. Long story short, instead of storing the explicit chromosome name like in plaintext VCF format, the CHROM field in BCF files stores indices to the list of 'contig' field headers.

In the example file, there is only one ##contig, with corresponding value 22. Instead of 1, BCF.Chrom should probably return the value specified by the first ##contig, i.e. 22.

Per the VCFv4.3 specs:

6.2.2 Dictionary of contigs
The CHROM field in BCF2 is encoded as an integer offset into the list of ##contig field headers in the VCF header. The offsets begin, like the dictionary of strings, at 0. So for example if in BCF2 the contig value is 10, this indicates that the actual chromosome is the 11th element in the ordered list of ##contig elements [...]

Can you test how this is with TranscodingStreams.jl? All the IO packages are to be upgraded to use transcoding streams eventually for their 2.0 releases. If there's an issue with BGZFStreams, it may be better to use this issue as a reason to get a move on with moving GeneticVariation over to using TranscodingStreams, rather than bug hunt in a package we intend to drop eventually.

I'm fairly sure the issue involves the handling of correctly-decoded data (looks like I updated my post seconds before you commented), so moving to TranscodingStreams.jl shouldn't move the needle much on this particular problem. If GeneticVariation is still intended as the main package for BCF/VCF handling, I'm happy to contribute PR on both fronts, though