BCF.chrom always returns 1
Opened this issue · 3 comments
Hello,
I am unable to get BCF.chrom
to output anything other than 1
(Julia 1.1.0; RHEL 7.6).
I tested this with multiple BCF files. When inspected with bcftools view
, these same files show the correct chromosome name, in the expected field:
> bcftools view bcftest.bcf.gz
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype calls">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##contig=<ID=22>
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype call probabilities">
##INFO=<ID=QUAL,Number=1,Type=Float,Description="Imputation Quality">
##INFO=<ID=TAG,Number=1,Type=String,Description="rs_id from annotation files">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view bcftest.bcf.gz; Date=Wed Jun 12 02:41:11 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT XXXX
22 16050075 22:16050075:A:G A G 0.123 PASS TAG=22:16050075:A:G;AC=0;AN=0 GT:DS:GP ./.:0:1,0,0
22 16050115 22:16050115:G:A G A 0.404 PASS TAG=22:16050115:G:A;AC=0;AN=0 GT:DS:GP ./.:0:1,0,0
However, BCF.chrom
tells a different story:
julia> using GeneticVariation
julia> reader = BCF.Reader(open("bcftest.bcf.gz", "r"))
GeneticVariation.BCF.Reader{IOStream}((0x02, 0x02), GeneticVariation.VCF.Header:
metainfo tags: fileformat FILTER FORMAT contig INFO
sample IDs: XXXX, BGZFStreams.BGZFStream{IOStream}(<mode=read>))
julia> record = read(reader)
GeneticVariation.BCF.Record:
chromosome: 1
position: 16050075
identifier: 22:16050075:A:G
reference: A
alternate: G
quality: 0.123
filter: 1
information: Tuple{Int64,Any}[(5, "22:16050075:A:G"), (6, 0), (7, 0)]
julia> BCF.chrom(record)
1
julia> BCF.pos(record)
16050075
The issue doesn't seem connected to the actual implementation of BCF.chrom()
. Reading straight from the BGZFStream
causes the same issue:
julia> reader = BCF.Reader(open("bcftest.bcf.gz", "r"))
GeneticVariation.BCF.Reader{IOStream}((0x02, 0x02), GeneticVariation.VCF.Header:
metainfo tags: fileformat FILTER FORMAT contig INFO
sample IDs: XXXX, BGZFStreams.BGZFStream{IOStream}(<mode=read>))
julia> misc = read(reader.stream, Int64) #skip first 8 bytes
115964117068
julia> chrom = read(reader.stream, Int32) %Int + 1 #next 4 bytes are CHR
1
julia> pos = read(reader.stream, Int32) %Int + 1 #next 4 bytes are POS
16050075
Reproducible example attached: bcftest.bcf.gz
An update:
This does seem to be a bug in GeneticVariation's handling of BCF files. Long story short, instead of storing the explicit chromosome name like in plaintext VCF format, the CHROM field in BCF files stores indices to the list of 'contig' field headers.
In the example file, there is only one ##contig, with corresponding value 22
. Instead of 1
, BCF.Chrom
should probably return the value specified by the first ##contig, i.e. 22
.
Per the VCFv4.3 specs:
6.2.2 Dictionary of contigs
The CHROM field in BCF2 is encoded as an integer offset into the list of ##contig field headers in the VCF header. The offsets begin, like the dictionary of strings, at 0. So for example if in BCF2 the contig value is 10, this indicates that the actual chromosome is the 11th element in the ordered list of ##contig elements [...]
Can you test how this is with TranscodingStreams.jl? All the IO packages are to be upgraded to use transcoding streams eventually for their 2.0 releases. If there's an issue with BGZFStreams, it may be better to use this issue as a reason to get a move on with moving GeneticVariation over to using TranscodingStreams, rather than bug hunt in a package we intend to drop eventually.
I'm fairly sure the issue involves the handling of correctly-decoded data (looks like I updated my post seconds before you commented), so moving to TranscodingStreams.jl shouldn't move the needle much on this particular problem. If GeneticVariation is still intended as the main package for BCF/VCF handling, I'm happy to contribute PR on both fronts, though