rasmushenningsson/VariantCallFormat.jl

Weird headers throwing error

Closed this issue · 6 comments

Hi,

I received some (G)VCF files in which header lines 16-19 look like

##GVCFBlock0-20=minGQ=0(inclusive),maxGQ=20(exclusive)
##GVCFBlock20-30=minGQ=20(inclusive),maxGQ=30(exclusive)
##GVCFBlock30-40=minGQ=30(inclusive),maxGQ=40(exclusive)
##GVCFBlock40-100=minGQ=40(inclusive),maxGQ=100(exclusive)

which causes the following error

using VariantCallFormat
file = "MH0289561.v1.1a091483-abb5-4bb4-b14a-b5e4046d0a84.rb.g.vcf"
reader = VCF.Reader(open(file))

ERROR: VariantCallFormat.Reader file format error on line 16
Stacktrace:
 [1] error(::String, ::Int64)
   @ Base ./error.jl:42
 [2] _readheader!(reader::VariantCallFormat.Reader, state::BioCore.Ragel.State{BufferedStreams.BufferedInputStream{IOStream}})
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:106
 [3] readheader!(reader::VariantCallFormat.Reader)
   @ VariantCallFormat ~/.julia/packages/BioCore/YBJvb/src/ReaderHelper.jl:80
 [4] Reader
   @ ~/.julia/packages/VariantCallFormat/wT4q6/src/reader.jl:7 [inlined]
 [5] VariantCallFormat.Reader(input::IOStream)
   @ VariantCallFormat ~/.julia/packages/VariantCallFormat/wT4q6/src/reader.jl:20
 [6] top-level scope
   @ REPL[7]:1

However, the following works (with the - deleted from the key name)

##GVCFBlock020=minGQ=0(inclusive),maxGQ=20(exclusive)
##GVCFBlock2030=minGQ=20(inclusive),maxGQ=30(exclusive)
##GVCFBlock3040=minGQ=30(inclusive),maxGQ=40(exclusive)
##GVCFBlock40100=minGQ=40(inclusive),maxGQ=100(exclusive)

Could you consider changing the behavior of your package? I'm not sure, however, whether a - in a header key violates the VCF spec (this is v4.2).

Hi!

The specification does not explicitly state what characters are allowed for the keys.
It seems reasonable to me to support arbitrary UTF8 keys (only disallowing = in the key name, since that separates the key from the value).
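To illustrate the rule described above (everything before the first = is the key, the rest is the value), here is a minimal sketch in plain Julia. The helper name split_meta_line is hypothetical and not part of VariantCallFormat.jl; the package's real parser is generated differently, so this only shows the intended semantics.

```julia
# Hypothetical helper, NOT part of VariantCallFormat.jl: split a "##key=value"
# meta-information line at the first '=' only, so the key may contain any
# other character and the value may contain further '=' signs.
function split_meta_line(line::AbstractString)
    startswith(line, "##") || error("not a meta-information line: $line")
    body = line[3:end]                      # drop the leading "##"
    parts = split(body, '='; limit=2)       # split at the first '=' only
    length(parts) == 2 || error("no '=' separator in: $line")
    return String(parts[1]), String(parts[2])
end

key, value = split_meta_line("##GVCFBlock0-20=minGQ=0(inclusive),maxGQ=20(exclusive)")
println(key)    # GVCFBlock0-20
println(value)  # minGQ=0(inclusive),maxGQ=20(exclusive)
```

With this rule, the - in GVCFBlock0-20 needs no special handling, because only the first = matters.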

For personal reasons, I don't have much time this week. But I hope to take a look at the implementation early next week.

Hi, sorry for the slow response. I looked into this quickly, and solving it as generally as I want (arbitrary UTF8 keys) is slightly trickier than I first imagined.

A good first step would probably be to support more characters (including -) without aiming for full UTF8 yet. Would that still be useful for you?

Hi, thanks for getting back! That'll definitely be useful!

Currently I just delete the extra - whenever I need to work with these files, but it might be a good idea to add support for at least - since I think GVCF files are rather common? Other people may encounter the same issue.
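A minimal sketch of that workaround in Julia, in case it helps others. The function names sanitize_header_line and sanitize_vcf are made up for illustration and are not part of VariantCallFormat.jl. The sketch assumes that only the key of each ## header line needs changing and that renaming those keys is acceptable for your downstream use.

```julia
# Hypothetical workaround, NOT part of VariantCallFormat.jl: remove '-' and '+'
# from the key of each "##key=value" header line. Values, the "#CHROM" column
# header, and the records themselves pass through unchanged.
function sanitize_header_line(line::AbstractString)
    startswith(line, "##") || return String(line)
    parts = split(line, '='; limit=2)            # key is everything before the first '='
    length(parts) == 2 || return String(line)    # no '=' found: leave the line alone
    key = replace(parts[1], r"[-+]" => "")
    return string(key, '=', parts[2])
end

# Write a sanitized copy of the file `src` to `dst`, line by line.
function sanitize_vcf(src::AbstractString, dst::AbstractString)
    open(dst, "w") do out
        for line in eachline(src)
            println(out, sanitize_header_line(line))
        end
    end
    return dst
end
```

The sanitized copy can then be opened the same way as in the original report, e.g. VCF.Reader(open(sanitize_vcf(file, "sanitized.g.vcf"))), where "sanitized.g.vcf" is just a placeholder output path.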

For reference, I'm currently one of these "other people" experiencing this issue. Just adding - and + would be appreciated.

Sorry for keeping you all waiting. I have not forgotten about this. Just overwhelmed with other work. I hope I can fix it soon.

I have added support for - and + in header tags and dict keys. A new release (v0.5.5) is currently being registered.

If you run into any problems, please reopen this issue.