rasmushenningsson/VariantCallFormat.jl

Read from middle of VCF file?

Opened this issue · 3 comments

I suppose this package is the same as GeneticVariation.jl, but is there any chance it will support multithreaded reading? The standard loop

using VariantCallFormat  # assumed to make the VCF module available

reader = VCF.Reader(open("example.vcf", "r"))
for record in reader
    # do something
end
close(reader)

requires looping over every record. On large VCF files, just looping through all records can take a few hours. Essentially, we need some way to query the reader at the i-th position.

Yes, this is a feature that I would really like. There are other, more urgent changes needed though, so it might take some time before I get to it.

My idea would be to support index files (.tbi or .csi). Then you could create a Reader for, e.g., a specific chromosome, and thus work in parallel on one file by processing different chromosomes on different threads. Would this be in line with what you need?
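To make the per-chromosome idea concrete, here is a rough sketch that uses only Julia's Base on an uncompressed, chromosome-sorted VCF. It is not how the package works and it does not read .tbi or .csi files: the helpers `chromosome_offsets`, `count_records`, and `count_by_chromosome` are made up for illustration, and the offset table is built with one sequential pass, which a real index file would let you skip.

```julia
# Sketch only: plain-text VCF, Base Julia, hypothetical helper names.

# Map each chromosome to the byte offset of its first record.
# Assumes records are grouped by chromosome, as in a sorted VCF.
function chromosome_offsets(path::AbstractString)
    offsets = Pair{String,Int64}[]
    open(path, "r") do io
        while !eof(io)
            pos = position(io)
            line = readline(io)
            (isempty(line) || startswith(line, "#")) && continue  # skip header lines
            chrom = String(first(split(line, '\t'; limit=2)))
            if isempty(offsets) || first(last(offsets)) != chrom
                push!(offsets, chrom => pos)
            end
        end
    end
    return offsets
end

# Count the records of one chromosome, starting at its byte offset.
function count_records(path::AbstractString, chrom::AbstractString, offset::Integer)
    open(path, "r") do io            # every task opens its own file handle
        seek(io, offset)
        n = 0
        while !eof(io)
            startswith(readline(io), chrom * "\t") || break  # next chromosome reached
            n += 1
        end
        n
    end
end

# Process the chromosomes on separate threads (start Julia with `-t N`).
function count_by_chromosome(path::AbstractString)
    offsets = chromosome_offsets(path)
    counts = Vector{Int}(undef, length(offsets))
    Threads.@threads for i in eachindex(offsets)
        chrom, offset = offsets[i]
        counts[i] = count_records(path, chrom, offset)
    end
    return Dict(first(p) => c for (p, c) in zip(offsets, counts))
end
```

Counting is just a stand-in for real per-record work. The point is that each thread seeks to its own starting offset on its own handle, so no thread has to iterate over records that belong to another chromosome.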

Yes, that sounds great! I'll try to look into index files too, and try to help out in some way if possible.

Since indexed files usually use BGZF compression (a block-gzip variant), it may be useful to look at how random access is done on such files. Aside from tabix, grabix, and bcftools, I also noticed a few Julia packages related to the BGZF format, most notably BGZFStreams. Packages that handle BAM could also be used for inspiration.
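For reference, here is a small sketch of the two pieces of BGZF that make random access possible, based on the BGZF description in the SAM specification. Index files store "virtual offsets" that combine the position of a compressed block with a position inside its decompressed data, and every block header records the block's own size. The function names below are hypothetical and are not taken from BGZFStreams or any other package.

```julia
# Sketch only: Base Julia, hypothetical function names.

# A virtual offset packs two numbers into 64 bits:
#   upper 48 bits = byte offset of the block start in the compressed file
#   lower 16 bits = offset inside that block's uncompressed data
split_virtual_offset(v::UInt64) = (block_start = v >> 16, within_block = v & 0xffff)

make_virtual_offset(block_start::Integer, within_block::Integer) =
    (UInt64(block_start) << 16) | UInt64(within_block)

# Read the total size of a BGZF block from its first 18 header bytes.
# Assumes the "BC" subfield is the first gzip extra subfield, which is the
# layout commonly written by BGZF tools.
function bgzf_block_size(header::AbstractVector{UInt8})
    length(header) >= 18 || error("need at least 18 header bytes")
    header[1] == 0x1f && header[2] == 0x8b || error("not a gzip member")
    header[13] == UInt8('B') && header[14] == UInt8('C') || error("BC subfield not found")
    bsize = UInt16(header[17]) | (UInt16(header[18]) << 8)  # little-endian BSIZE
    return Int(bsize) + 1                                    # BSIZE is block size minus 1
end
```

Given a virtual offset from an index, a reader would seek to `block_start` in the compressed file, decompress that one block, and skip `within_block` bytes of the result. That is what lets a reader jump into the middle of a compressed file without decompressing everything before it.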