samtools/htsjdk

Can HTSJDK use a VCF index to quickly count total records in a VCF?

bbimber opened this issue · 9 comments

Hello,

When working with a large VCF, iterating over all features to determine the total variant count is slow. Can HTSJDK use a VCF index to quickly count the total records in a VCF?

Thanks

Someone else may have a more definitive answer, but I think the linear index part of a Tribble index (.idx) has that information, per-chromosome. I don't think tabix does.

@cmnbroad well it should be possible as you can get this information with bcftools index -s in.vcf.gz

Exactly. I also didn't know this was possible, but bcftools apparently can do it. It would be very useful to be able to get the variant count like this for big files.

@yfarjoun with a recent version of bcftools, I'm able to extract the number of variants/chrom with a tbi index and bcftools index -s.
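For anyone who wants to use this from Java until HTSJDK supports it directly, here is a minimal sketch that sums the per-contig counts from the text printed by bcftools index -s. It assumes the output is three tab-separated columns (contig name, contig length or "." if unknown, record count), which is how I understand the bcftools documentation. The class and method names are hypothetical and are not part of HTSJDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not an HTSJDK class.
final class BcftoolsIndexStats {

    // Parses the text printed by `bcftools index -s`, assumed to be
    // "contig<TAB>length<TAB>records" per line, into contig -> record count.
    static Map<String, Long> parse(String statsOutput) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : statsOutput.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cols = line.split("\t");
            if (cols.length < 3) {
                throw new IllegalArgumentException("Unexpected line: " + line);
            }
            // Column 2 (length) may be "." and is ignored here.
            counts.put(cols[0], Long.parseLong(cols[2].trim()));
        }
        return counts;
    }

    // Sums the per-contig counts to get the total number of records.
    static long total(Map<String, Long> counts) {
        long sum = 0;
        for (long c : counts.values()) {
            sum += c;
        }
        return sum;
    }
}
```

The input string would come from running bcftools index -s in.vcf.gz in a subprocess and capturing its standard output. Calling total(parse(output)) then gives the overall record count without iterating the VCF.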

@yfarjoun bcftools. (But I think both tools now use the same C code for tbi.)

@yfarjoun the C code collecting metadata is here : https://github.com/samtools/htslib/blob/1d79f449cb3b02eda8fc151556395b7b50ccd730/hts.c#L2857

Indexes (both .tbi and .csi) made by tabix include extra data about the indexed file. The function returns a pointer to this data. Note that the data is stored exactly as it is in the index. Callers need to interpret the results themselves, including knowing what sort of data to expect, byte swapping, etc.

all of our indexes are made by tabix and have this info, which makes sense if bcftools/tabix share the same code
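To make the idea concrete, here is a rough sketch of how the per-contig counts might be read from a .tbi index in plain Java. It relies on assumptions that are not confirmed in this thread: that htslib stores the counts in a pseudo-bin numbered 37450 whose second chunk holds the mapped and unmapped record counts, and that the input bytes have already been BGZF-decompressed. The class and method names are hypothetical. This is not an existing HTSJDK API, and it does not handle .csi indexes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not an HTSJDK class.
final class TabixRecordCounts {

    // Assumed pseudo-bin number used by htslib for per-contig statistics.
    static final int META_BIN = 37450;

    // Reads contig -> record count from the decompressed bytes of a .tbi file.
    // Contigs without a pseudo-bin are left out of the map (count unknown).
    static Map<String, Long> read(byte[] decompressedTbi) {
        ByteBuffer buf = ByteBuffer.wrap(decompressedTbi).order(ByteOrder.LITTLE_ENDIAN);
        byte[] magic = new byte[4];
        buf.get(magic);
        if (magic[0] != 'T' || magic[1] != 'B' || magic[2] != 'I' || magic[3] != 1) {
            throw new IllegalArgumentException("Not a tabix index");
        }
        int nRef = buf.getInt();
        // Skip format, col_seq, col_beg, col_end, meta, skip (six int32 fields).
        buf.position(buf.position() + 6 * 4);
        byte[] nameBytes = new byte[buf.getInt()];
        buf.get(nameBytes);
        // Sequence names are stored as concatenated NUL-terminated strings.
        String[] names = new String(nameBytes, StandardCharsets.US_ASCII).split("\0");
        if (names.length < nRef) {
            throw new IllegalArgumentException("Fewer names than sequences");
        }

        Map<String, Long> counts = new LinkedHashMap<>();
        for (int ref = 0; ref < nRef; ref++) {
            int nBin = buf.getInt();
            for (int b = 0; b < nBin; b++) {
                int bin = buf.getInt();
                int nChunk = buf.getInt();
                if (bin == META_BIN && nChunk == 2) {
                    buf.position(buf.position() + 16); // chunk 0: file offsets
                    long records = buf.getLong();      // chunk 1, first field: record count
                    buf.getLong();                     // chunk 1, second field: unplaced count
                    counts.put(names[ref], records);
                } else {
                    buf.position(buf.position() + nChunk * 16); // ordinary bin: skip chunks
                }
            }
            int nIntv = buf.getInt();
            buf.position(buf.position() + nIntv * 8); // skip the linear index
        }
        return counts;
    }
}
```

Summing the values of the returned map would give the total record count. Indexes written by tools that do not emit the pseudo-bin would produce an empty map, so a caller would still need a fallback that iterates the file.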