BioJulia/GeneticVariation.jl

Looping over a VCF file seems to incur huge memory usage

biona001 opened this issue · 0 comments

I'm writing a routine to import a VCF file as a numeric matrix, but the reported memory usage is much larger than I expected.

As a minimal working example, consider the code below, which loops over every genotype entry of every record in a VCF file:

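using GeneticVariation
function loop_vcf()
    reader = VCF.Reader(open("target.vcf", "r"))
    s = 0
    # Visit every record and, within it, every entry of its genotype field;
    # the body only counts iterations.
    for record in reader, geno in record.genotype
        s += 1
    end
    close(reader)
    return s
end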

On a test dataset (target.vcf.gz, which must be decompressed first) with 3000 records and 100 samples, I get the following benchmark:

using BenchmarkTools
@benchmark loop_vcf()
BenchmarkTools.Trial:
  memory estimate:  98.64 MiB
  allocs estimate:  941005
  --------------
  minimum time:     62.249 ms (5.75% GC)
  median time:      63.186 ms (5.99% GC)
  mean time:        63.835 ms (6.75% GC)
  maximum time:     79.381 ms (5.22% GC)
  --------------
  samples:          79
  evals/sample:     1

Why is the memory estimate so large? My data file target.vcf is only 1.3 MB on disk, so a 98.64 MiB estimate with about 941,000 allocations looks highly suspicious.
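
One thing I have not ruled out: the `memory estimate` reported by BenchmarkTools counts all bytes allocated during the call, not the peak amount held at any one time. So a loop that creates many short-lived objects can report a total far larger than the input file. Below is a minimal Base-only sketch of that effect. The function `count_with_temporaries`, the global `LAST_BUF`, and the 1 KiB buffer size are hypothetical stand-ins for illustration, not GeneticVariation code, and I am only assuming the VCF loop behaves similarly.

```julia
# Hypothetical stand-in for a record-by-record loop. Each iteration allocates
# a fresh 1 KiB buffer, and only the most recent one stays reachable.
const LAST_BUF = Ref{Vector{UInt8}}()

function count_with_temporaries(n)
    s = 0
    for _ in 1:n
        buf = zeros(UInt8, 1024)   # new allocation on every iteration
        LAST_BUF[] = buf           # earlier buffers become garbage here
        s += 1
    end
    return s
end

count_with_temporaries(1)          # warm up so compilation is not measured

# @allocated reports the total bytes allocated while evaluating the call,
# even though at most one buffer is reachable from LAST_BUF at any moment.
bytes = @allocated count_with_temporaries(3000)
println("iterations:            ", count_with_temporaries(3000))
println("total bytes allocated: ", bytes)
println("one buffer (bytes):    ", sizeof(LAST_BUF[]))
```

If the same thing is happening in `loop_vcf`, the 98.64 MiB would be the sum of many small temporary allocations made while iterating rather than memory that is all resident at once. I would still like to know whether that many allocations are expected here.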