Looping over a VCF file seems to incur huge memory usage
biona001 opened this issue · 0 comments
I'm writing a routine to import a VCF file as a numeric matrix, but the memory usage is much larger than I expected.
As a minimal working example, consider the code below, which simply loops over a VCF file:
using GeneticVariation

function loop_vcf()
    reader = VCF.Reader(open("target.vcf", "r"))
    s = 0
    # Visit every genotype entry of every record and count them
    for record in reader, geno in record.genotype
        s += 1
    end
    close(reader)
    return s
end
On a test dataset (target.vcf.gz, which must be decompressed first) with 3000 records and 100 samples, I get the following benchmark:
using BenchmarkTools
@benchmark loop_vcf()
BenchmarkTools.Trial:
  memory estimate:  98.64 MiB
  allocs estimate:  941005
  --------------
  minimum time:     62.249 ms (5.75% GC)
  median time:      63.186 ms (5.99% GC)
  mean time:        63.835 ms (6.75% GC)
  maximum time:     79.381 ms (5.22% GC)
  --------------
  samples:          79
  evals/sample:     1
Why is so much memory being allocated? My data file target.vcf is only 1.3 MB on disk, so this memory usage seems highly suspicious.
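
For scale, the benchmark figures can be broken down per record and per genotype entry. The sketch below uses only the numbers reported above (3000 records, 100 samples, 98.64 MiB, 941005 allocations); the variable names are illustrative and not part of any library. Note that the BenchmarkTools memory estimate is the total number of bytes allocated during one evaluation, not peak resident memory, so it can greatly exceed the file size when many small, short-lived objects are created.

```julia
# Back-of-the-envelope breakdown of the benchmark figures reported above.
# All inputs are copied from the issue; nothing here touches the VCF file.
n_records    = 3000            # records in target.vcf
n_samples    = 100             # samples per record
total_bytes  = 98.64 * 2^20    # memory estimate, converted from MiB to bytes
total_allocs = 941005          # allocs estimate

allocs_per_record   = total_allocs / n_records
bytes_per_record    = total_bytes / n_records
allocs_per_genotype = total_allocs / (n_records * n_samples)
bytes_per_alloc     = total_bytes / total_allocs

println("allocations per record:   ", round(allocs_per_record, digits = 1))
println("bytes per record:         ", round(bytes_per_record, digits = 1))
println("allocations per genotype: ", round(allocs_per_genotype, digits = 2))
println("bytes per allocation:     ", round(bytes_per_alloc, digits = 1))
```

This works out to roughly 314 allocations per record, a little over 3 per genotype entry, and around 110 bytes per allocation. That points to many small allocations made while iterating rather than one large buffer.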