feat: use Hail within the Genotypes class

Question

Closed this issue 3 years ago · 1 comment

Instead of using cyvcf2 inside of the Genotypes class, we could use Hail. Why? Well...

Hail supports PLINK2 and BGEN files automatically. So we wouldn't need to change our code to work with those files later on.
I did a quick back-of-the-envelope calculation and found that PLINK2 files can be read much more quickly: in about 1% of the time it takes to read a VCF with the same number of samples and variants. So we're probably gonna want to use PLINK2 later when we scale up to large datasets. The sooner we switch to Hail, the less code we'll have to rewrite.
Hail supports easy parallelization of reads and writes, so we won't have to worry about running out of memory on a node.
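
To make the idea concrete, here's a rough sketch of how the Genotypes class could pick a Hail reader based on the file extension. The names `HAIL_READERS`, `pick_reader`, and `load_genotypes` are hypothetical and not part of our code. `hl.import_vcf`, `hl.import_plink`, and `hl.import_bgen` are real Hail functions, but note that `hl.import_plink` takes a .bed/.bim/.fam fileset; whether that covers the PLINK2 files we care about is an assumption that still needs checking.

```python
# Hypothetical mapping from file suffix to the name of a Hail import function.
HAIL_READERS = {
    ".vcf": "import_vcf",
    ".vcf.bgz": "import_vcf",
    ".bed": "import_plink",
    ".bgen": "import_bgen",
}


def pick_reader(path: str) -> str:
    """Return the name of the Hail import function for this file."""
    for suffix, reader in HAIL_READERS.items():
        if path.endswith(suffix):
            return reader
    raise ValueError(f"unsupported genotype file: {path}")


def load_genotypes(path: str):
    """Load genotypes as a Hail MatrixTable (sketch, not tested at scale)."""
    import hail as hl  # imported lazily, since Hail pulls in Spark

    reader = pick_reader(path)
    if reader == "import_vcf":
        return hl.import_vcf(path)
    if reader == "import_plink":
        # Assumes the .bim and .fam files sit next to the .bed file.
        prefix = path[: -len(".bed")]
        return hl.import_plink(
            bed=prefix + ".bed", bim=prefix + ".bim", fam=prefix + ".fam"
        )
    # BGEN files must be indexed beforehand with hl.index_bgen.
    return hl.import_bgen(path, entry_fields=["GT"])
```

The rest of the class would then work with the returned MatrixTable regardless of the input format, which is the "less code to rewrite" argument above.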

note: this issue is a WIP

Answer 1 · 2022-05-11T20:42:12.000Z

Seems like Hail does indeed require Spark, and we don't want to depend on Spark, so I'm gonna close this for now.