Create and query an index of many human genomes?

Question

jeremymsimon opened this issue 2 years ago · 0 comments

Hey Team Pufferfish-
I'm wondering what you would consider best practice for indexing many (hundreds? thousands?) whole human genomes and then querying them. Say I have a handful of 25-mers (k=25) and want to find out whether each one occurs in my index, in which genomes, at which positions, and how many times. Although the index is huge, each query will always be small, no more than a few hundred k-mers at a time, unlike alignment of RNA-seq reads or similar.
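To make the query semantics concrete, here is a toy sketch in Python of the kind of lookup I have in mind. It is not how Pufferfish or any related tool works internally, and it would never scale to human genomes; the `genomes` layout and the function names `build_index` and `query` are hypothetical, chosen only for illustration. It also assumes matches should be strand-agnostic, so k-mers are stored in canonical form.

```python
from collections import defaultdict

K = 25  # k-mer length from the question
_VALID = frozenset("ACGT")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def build_index(genomes, k=K):
    """Map each canonical k-mer to a list of (genome, contig, offset) hits.

    `genomes` is a hypothetical in-memory layout:
    {genome_name: {contig_name: sequence}}.
    """
    index = defaultdict(list)
    for genome, contigs in genomes.items():
        for contig, seq in contigs.items():
            seq = seq.upper()
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                # Skip windows containing N or other ambiguity codes.
                if _VALID.issuperset(kmer):
                    index[canonical(kmer)].append((genome, contig, i))
    return index


def query(index, kmer):
    """Report whether, in which genomes, where, and how often a k-mer occurs."""
    hits = index.get(canonical(kmer.upper()), [])
    per_genome = defaultdict(int)
    for genome, _contig, _offset in hits:
        per_genome[genome] += 1
    return {
        "present": bool(hits),
        "total": len(hits),
        "per_genome": dict(per_genome),
        "locations": list(hits),
    }
```

Each query returns the four pieces of information above: presence, total count across the index, per-genome counts, and the (genome, contig, offset) of every hit.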

In a somewhat miniaturized (albeit still relatively giant) test, I grabbed sequences from the Human Pangenome Reference (n=94 genomes + CHM13) and am attempting to index them with pufferfish. The counting step alone used >600GB of RAM, and the build is seemingly nowhere near complete after 24 hours of runtime with 12 threads.

Is this at all feasible? Is Pufferfish an appropriate tool for this? Or are related tools like fulgor or cuttlefish or others better suited for this scale?

I figured other standard k-mer counting tools might be more efficient, but from my non-expert perspective it seemed I'd likely sacrifice knowing the genomic location of each match, and perhaps also whether a given k-mer matched multiple times within the same genome and/or across the whole index. Unless it is truly necessary to give it up, this is information I'd like to retain in my output. Also note that I'm currently approaching this without any downsampling or sparsity (in other words, a fully dense index with every k-mer represented), but if needed I may be able to employ some k-mer selection tricks.
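As an illustration of what a k-mer selection trick could look like, here is a minimal Python sketch of hash-based subsampling. This is an assumption for discussion only: no specific scheme has been chosen, and the function name `is_sampled` and the `rate` parameter are hypothetical. The idea is to keep a k-mer only if a stable hash of its canonical form falls in a fixed residue class, so the same k-mers are selected in every genome and at query time.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def is_sampled(kmer, rate=16):
    """Keep roughly 1 in `rate` canonical k-mers, chosen by a stable hash.

    A cryptographic digest is used instead of Python's built-in hash(),
    which is randomized per process for strings.
    """
    digest = hashlib.blake2b(canonical(kmer).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % rate == 0
```

The trade-off for my use case is that a query k-mer can only be found if it happens to be one of the sampled k-mers, so this kind of sparsity would cost sensitivity when the queries are individual 25-mers rather than longer reads.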

Curious to hear your thoughts on this, and thanks as always!