Performance notes / ideas
fedarko opened this issue · 0 comments
fedarko commented
Ideas for speeding this up
- Use sparse matrices
- Faster k-mer counting / comparison algorithms (update: I used suffix arrays, although the way I used them could probs be sped up)
- Use input sequence size to determine which k-mer counting method to use (the older naive, memory-inefficient method is faster for small inputs than the suffix-array-based method)
- Do the conversions of
s1
,s2
, andrc(s2)
to bytes at the start of matrix construction, and then do everything thereafter in bytes? (At the very least, don't convert boths2
andrc(s2)
to bytes separately; that's silly.) - Use Cython / etc.
- Support FASTA files as input and then process them in chunks or something -- removes need to store massively long sequences in memory (not sure how this would work with pydivsufsort, tho)
- Use rolling hashes? See https://bioinformatics.stackexchange.com/questions/19/are-there-any-rolling-hash-functions-that-can-hash-a-dna-sequence-and-its-revers (and also LJA paper)
- Replace
rc()
function withstr.maketrans
: https://bioinformatics.stackexchange.com/questions/3583/what-is-the-fastest-way-to-get-the-reverse-complement-of-a-dna-sequence-in-pytho - If both sequences are equal (i.e. we're creating a self dot plot), use this to speed up dot plot construction. Some ideas:
- Don't bother creating an extra suffix array
- Only fill in one half of the matrix triangle, since the upper and lower triangle in a self dot plot should be symmetric? (this might be hard to do using the suffix array approach, tho)