Performance notes / ideas

Question

fedarko opened this issue a year ago · 0 comments

Ideas for speeding this up

Use sparse matrices
Faster k-mer counting / comparison algorithms (update: I used suffix arrays, although the way I used them could probs be sped up)
Use input sequence size to determine which k-mer counting method to use (the older naive, memory-inefficient method is faster for small inputs than the suffix-array-based method)
Do the conversions of s1, s2, and rc(s2) to bytes at the start of matrix construction, and then do everything thereafter in bytes? (At the very least, don't convert both s2 and rc(s2) to bytes separately; that's silly.)
Use Cython / etc.
Support FASTA files as input and then process them in chunks or something -- removes need to store massively long sequences in memory (not sure how this would work with pydivsufsort, tho)
Use rolling hashes? See https://bioinformatics.stackexchange.com/questions/19/are-there-any-rolling-hash-functions-that-can-hash-a-dna-sequence-and-its-revers (and also LJA paper)
Replace rc() function with str.maketrans: https://bioinformatics.stackexchange.com/questions/3583/what-is-the-fastest-way-to-get-the-reverse-complement-of-a-dna-sequence-in-pytho
If both sequences are equal (i.e. we're creating a self dot plot), use this to speed up dot plot construction. Some ideas:
- Don't bother creating an extra suffix array
- Only fill in one half of the matrix triangle, since the upper and lower triangle in a self dot plot should be symmetric? (this might be hard to do using the suffix array approach, tho)