slowikj/seqR

Change the algorithm for positional k-mer space

slowikj opened this issue · 0 comments

An integer representing the position of a k-mer in a sequence tends to be larger than the relatively small P constants used in the hash function formula. Therefore, using the position as an input when hashing positional k-mers is not recommended.
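To illustrate the concern, here is a minimal sketch that assumes the hash is a polynomial hash of the form `h = (h * P + v) mod M`. The exact formula used in seqR is not given in this issue, so `poly_hash`, `P_SMALL`, `P_LARGE` and the modulus are hypothetical. When an input value (here, the position) reaches P, two different inputs can produce the same hash:

```python
# Hypothetical polynomial hash; not taken from the seqR code base.
def poly_hash(values, p, m=(1 << 61) - 1):
    h = 0
    for v in values:
        h = (h * p + v) % m
    return h

P_SMALL = 101        # assumed "relatively small" P
P_LARGE = 1_000_003  # assumed larger P

# Each input is [encoded k-mer item, position]; the position is hashed like any other item.
first = [1, 0]       # item 1 at position 0
second = [0, 101]    # item 0 at position 101 (position >= P_SMALL)

collides_small = poly_hash(first, P_SMALL) == poly_hash(second, P_SMALL)
collides_large = poly_hash(first, P_LARGE) == poly_hash(second, P_LARGE)
print(collides_small, collides_large)
```

In this toy case the two inputs collide for the small P but not for the larger one, which would hold only as long as positions stay below P.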

How to fix it?
There are three possible solutions:

  1. Use a large P
  2. Use a different positional k-mer hashing approach: a (d + 1)-dimensional representation of a k-mer, where one extra integer stores the untransformed k-mer position
  3. Change the dictionary structure and use a specialized version for the positional variant; for example, a 2-level approach in which the first level is the position and the second level is the hash
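Options 2 and 3 could look roughly like the sketch below. It is only an illustration under assumptions: `kmer_hash`, `count_with_tuple_keys` and `count_two_level` are hypothetical names, the polynomial hash form is assumed, and Python dictionaries stand in for whatever dictionary structure seqR actually uses. In both variants the position is never fed into the hash function.

```python
from collections import defaultdict

# Hypothetical hash of the k-mer content only (position is NOT hashed).
def kmer_hash(kmer, p=101, m=(1 << 61) - 1):
    h = 0
    for ch in kmer:
        h = (h * p + ord(ch)) % m
    return h

# Option 2 (sketch): the key carries one extra integer, the raw position.
def count_with_tuple_keys(sequences, k):
    counts = defaultdict(int)
    for seq in sequences:
        for pos in range(len(seq) - k + 1):
            counts[(pos, kmer_hash(seq[pos:pos + k]))] += 1
    return counts

# Option 3 (sketch): first level is the position, second level is the hash.
def count_two_level(sequences, k):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for pos in range(len(seq) - k + 1):
            counts[pos][kmer_hash(seq[pos:pos + k])] += 1
    return counts

sequences = ["ACGT", "ACGA"]
flat = count_with_tuple_keys(sequences, 2)
nested = count_two_level(sequences, 2)
```

With these two example sequences, position 0 holds a single k-mer ("AC") seen twice, while position 2 holds two different k-mers ("GT" and "GA").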