smarco/WFA-paper

Non-zero match score in WFA

yangao07 opened this issue · 4 comments

Continue with what we discussed in #20 , the fast extending of matched bases is definitely the most important feature in WFA, and should not be sacrificed.

To use a non-zero match score in WFA, here is my idea:
Change the original penalty a penalty compared to the max. possible score n * M, n is the query length and M/X/O/E are the score/penalty of match/mismatch/gap open/gap extension.

This way, the match operation still holds the 0 penalty and the fast exact extending can still be used.
For substitution, the penalty becomes X+M.
For deletion/insertion, the gap open penalty becomes O+E or M+O+E, depending on the wavefront's k value. Same for gap extension, the penalty becomes E or M+E, depending on k.

Overall, the WFA works the same, only a few operations need to be added.

I just made a few slides to illustrate this idea, let me know if you have any thoughts about it.
https://docs.google.com/presentation/d/1Na5fRUkuEuLq7lpiiszEIbC4ypTwr1Sg5o9_kX9rg-U/edit?usp=sharing

Cheers!
Yan

Here is the non-zero-match version of WFA: https://github.com/yangao07/WFA/tree/non-zero-match.
nzm-gap-affine-wfa is also added to align_benchmark.
There are some potential bugs though.

Hello again.

Thank you very much for the slides. Nevertheless, I must confess that I am not entirely following your stream-of-though. In particular, I would appreciate elaboration on these two questions:

  1. Assuming M>0, I don't fully understand why matches on certain diagonals account for the overall score, but on others don't
  2. Extending m matches would naturally increase the global score s in m x M. But one of the "constrains" of the algorithm is to keep computing further-reaching cells with the same score s at each step. I don't understand how we can extend multiple matches and keep computing wavefront of score s without having paths of different costs.

Thanks again for the help & brainstorming,

It is indeed not easy to understand what I tried to say there in the slides.
The general idea is similar to the original WFA, calculate the total penalty instead of the score.
The only difference is that the penalty is subtracted from the potential max. score: min(A,B) * M, here A/B/M: query length/target length/match score. So,

  1. For M > 0, some operations have an extra penalty of M because they lead one fewer match operation.
  2. For exactly the same bases, extending them still leads to zero penalties, this is the same as in the original WFA.

Anyway, not sure if you have read the preprint paper: https://www.biorxiv.org/content/10.1101/2022.01.12.476087v1
The authors point out that there is actually an equivalence between the conventional score scheme and the 0-match score scheme.
So, I think this is actually unnecessary to add the non-zero match score.

Let me know what do you think.

Yan

Hi,

Apologies for the late reply.

After talking with @jeizenga I got convinced of this. Thank you very much to both of you. This idea is truly awesome.

Cheers,