smarco/WFA2-lib

Speed up WFA when two sequences differ greatly in length

lh3 opened this issue · 3 comments

lh3 commented

When patching gaps between two anchors, we sometimes need to align two sequences of vastly different lengths. Suppose we are aligning a 10bp sequence against a 100kb sequence. The active band [lo,hi] in the original WFA will grow from 1 to 100010 in size. The total number of iterations is about 1000102/2. Based on the running time of WFA2-lib, I guess WFA2-lib has a similar behavior. This is not necessary given that a wavefront can only consist of 10 cells in this example.

Generally, a key observation is that the active band is determined by the cells in the current stripe (the wfalm terminology). Sometimes we can decrease "hi" or increase "lo" when the wavefront hits the end of the query or the target. I added a hack to miniwfa to achieve that. For 10bp vs 100kb, WFA2-lib takes 44 sec while the modified miniwfa takes 0.2 sec. ksw2 only takes 0.01s mainly because WFA doesn't have an advantage on such examples and partly because my hack is inefficient. I think there should be a faster and cleaner solution but I haven't found that.

Is this in the context of global or semi-global alignment (I suppose global given that you're aligning between anchors)?

Could you explain the code you added a bit more? It looks quadratic to me but surely that's just my misunderstanding. Is the following correct:
WFA keeps diagonals that already reached the end active, and (tries to) update them on every increment of s. Your patch decreases the band whenever this happens, so that only diagonals that haven't reached the end yet are updated.

lh3 commented

Is this in the context of global or semi-global alignment (I suppose global given that you're aligning between anchors)?

It is global.

Could you explain the code you added a bit more?

Function wf_stripe_shrink() finds the smallest lower bound and the largest upper bound among cells in the current stripe. This function is called every 64 cycles.

It looks quadratic to me

Let q=max(x,o1+e1,o2+e2). Then wf->n = q+1. Perhaps it is safer to swap the order of for (j) and for (d). In that case, the time complexity of calling the function once is O(qw+q) where w is the change of the bandwidth. Most of time, w=0. When w>0, the following iterations will be faster. With the current loop, the time complexity can be O(qw+q2) in the worst case, I think.

PS:

WFA keeps diagonals that already reached the end active, and (tries to) update them on every increment of s. Your patch decreases the band whenever this happens, so that only diagonals that haven't reached the end yet are updated.

Almost correct, except that the "whenever" part – my implementation only calls the function when s%64==0 as most of time this function has no effect.

Sorry for the late reply.

I think I understand the situation you are presenting. In the case of the WFA2-lib, it calls the function wavefront_compute_trim_ends() every time a new wavefront is computed.

This should have a similar effect as your wf_stripe_shrink() (?)

For 10bp vs 100kb, WFA2-lib takes 44 sec while the modified miniwfa takes 0.2 sec

But it clearly isn't. So I must have made an implementation mistake here.

mainly because WFA doesn't have an advantage on such examples

Right. In some extreme corner cases, like this one, the WFA doesn't have the upper-hand.

the modified miniwfa takes 0.2 sec. ksw2 only takes 0.01s

Although the time measurement is too small to make an assessment, this case invites to make a fine profile of this case and see what is going on.

I'm leaving this open until I can address this case.
Thanks for the feedback.