smarco/WFA-paper

Backtrace

luisacicolini opened this issue · 3 comments

Good afternoon!
We are trying to implement this algorithm without using the backtrace. To compare and evaluate the results we get, we need to calculate the number of insertions/deletions/mismatches: in the paper you wrote that this can be done, for matches, by calculating the "difference between the actual offset and the source". Does this work for insertions and deletions as well? Can we calculate the number of deletions and insertions by using the difference between the offset and the source within the I and D wavefront components?
Thanks a lot!

Hi,

If I understand correctly, you want to compute only the score and avoid storing the intermediate wavefronts (which saves a lot of memory), right? This is a natural instantiation of the general WFA, and we will cover it in the next release.

> we need to calculate the number of insertions/deletions/mismatches: in the paper you wrote that this can be done,

In any case, to check that your own implementation is correct, you just have to compare scores.
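One way to get scores to compare against is an independent reference computed with the classic quadratic dynamic program for gap-affine alignment. The following is only a sketch under stated assumptions: the names `penalties_t` and `reference_affine_score` are hypothetical (they are not part of the WFA code), and the cost model assumed is that a match costs 0 and a gap of length n costs gap_open + n * gap_extend.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical penalty set for this sketch: a match costs 0 and a gap of
 * length n costs gap_open + n * gap_extend. */
typedef struct {
  int mismatch;
  int gap_open;
  int gap_extend;
} penalties_t;

static int min2(int a, int b) { return a < b ? a : b; }

/* Global gap-affine alignment score computed with the classic quadratic
 * dynamic program, keeping a single row in memory.
 * Returns -1 if allocation fails. */
int reference_affine_score(const char* pattern, const char* text, penalties_t p) {
  assert(pattern != NULL && text != NULL);
  const int n = (int)strlen(pattern);
  const int m = (int)strlen(text);
  const int INF = INT_MAX / 2; /* large, but safe to add penalties to */
  int* H = malloc((size_t)(m + 1) * sizeof(int)); /* best score per column */
  int* F = malloc((size_t)(m + 1) * sizeof(int)); /* ends in a gap consuming pattern */
  if (H == NULL || F == NULL) { free(H); free(F); return -1; }

  /* Row 0: only a gap consuming text can reach column j. */
  H[0] = 0;
  F[0] = INF;
  for (int j = 1; j <= m; ++j) {
    H[j] = p.gap_open + j * p.gap_extend;
    F[j] = INF;
  }

  for (int i = 1; i <= n; ++i) {
    int diag = H[0];  /* previous-row value at column j-1, for j = 1 */
    H[0] = p.gap_open + i * p.gap_extend;
    int E = INF;      /* ends in a gap consuming text */
    for (int j = 1; j <= m; ++j) {
      /* H[j] still holds the previous row at this point. */
      F[j] = min2(F[j] + p.gap_extend, H[j] + p.gap_open + p.gap_extend);
      /* H[j-1] already holds the current row. */
      E = min2(E + p.gap_extend, H[j - 1] + p.gap_open + p.gap_extend);
      const int sub = diag + (pattern[i - 1] == text[j - 1] ? 0 : p.mismatch);
      diag = H[j];
      H[j] = min2(sub, min2(E, F[j]));
    }
  }
  const int score = H[m];
  free(H);
  free(F);
  return score;
}
```

With illustrative penalties (mismatch 4, gap open 6, gap extend 2), aligning "GATTACA" against "GATCACA" gives 4 (one mismatch) and against "GATACA" gives 8 (one gap of length 1). The same pairs can then be run through the implementation under test and the scores compared.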

> in the paper you wrote that this can be done, for matches, by calculating the "difference between the actual offset and the source".

In the paper, we state how to trace back the individual operations (Mismatch, Insertion, Deletion, Match) from a given wavefront.

> Can we calculate the number of deletions and insertions by using the difference between the offset and the source within the I and D wavefront components?

Sure, but for that, you will need to explicitly store the wavefront vectors. Have a look at "gap_affine/affine_wavefront_backtrace.c"
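Once the wavefront vectors are stored and a backtrace has produced the sequence of operations, the counts asked about follow from a single pass over that sequence. A minimal sketch, assuming the operations are available as a string over the letters M (match), X (mismatch), I (insertion) and D (deletion). This encoding and the names `op_counts_t`, `count_operations` and `affine_score_from_counts` are illustrative assumptions, not necessarily what affine_wavefront_backtrace.c produces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the per-operation counts. */
typedef struct {
  int matches;
  int mismatches;
  int insertions;
  int deletions;
  int gap_opens; /* number of maximal runs of I or of D */
} op_counts_t;

/* Count operations in a string over {M, X, I, D}; other characters are
 * ignored. A new gap is opened whenever an I (or D) does not directly
 * follow another I (or D). */
op_counts_t count_operations(const char* ops) {
  assert(ops != NULL);
  op_counts_t c = {0, 0, 0, 0, 0};
  char prev = '\0';
  for (const char* q = ops; *q != '\0'; ++q) {
    switch (*q) {
      case 'M': c.matches++; break;
      case 'X': c.mismatches++; break;
      case 'I':
        c.insertions++;
        if (prev != 'I') c.gap_opens++;
        break;
      case 'D':
        c.deletions++;
        if (prev != 'D') c.gap_opens++;
        break;
      default: break;
    }
    prev = *q;
  }
  return c;
}

/* Gap-affine score implied by the counts, assuming a match costs 0 and a
 * gap of length n costs gap_open + n * gap_extend. */
int affine_score_from_counts(op_counts_t c, int mismatch, int gap_open,
                             int gap_extend) {
  return c.mismatches * mismatch
       + c.gap_opens * gap_open
       + (c.insertions + c.deletions) * gap_extend;
}
```

For the operation string "MMMXMMIIMDM" this counts 7 matches, 1 mismatch, 2 insertions, 1 deletion and 2 gap openings. With illustrative penalties (4, 6, 2) the implied score is 4 + 12 + 6 = 22, which should equal the score reported by the aligner for the same alignment.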

It could also be that I am completely misunderstanding the question here.
Let me know. Best,

Hi,

First of all thank you for answering.

At the moment we are comparing the total number of wavefronts (as extracted from the affine_wavefront_align function) with the number our implementation produces, without considering the number of matches/insertions/deletions and thus without computing the backtrace. Do you think this is enough to ensure that the outputs are the same?

Well, as always, it is a necessary condition but not a sufficient one. That said, running a sufficiently large input (millions of sequences) at different error rates and getting the correct score (total number of wavefronts) every time is a good start :-)
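Test pairs at a chosen error rate can be generated by mutating a sequence at random. The sketch below is one possible way to do that, not the procedure used for the paper: the name `mutate_sequence`, the ACGT alphabet, and the equal split between substitutions, insertions and deletions are all assumptions, and the per-position edit probability only approximates the final error rate.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static const char ALPHABET[] = "ACGT"; /* assumed alphabet */

/* Returns a newly allocated copy of `seq` in which each position is
 * edited with probability `error_rate` (substitution, insertion or
 * deletion, chosen uniformly). The caller frees the result.
 * Returns NULL if allocation fails. Reseeds the global rand() state. */
char* mutate_sequence(const char* seq, double error_rate, unsigned int seed) {
  assert(seq != NULL);
  assert(error_rate >= 0.0 && error_rate <= 1.0);
  const size_t n = strlen(seq);
  /* Worst case: every position gets an insertion, doubling the length. */
  char* out = malloc(2 * n + 1);
  if (out == NULL) return NULL;
  srand(seed);
  size_t k = 0;
  for (size_t i = 0; i < n; ++i) {
    const double u = (double)rand() / ((double)RAND_MAX + 1.0); /* in [0, 1) */
    if (u >= error_rate) {
      out[k++] = seq[i]; /* position left untouched */
      continue;
    }
    switch (rand() % 3) {
      case 0: { /* substitution: pick a base different from the original */
        char c;
        do { c = ALPHABET[rand() % 4]; } while (c == seq[i]);
        out[k++] = c;
        break;
      }
      case 1: /* insertion: keep the base and add a random one after it */
        out[k++] = seq[i];
        out[k++] = ALPHABET[rand() % 4];
        break;
      default: /* deletion: skip the base */
        break;
    }
  }
  out[k] = '\0';
  return out;
}
```

Each (original, mutated) pair can then be aligned by both implementations and the two scores compared. Using a fixed seed per pair makes any disagreement reproducible.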