jeffdaily/parasail

Feature request: option for no gap penalty in target start and end of both target/query

traversc opened this issue · 15 comments

Here's a quick example of what I think would be useful:

#include <Rcpp.h>
#include "parasail.h"
using namespace Rcpp;

// [[Rcpp::export]]
void example_test() {
  const parasail_matrix_t *smat = parasail_matrix_lookup("nuc44");
  std::string seqA =     "CCGTACGTACGTCCCCC"; // recombinant gene query
  std::string seqB = "GGGGCCGTACGTACGTGGGGG"; // reference database
  parasail_result_t* result = parasail_sg_qe_db_trace(seqA.data(), seqA.size(), seqB.data(), seqB.size(), 2, 2, smat);
  parasail_traceback_generic(seqA.data(), seqA.size(), seqB.data(), seqB.size(), "Query:", "Target:", smat, result, '|', '*', '*', 60, 7, 1);
  parasail_result_free(result);
}

/*** R
example_test()
*/
Target:       1 GGGGCCGTACGTACGTGGGGG-----      21
                    ||||||||||||          
Query:        1 ----CCGTACGTACGT-----CCCCC      17

Length: 26
Identity:        12/26 (46.2%)
Similarity:      12/26 (46.2%)
Gaps:            14/26 (53.8%)
Score: 50

This is almost what I'd want, but the desired output would be a score of 60.

The biological use case would be if I was trying to identify a recombinant gene or a chromosome translocation, where I expect the end of the alignment to not match up (or vice versa, the beginning).

Would it be possible to implement something like this?

Thanks for the great alignment package.

Does local alignment not produce the intended result, e.g., parasail_sw_trace?

Not quite, a local align doesn't penalize the beginning of the query if there is a mismatch.

Prior to version 2.4, there was only a single implementation of semi-global alignment that would not penalize beginning or end gaps from both the query and database sequence. That implementation still exists -- could you try it? Perhaps that is what you are requesting?

parasail_sg_trace -- notably without any modifiers like qe_db.

Do note that it can lead to some ambiguity when trying to determine the best alignment if the same score is possible using the entire query or only a subsequence of the query.

I apologize if we're not communicating clearly. From the table below, do any of the semi-global modes match what you're requesting? If not, could you perhaps follow the example of the table and insert the kind of semi-global mode you are requesting?

Gaps are penalty-free at Function Name
beginning of s1/query sg_qb
end of s1/query sg_qe
beginning and end of s1/query sg_qx
beginning of s2/database sg_db
end of s2/database sg_de
beginning and end of s2/database sg_dx
beginning of s1/query and end of s2/database sg_qb_de
beginning of s2/database and end of s1/query sg_qe_db
beginning and end of both sequences sg (original, unchanged for backwards compatibility)

Sorry, I'll try to explain more clearly. In the nomenclature in the table, what I'm interested in is sg_qe_dx.

Gaps are penalty-free at Function Name
beginning and end of s2/database and end of s1/query sg_qe_dx

Here's a reworked example -- same as the original, but with a mismatch at the beginning of the gene query.

  std::string seqA =     "ACGTACGTACGTCCCCC"; // recombinant gene query
  std::string seqB = "GGGGCCGTACGTACGTGGGGG"; // reference database

Output using sg:

Target:       1 GGGGC-CGTACGTACGT-----GGGGG      21
                      |||||||||||          
Query:        1 -----ACGTACGTACGTCCCCC-----      17
Score: 43

Output using sw:

Target:       6 CGTACGTACGT      16
                |||||||||||
Query:        2 CGTACGTACGT      12
Score: 55

Output using sg_qe_db:

Target:       1 GGGGC-CGTACGTACGTGGGGG-----      21
                      |||||||||||          
Query:        1 -----ACGTACGTACGT-----CCCCC      17
Score: 43

Desired output:

Target:       1 GGGGC-CGTACGTACGT      16
                      |||||||||||          
Query:        1 -----ACGTACGTACGT      12
Score: 53

Would you be comfortable building a feature branch? I have made a first attempt at implementing what you have requested.

git clone -b feature/more_sg https://github.com/jeffdaily/parasail.git
cd parasail
autoremake -fi
mkdir build
cd build
../configure
make -j $(nproc)
make -j $(nproc) check
./tests/test_69 # new test for this issue number

You can update the tests/test_69.c code as-needed. Using your code example from this issue, I'm still getting the wrong score for the new function. Now to debug.

I'm having trouble installing that branch. I'm not too familiar with build tools. It says "autoremake not found". Am I supposed to substitute $(nproc) for a number?

Thanks, I got it to work. The output seems to always be identical to "sg_qe_db".

I'm glad you got the build to work. This will help since you can evaluate my changes. I hope you have patience; this isn't going to be a quick/easy change.

I think I figured out how to produce what you want, but it does change the way semi-global is implemented internally. The original semi-global mode ("sg") would not penalize both ends of both sequences. This lead to ambiguous cases that I resolved somewhat arbitrarily -- selecting the result that gave the shortest alignment. When I added all the new semi-global routines for the 2.4 release, they were naturally deterministic since they only penalized at most one of the two sequence beginnings or ends.

What you are requesting is a local alignment interpretation when either both sequence beginnings or both sequence ends are not penalized. I understand that now. And from your "desired output" in an earlier comment, I was able to trace some of my debugging output to see how to make that happen. But I did warn you that this is likely not a quick/easy change, so please be patient. Thank you.

@traversc I need your help with a new idea. I'm trying to preserve old behavior for maximum backwards compatibility, but I also see the value in your request. I'm thinking of adding even more new semi-global routines. Basically, instead of trying to disambiguate when we are penalizing the both beginnings or both ends, we take the "local" ("l", lowercase "L") approach. Here's a new table, and note the new column, new routines, and new names:

query begin query end db begin db end stop local Function Name Implemented
x sg_qb ✔️
x sg_qe ✔️
x sg_db ✔️
x sg_de ✔️
x x sg_qx ✔️
x x sg_dx ✔️
x x sg_qb_db
x x sg_qb_de ✔️
x x sg_qe_db ✔️
x x sg_qe_de
x x x sg_qx_db
x x x sg_qx_de
x x x sg_qb_dx
x x x sg_qe_dx
x x x x sg (original, unchanged for backwards compatibility) ✔️
x x x sgl_qb_db
x x x sgl_qe_de
x x x x sgl_qx_db
x x x x sgl_qx_de
x x x x sgl_qb_dx
x x x x sgl_qe_dx
x x x x x sgl

That looks quite comprehensive, but I don't fully understand the meaning of "stop local". E.g., how would "sgl" differ from Smith-Waterman (sw)?

I think it would be easiest to grok with examples contrasting the different approaches.

I'm having trouble understanding all the different approaches, as well. I don't have the cycles to spend a ton of time on this. Perhaps the name "stop local" is just poorly chosen.

Except for the original semi-global implementation "sg", the first implementation(s) of semi-global were straightforward. Understanding that sequence alignment is a dynamic programming table:

  • Not penalizing query begin gaps? Set initial condition column to 0.
  • Not penalizing target begin gaps? Set initial condition row to 0.
  • Not penalizing query end gaps? Select highest score from last column.
  • Not penalizing target end gaps? Select highest score from last row.

The hard part now is we are introducing ambiguity.

  • Not penalizing query and target begin gaps? What does that look like? Same initial conditions?
  • Not penalizing query and target end gaps? Should we select the highest score from the table like we do with local alignment?

As far as rendering the traceback goes, it was also straightforward when there was no ambiguity. There would always be at least the query or target without a beginning and/or gap penalty, so it made sense to me to render the entire alignment like we do with global alignment tracebacks. But with these ambiguous cases we are introducing, as pointed out by your original request, it was printing a series of insertions followed by a series of deletions or vice versa -- net very useful.

The "stop local" caused my implementation to select the highest score in the table as the answer when not penalizing end gaps from both sequences. In the cases where we weren't penalizing both beginning gaps, it was intended to signal to my traceback function that it should clip the wasteful all-indels output from the beginning of the traceback.

When I was playing around with the implementation of "sgl" versus plain local alignment, the only difference I could tell was that "sgl" allowed the scoring table to become negative instead of clamping the score to >= 0. In the resulting SW traceback, starting from the highest score in the table, when a 0 is found, the traceback stops. I wasn't sure what the "sgl" traceback should look like. Otherwise, yes, it was exactly the same as local alignment besides the possible negative score during the calculation.

I don't see how sgl can produce negative values based on the description, but I understand that these are simply variations on the boundary conditions and restrictions of the dynamic programming matrix.

I found an excel solver, which I could understand and modify to what I think should be "sg(l?)_qb_dx":

https://i.imgur.com/xwb0Cf0.png

This gives the expected alignment based on the traceback:

Query: -----ACGTACGTACGT
Ref__: GGGGC-CGTACGTACGT

I'm not really sure if that helps, but for the question: "Not penalizing query and target end gaps? Should we select the highest score from the table like we do with local alignment?"

I think the answer is "yes",but I also think that automatically follows from the two previous statements.

"Not penalizing query end gaps? Select highest score from last column." +
"Not penalizing target end gaps? Select highest score from last row."

=
Select highest score anywhere.

Similarly, if we took the reverse complement of the example and did "sg_qx_de" you'd get:

https://i.imgur.com/KZEiAZm.png

Which corresponds to the alignment:

Query: ACGTACGTACGT[-----]
Ref__: ACGTACGTACG-[GCCCC]

Hope I'm making sense :o

What I meant about sgl producing negative numbers is that in the middle of the alignment algorithm the scores are allowed to be negative. Unlike with SW where scores are never allowed to become less than 0 even in the middle of the alignment.

Thank you for the pictures of the score tables.