seqan/seqan3

[Alignment] Wrong alignment score

yceh opened this issue · 8 comments

yceh commented

Does this problem persist on the current master?

  • I have verified the issue on the current master

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

For the example attached, seqan report an alignment score of 34240, but the score of the alignment returned is 34202.

Expected Behavior

The reported alignment score and the score calculated from the returned alignment should match.

The attached code prints the alignment score at each column, for a majority of other alignments, the alignment score at the last column matches the alignment score reported by seqan, but it is still entirely possible that I didn't calculate the score correctly or I misunderstood the doc. (and for this particular example, they will match if I trim away the last nucleotide of both reference and query)

Steps To Reproduce

The following code should produce this issue:

#include <seqan3/alignment/configuration/align_config_gap_cost_affine.hpp>
#include <utility>
 
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/alignment/configuration/all.hpp>
#include <vector>

int main(int argc,char** argv){
    using namespace seqan3::literals;
    seqan3::dna4_vector ref="CCCTATCGTAGGTTGTCAACCACCAGGTATACTTGAGGATCCTTCGTCGGACATTTATCACGTATCTCATATAGGTTGCTACTCATGGGACAGTATGGCTACTGGCAGGGTTGCAAAGTTATTCCATTCTCTACGAGTTCACGTAAGGTTTGGATGCTACTCACCAGATATCTGTTAAATGGCGCCCCGCGTATACCGTGGTTCCCAACTTAGCAAGTTTATTTAATGGCAAGCATGGAGCCTCGGACCCTCCCGGAGGACTTAAGTAGGATGGTAGCGCTGCGGTTGCGTGACGACCCGTCGACGTAAAAACATTTTCTTTACGGGAAGCTAGACTACGGACGCTGTCACAAAATATCGGAAGTGACGCGCCTGGCATATATCCTTGGCGTTAACGGAACCCCCAGTGTGACGTTATTGCGGGATCTTATAATGGAGTTGTGAGCAGACCGCTTGTTTGTTCGACCGGGCAAGGGAAAAAAAAAAAAAAAAAAAAAAAA"_dna4;;
    seqan3::dna4_vector query="GCCCCACTCGTAGAGTGTCCACCAGGTACTAGGAAGACGCTTCATGCGCTTTTCCTACATGAATTGGCAGAATTCATGGACAGTTCAATGGCACTGGACAGTGTTCTTTGGCTTAAAGCTTGATATCCGAATTTCTCTACTGAGTCTGTCCTGTAAGGTTGAGTGTCGGATGCTCTCATCATATCTGTACATGGAACGCGTCCCCCGGTATCCTGCGGTCCCCAACTGTACAGTTATTTTATGTGCAACAGCATTCGGAGGCCCTCGACCTCCTGCGACTTAAGTAGAGGCGCCATGCGTGTGAATCGACACCTACCGCGACTAAAAAACTTTACTCTGGGCACGGGATACACTACGGAACCTGTCACTAAATGACGGAGAAGGCTGATGAATCCGCGGCAACCTACTAGTTAAGGCGGAGAAAGGAGACAGGTTTGCGGGATTTCATATTGGGAAGAGTTAGTGACAGCCCTTTTTGTTCAATATCCCCGCGAAAGGCT"_dna4;

    seqan3::dna4 letter;

    seqan3::nucleotide_scoring_scheme<int> nu_scheme;
    nu_scheme.score('A'_dna4,'A'_dna4)=145;
    nu_scheme.score('A'_dna4,'T'_dna4)=-144;
    nu_scheme.score('A'_dna4,'G'_dna4)=-153;
    nu_scheme.score('A'_dna4,'C'_dna4)=-152;

    nu_scheme.score('T'_dna4,'A'_dna4)=-144;
    nu_scheme.score('T'_dna4,'T'_dna4)=144;
    nu_scheme.score('T'_dna4,'G'_dna4)=-143;
    nu_scheme.score('T'_dna4,'C'_dna4)=-140;
    
    nu_scheme.score('G'_dna4,'A'_dna4)=-153;
    nu_scheme.score('G'_dna4,'T'_dna4)=-143;
    nu_scheme.score('G'_dna4,'G'_dna4)=144;
    nu_scheme.score('G'_dna4,'C'_dna4)=-145;

    nu_scheme.score('C'_dna4,'A'_dna4)=-152;
    nu_scheme.score('C'_dna4,'T'_dna4)=-140;
    nu_scheme.score('C'_dna4,'G'_dna4)=-145;
    nu_scheme.score('C'_dna4,'C'_dna4)=144;

    int gap_open_cost=-52;
    int gap_extend_cost=-58;
    auto new_gap_cost=gap_extend_cost+gap_open_cost;
    // Configure the alignment kernel.
    auto config =
        seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{nu_scheme}|
        seqan3::align_cfg::gap_cost_affine{
            seqan3::align_cfg::open_score{gap_open_cost},
            seqan3::align_cfg::extension_score{gap_extend_cost}};
 
    auto results = seqan3::align_pairwise(std::tie(ref, query), config);
    auto & res = *results.begin();
    seqan3::debug_stream<<"aligned_ref:\n"<<std::get<0>(res.alignment())<<"\n";
    seqan3::debug_stream<<"aligned_query:\n"<<std::get<1>(res.alignment())<<"\n";

      auto seq1 = std::get<0>(res.alignment());
      auto seq2_iter = std::get<1>(res.alignment()).begin();

      int score_so_far = 0;
      int q_pos = 0;
      int r_pos = 0;
      enum {M,I,D} state=M;
      seqan3::debug_stream<<"seq1\tseq2\tq_pos\tr_pos\tscore\n";
      for (auto seq1_iter = seq1.begin(); seq1_iter < seq1.end(); seq1_iter++) {
        if (*seq1_iter == seqan3::gap{}) {
          q_pos++;
          if (state==D) {
            score_so_far += gap_extend_cost;
          } else {
            score_so_far += new_gap_cost;
          }
          state=D;
        } else if (*seq2_iter == seqan3::gap{}) {
          r_pos++;
          if (state==I) {
            score_so_far += gap_extend_cost;
          } else {
            score_so_far += new_gap_cost;
          }
          state=I;
        } else {
          r_pos++;
          q_pos++;
          seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq1_ele =
              *seq1_iter;
          seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq2_ele =
              *seq2_iter;
          score_so_far += nu_scheme.score(seq1_ele.convert_to<seqan3::dna4>(),
                                          seq2_ele.convert_to<seqan3::dna4>());
          state=M;
        }
        seqan3::debug_stream<<*seq1_iter<<'\t'<<*seq2_iter<<'\t'<<q_pos<<'\t'<<r_pos<<'\t'<<score_so_far<<'\n';
        seq2_iter++;
      }
      seqan3::debug_stream<<"alignemnt score in result:"<<res.score()<<"\n";
      return 0;
}

Environment

- Operating system:Ubuntu 20.04
- SeqAn version:e51658dfa5ae7ee932cecbc41eb4d16c710d1f07
- Compiler:gcc 10.3.0

Anything else?

No response

Hi @yceh,

Thank you for submitting the bug report.

I could reduce the sequences to "GGCAAGAA" and "CGAAGC".

Alignment:

    0     .    :
      GGCAAGAA---
        |  | |
      --C--G-AAGC

Score Matrix:

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;0   ;-110;-168;-226;-284;-342;-400;-458;-516;
C;-110;-145;-255;-24 ;-134;-192;-250;-308;-366;
G;-168;34  ;-1  ;-111;-169;-227;-48 ;-158;-216;
A;-226;-76 ;-111;-153;34  ;-24 ;-134;97  ;-13 ;
A;-284;-134;-169;-250;-8  ;179 ;69  ;11  ;242 ;
G;-342;-140;10  ;-100;-118;69  ;323 ;213 ;155 ;
C;-400;-250;-100;154 ;44  ;11  ;213 ;171 ;74  ;

Trace Matrix:

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;N   ;L   ;l   ;l   ;    ;    ;    ;    ;    ;
C;U   ;DUL ;DUL ;DUl ;L   ;l   ;l   ;l   ;l   ;
G;u   ;DUL ;DuL ;L   ;l   ;l   ;DUl ;L   ;l   ;
A;u   ;UL  ;UL  ;DuL ;DUL ;DUL ;l   ;DUl ;DUL ;
A;u   ;uL  ;uL  ;uL  ;DUl ;DUL ;L   ;DUl ;DUl ;
G;u   ;DuL ;DuL ;L   ;Ul  ;Ul  ;DUL ;L   ;l   ;
C;u   ;uL  ;UL  ;DUL ;L   ;ul  ;Ul  ;DUL ;uL  ;

Path:

u;u;u;d;l;d;l;l;d;l;l;

Score Matrix (Just Path):

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;0   ;-110;-168;    ;    ;    ;    ;    ;    ;
C;    ;    ;    ;-24 ;-134;-192;    ;    ;    ;
G;    ;    ;    ;    ;    ;    ;-48 ;-158;    ;
A;    ;    ;    ;    ;    ;    ;    ;    ;-13 ;
A;    ;    ;    ;    ;    ;    ;    ;    ;242 ;
G;    ;    ;    ;    ;    ;    ;    ;    ;155 ;
C;    ;    ;    ;    ;    ;    ;    ;    ;74  ;

Trace Matrix (Just Path):

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;N   ;L   ;l   ;    ;    ;    ;    ;    ;    ;
C;    ;    ;    ;D   ;L   ;l   ;    ;    ;    ;
G;    ;    ;    ;    ;    ;    ;D   ;L   ;    ;
A;    ;    ;    ;    ;    ;    ;    ;    ;D   ;
A;    ;    ;    ;    ;    ;    ;    ;    ;U   ;
G;    ;    ;    ;    ;    ;    ;    ;    ;u   ;
C;    ;    ;    ;    ;    ;    ;    ;    ;u   ;
seq1	seq2	state	q_pos	r_pos	score
G	-	INS	0	1	-110	SAME SCORE
G	-	INS	0	2	-168	SAME SCORE
C	C	MAT	1	3	-24	SAME SCORE
A	-	INS	1	4	-134	SAME SCORE
A	-	INS	1	5	-192	SAME SCORE
G	G	MAT	2	6	-48	SAME SCORE
A	-	INS	2	7	-158	SAME SCORE
A	A	MAT	3	8	-13	SAME SCORE
-	A	DEL	4	8	-123	DIFFERENT SCORE
-	G	DEL	5	8	-181	DIFFERENT SCORE
-	C	DEL	6	8	-239	DIFFERENT SCORE

If you add

#include <seqan3/alignment/aligned_sequence/debug_stream_alignment.hpp>

you can print the alignment:

auto & res = *results.begin();
seqan3::debug_stream << "alignment:\n" << res << "\n";

I found that changing the scoring scheme type to int8_t (default) results in the same score.

seqan3::nucleotide_scoring_scheme<int8_t>

Mind that with int8_t the scores underflow. Haven't yet tried changing the scores to something in [-128,127]

Something goes wrong with the traceback.
The score is actually correct.

    0     .    :
      GGCAAGAA---
        |  | |
      --C--G-AAGC

The score for this alignment is -239.

Is not optimal, because you could align both A:

    0     .    :
      GGCAAGAA--
        |  | |
      --C--GAAGC

The score for this alignment is 74.

https://godbolt.org/z/qrePrW67Y

Some debug code
#include <utility>
#include <vector>

#include <seqan3/alignment/aligned_sequence/debug_stream_alignment.hpp>
#include <seqan3/alignment/configuration/align_config_gap_cost_affine.hpp>
#include <seqan3/alignment/configuration/all.hpp>
#include <seqan3/alignment/matrix/detail/debug_matrix.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>

int main(int argc, char ** argv)
{
    using namespace seqan3::literals;
    seqan3::dna4_vector ref = "GGCAAGAA"_dna4;
    seqan3::dna4_vector query = "CGAAGC"_dna4;

    seqan3::dna4 letter;

    seqan3::nucleotide_scoring_scheme<int> nu_scheme;
    nu_scheme.score('A'_dna4, 'A'_dna4) = 145;
    nu_scheme.score('A'_dna4, 'T'_dna4) = -144;
    nu_scheme.score('A'_dna4, 'G'_dna4) = -153;
    nu_scheme.score('A'_dna4, 'C'_dna4) = -152;

    nu_scheme.score('T'_dna4, 'A'_dna4) = -144;
    nu_scheme.score('T'_dna4, 'T'_dna4) = 144;
    nu_scheme.score('T'_dna4, 'G'_dna4) = -143;
    nu_scheme.score('T'_dna4, 'C'_dna4) = -140;

    nu_scheme.score('G'_dna4, 'A'_dna4) = -153;
    nu_scheme.score('G'_dna4, 'T'_dna4) = -143;
    nu_scheme.score('G'_dna4, 'G'_dna4) = 144;
    nu_scheme.score('G'_dna4, 'C'_dna4) = -145;

    nu_scheme.score('C'_dna4, 'A'_dna4) = -152;
    nu_scheme.score('C'_dna4, 'T'_dna4) = -140;
    nu_scheme.score('C'_dna4, 'G'_dna4) = -145;
    nu_scheme.score('C'_dna4, 'C'_dna4) = 144;

    int gap_open_cost = -52;
    int gap_extend_cost = -58;
    auto new_gap_cost = gap_extend_cost + gap_open_cost;
    // Configure the alignment kernel.
    auto config = seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{nu_scheme}
                | seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{gap_open_cost},
                                                     seqan3::align_cfg::extension_score{gap_extend_cost}}
                | seqan3::align_cfg::detail::debug{};

    auto results = seqan3::align_pairwise(std::tie(ref, query), config);
    auto & res = *results.begin();
    seqan3::debug_stream << res.alignment() << std::endl;
    seqan3::debug_stream << seqan3::detail::debug_matrix{res.score_matrix()} << std::endl;
    seqan3::debug_stream << seqan3::detail::debug_matrix{res.trace_matrix()} << std::endl;

    // seqan3::debug_stream<<"aligned_ref:\n"<<std::get<0>(res.alignment())<<"\n";
    // seqan3::debug_stream<<"aligned_query:\n"<<std::get<1>(res.alignment())<<"\n";

    auto seq1 = std::get<0>(res.alignment());
    auto seq2_iter = std::get<1>(res.alignment()).begin();

    int score_so_far = 0;
    int q_pos = 0;
    int r_pos = 0;
    enum
    {
        M,
        I,
        D
    } state = M;
    seqan3::debug_stream << "seq1\tseq2\tstate\tq_pos\tr_pos\tscore\n";
    for (auto seq1_iter = seq1.begin(); seq1_iter < seq1.end(); seq1_iter++)
    {
        if (*seq1_iter == seqan3::gap{})
        {
            q_pos++;
            if (state == D)
            {
                score_so_far += gap_extend_cost;
            }
            else
            {
                score_so_far += new_gap_cost;
            }
            state = D;
        }
        else if (*seq2_iter == seqan3::gap{})
        {
            r_pos++;
            if (state == I)
            {
                score_so_far += gap_extend_cost;
            }
            else
            {
                score_so_far += new_gap_cost;
            }
            state = I;
        }
        else
        {
            r_pos++;
            q_pos++;
            seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq1_ele = *seq1_iter;
            seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq2_ele = *seq2_iter;
            score_so_far += nu_scheme.score(seq1_ele.convert_to<seqan3::dna4>(), seq2_ele.convert_to<seqan3::dna4>());
            state = M;
        }
        seqan3::debug_stream << *seq1_iter << '\t' << *seq2_iter << '\t'
                             << (state == D ? "DEL" : (state == I ? "INS" : "MAT")) << '\t' << q_pos << '\t' << r_pos
                             << '\t' << score_so_far << '\n';
        seq2_iter++;
    }
    seqan3::debug_stream << "alignemnt score in result:" << res.score() << ", " << score_so_far << "\n";
    return res.score() == score_so_far ? 0 : -1;
}

Alignment:

    0     .    :
      GGCAAGAA---
        |  | |
      --C--G-AAGC

Score Matrix:

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;0   ;-110;-168;-226;-284;-342;-400;-458;-516;
C;-110;-145;-255;-24 ;-134;-192;-250;-308;-366;
G;-168;34  ;-1  ;-111;-169;-227;-48 ;-158;-216;
A;-226;-76 ;-111;-153;34  ;-24 ;-134;97  ;-13 ;
A;-284;-134;-169;-250;-8  ;179 ;69  ;11  ;242 ;
G;-342;-140;10  ;-100;-118;69  ;323 ;213 ;155 ;
C;-400;-250;-100;154 ;44  ;11  ;213 ;171 ;74  ;

Trace Matrix:

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;N   ;L   ;l   ;l   ;    ;    ;    ;    ;    ;
C;U   ;DUL ;DUL ;DUl ;L   ;l   ;l   ;l   ;l   ;
G;u   ;DUL ;DuL ;L   ;l   ;l   ;DUl ;L   ;l   ;
A;u   ;UL  ;UL  ;DuL ;DUL ;DUL ;l   ;DUl ;DUL ;
A;u   ;uL  ;uL  ;uL  ;DUl ;DUL ;L   ;DUl ;DUl ;
G;u   ;DuL ;DuL ;L   ;Ul  ;Ul  ;DUL ;L   ;l   ;
C;u   ;uL  ;UL  ;DUL ;L   ;ul  ;Ul  ;DUL ;uL  ;

Path:

u;u;u;d;l;d;l;l;d;l;l;

Score Matrix (Just Path):

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;0   ;-110;-168;    ;    ;    ;    ;    ;    ;
C;    ;    ;    ;-24 ;-134;-192;    ;    ;    ;
G;    ;    ;    ;    ;    ;    ;-48 ;-158;    ;
A;    ;    ;    ;    ;    ;    ;    ;    ;-13 ;
A;    ;    ;    ;    ;    ;    ;    ;    ;242 ;
G;    ;    ;    ;    ;    ;    ;    ;    ;155 ;
C;    ;    ;    ;    ;    ;    ;    ;    ;74  ;

Trace Matrix (Just Path):

 ;    ;G   ;G   ;C   ;A   ;A   ;G   ;A   ;A   ;
 ;N   ;L   ;l   ;    ;    ;    ;    ;    ;    ;
C;    ;    ;    ;D   ;L   ;l   ;    ;    ;    ;
G;    ;    ;    ;    ;    ;    ;D   ;L   ;    ;
A;    ;    ;    ;    ;    ;    ;    ;    ;D   ;
A;    ;    ;    ;    ;    ;    ;    ;    ;U   ;
G;    ;    ;    ;    ;    ;    ;    ;    ;U   ;
C;    ;    ;    ;    ;    ;    ;    ;    ;u   ;
seq1	seq2	state	q_pos	r_pos	score
G	-	INS	0	1	-110	SAME SCORE
G	-	INS	0	2	-168	SAME SCORE
C	C	MAT	1	3	-24	SAME SCORE
A	-	INS	1	4	-134	SAME SCORE
A	-	INS	1	5	-192	SAME SCORE
G	G	MAT	2	6	-48	SAME SCORE
A	-	INS	2	7	-158	SAME SCORE
A	A	MAT	3	8	-13	SAME SCORE
-	A	DEL	4	8	-123	DIFFERENT SCORE
-	G	DEL	5	8	-181	DIFFERENT SCORE
-	C	DEL	6	8	-239	DIFFERENT SCORE

OK, I did some digging.
The Gotoh paper states (section 3 is Traceback):

gotoh

The meaning of the parameters is not that explicitly stated in the paper, but my interpretation is:

  • u is the gap cost
  • L seems to refer to the gap weighing function, we have gap open and extension, hence L=2
  • max d(x,y) is the maximum score, a mismatch has a positive score

The equation seems to be equivalent to gap_open + 2 * gap_extension >= mismatch_score in our implementation.

E.g.,

nu_scheme.set_simple_scheme(seqan3::match_score{145}, seqan3::mismatch_score{-144});
int const gap_open_cost = -100;
int const gap_extend_cost = -22; // works
int const gap_extend_cost = -23; // does not work

nu_scheme.set_simple_scheme(seqan3::match_score{145}, seqan3::mismatch_score{-146});
int const gap_open_cost = -100;
int const gap_extend_cost = -22; // works
int const gap_extend_cost = -23; // works

In the example, the scores are more sophisticated, I think the most safe would be to go with the most positive mismatch
score. However, it works for a gap_extension of 46, but not 47, so only certain costs are involved for this specific alignment.

All in all, it seems that the algorithm does not work because we are in violation of this assumption made by Gotoh.

Edit:

  • The constraint should be more like 2 * gap_open < max_mismatch, i.e. 2 gaps are more expensive than a mismatch
  • So above reasoning doesn't hold
  • Maybe we need to track being in the last column? In global alignment, we want to prevent coming from left in the last column, unless we are in the last cell.
rrahn commented

Hey, so I looked at it today and the problem was that on very few occasions the original up open signal was overwritten by a better score coming from the left.
This happened here in the cell [c8:r5], where extending the gap in horizontal direction outscores the diagonal and vertical direction.
However, the optimal alignment comping from up ended in a gap open which was not captured before.
I fixed this in #3098 by silently tracking the up open direction in case the left direction dominates.

We merged #3098 which should fix this issue. Please re-open if the issue has not been resolved for you.