[Alignment] Wrong alignment score
yceh opened this issue · 8 comments
Does this problem persist on the current master?
- I have verified the issue on the current master
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
For the example attached, seqan report an alignment score of 34240, but the score of the alignment returned is 34202.
Expected Behavior
The reported alignment score and the score calculated from the returned alignment should match.
The attached code prints the alignment score at each column, for a majority of other alignments, the alignment score at the last column matches the alignment score reported by seqan, but it is still entirely possible that I didn't calculate the score correctly or I misunderstood the doc. (and for this particular example, they will match if I trim away the last nucleotide of both reference and query)
Steps To Reproduce
The following code should produce this issue:
#include <seqan3/alignment/configuration/align_config_gap_cost_affine.hpp>
#include <utility>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/alignment/configuration/all.hpp>
#include <vector>
int main(int argc,char** argv){
using namespace seqan3::literals;
seqan3::dna4_vector ref="CCCTATCGTAGGTTGTCAACCACCAGGTATACTTGAGGATCCTTCGTCGGACATTTATCACGTATCTCATATAGGTTGCTACTCATGGGACAGTATGGCTACTGGCAGGGTTGCAAAGTTATTCCATTCTCTACGAGTTCACGTAAGGTTTGGATGCTACTCACCAGATATCTGTTAAATGGCGCCCCGCGTATACCGTGGTTCCCAACTTAGCAAGTTTATTTAATGGCAAGCATGGAGCCTCGGACCCTCCCGGAGGACTTAAGTAGGATGGTAGCGCTGCGGTTGCGTGACGACCCGTCGACGTAAAAACATTTTCTTTACGGGAAGCTAGACTACGGACGCTGTCACAAAATATCGGAAGTGACGCGCCTGGCATATATCCTTGGCGTTAACGGAACCCCCAGTGTGACGTTATTGCGGGATCTTATAATGGAGTTGTGAGCAGACCGCTTGTTTGTTCGACCGGGCAAGGGAAAAAAAAAAAAAAAAAAAAAAAA"_dna4;;
seqan3::dna4_vector query="GCCCCACTCGTAGAGTGTCCACCAGGTACTAGGAAGACGCTTCATGCGCTTTTCCTACATGAATTGGCAGAATTCATGGACAGTTCAATGGCACTGGACAGTGTTCTTTGGCTTAAAGCTTGATATCCGAATTTCTCTACTGAGTCTGTCCTGTAAGGTTGAGTGTCGGATGCTCTCATCATATCTGTACATGGAACGCGTCCCCCGGTATCCTGCGGTCCCCAACTGTACAGTTATTTTATGTGCAACAGCATTCGGAGGCCCTCGACCTCCTGCGACTTAAGTAGAGGCGCCATGCGTGTGAATCGACACCTACCGCGACTAAAAAACTTTACTCTGGGCACGGGATACACTACGGAACCTGTCACTAAATGACGGAGAAGGCTGATGAATCCGCGGCAACCTACTAGTTAAGGCGGAGAAAGGAGACAGGTTTGCGGGATTTCATATTGGGAAGAGTTAGTGACAGCCCTTTTTGTTCAATATCCCCGCGAAAGGCT"_dna4;
seqan3::dna4 letter;
seqan3::nucleotide_scoring_scheme<int> nu_scheme;
nu_scheme.score('A'_dna4,'A'_dna4)=145;
nu_scheme.score('A'_dna4,'T'_dna4)=-144;
nu_scheme.score('A'_dna4,'G'_dna4)=-153;
nu_scheme.score('A'_dna4,'C'_dna4)=-152;
nu_scheme.score('T'_dna4,'A'_dna4)=-144;
nu_scheme.score('T'_dna4,'T'_dna4)=144;
nu_scheme.score('T'_dna4,'G'_dna4)=-143;
nu_scheme.score('T'_dna4,'C'_dna4)=-140;
nu_scheme.score('G'_dna4,'A'_dna4)=-153;
nu_scheme.score('G'_dna4,'T'_dna4)=-143;
nu_scheme.score('G'_dna4,'G'_dna4)=144;
nu_scheme.score('G'_dna4,'C'_dna4)=-145;
nu_scheme.score('C'_dna4,'A'_dna4)=-152;
nu_scheme.score('C'_dna4,'T'_dna4)=-140;
nu_scheme.score('C'_dna4,'G'_dna4)=-145;
nu_scheme.score('C'_dna4,'C'_dna4)=144;
int gap_open_cost=-52;
int gap_extend_cost=-58;
auto new_gap_cost=gap_extend_cost+gap_open_cost;
// Configure the alignment kernel.
auto config =
seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{nu_scheme}|
seqan3::align_cfg::gap_cost_affine{
seqan3::align_cfg::open_score{gap_open_cost},
seqan3::align_cfg::extension_score{gap_extend_cost}};
auto results = seqan3::align_pairwise(std::tie(ref, query), config);
auto & res = *results.begin();
seqan3::debug_stream<<"aligned_ref:\n"<<std::get<0>(res.alignment())<<"\n";
seqan3::debug_stream<<"aligned_query:\n"<<std::get<1>(res.alignment())<<"\n";
auto seq1 = std::get<0>(res.alignment());
auto seq2_iter = std::get<1>(res.alignment()).begin();
int score_so_far = 0;
int q_pos = 0;
int r_pos = 0;
enum {M,I,D} state=M;
seqan3::debug_stream<<"seq1\tseq2\tq_pos\tr_pos\tscore\n";
for (auto seq1_iter = seq1.begin(); seq1_iter < seq1.end(); seq1_iter++) {
if (*seq1_iter == seqan3::gap{}) {
q_pos++;
if (state==D) {
score_so_far += gap_extend_cost;
} else {
score_so_far += new_gap_cost;
}
state=D;
} else if (*seq2_iter == seqan3::gap{}) {
r_pos++;
if (state==I) {
score_so_far += gap_extend_cost;
} else {
score_so_far += new_gap_cost;
}
state=I;
} else {
r_pos++;
q_pos++;
seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq1_ele =
*seq1_iter;
seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq2_ele =
*seq2_iter;
score_so_far += nu_scheme.score(seq1_ele.convert_to<seqan3::dna4>(),
seq2_ele.convert_to<seqan3::dna4>());
state=M;
}
seqan3::debug_stream<<*seq1_iter<<'\t'<<*seq2_iter<<'\t'<<q_pos<<'\t'<<r_pos<<'\t'<<score_so_far<<'\n';
seq2_iter++;
}
seqan3::debug_stream<<"alignemnt score in result:"<<res.score()<<"\n";
return 0;
}
Environment
- Operating system:Ubuntu 20.04
- SeqAn version:e51658dfa5ae7ee932cecbc41eb4d16c710d1f07
- Compiler:gcc 10.3.0
Anything else?
No response
Hi @yceh,
Thank you for submitting the bug report.
I could reduce the sequences to "GGCAAGAA" and "CGAAGC".
Alignment:
0 . :
GGCAAGAA---
| | |
--C--G-AAGC
Score Matrix:
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;0 ;-110;-168;-226;-284;-342;-400;-458;-516;
C;-110;-145;-255;-24 ;-134;-192;-250;-308;-366;
G;-168;34 ;-1 ;-111;-169;-227;-48 ;-158;-216;
A;-226;-76 ;-111;-153;34 ;-24 ;-134;97 ;-13 ;
A;-284;-134;-169;-250;-8 ;179 ;69 ;11 ;242 ;
G;-342;-140;10 ;-100;-118;69 ;323 ;213 ;155 ;
C;-400;-250;-100;154 ;44 ;11 ;213 ;171 ;74 ;
Trace Matrix:
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;N ;L ;l ;l ; ; ; ; ; ;
C;U ;DUL ;DUL ;DUl ;L ;l ;l ;l ;l ;
G;u ;DUL ;DuL ;L ;l ;l ;DUl ;L ;l ;
A;u ;UL ;UL ;DuL ;DUL ;DUL ;l ;DUl ;DUL ;
A;u ;uL ;uL ;uL ;DUl ;DUL ;L ;DUl ;DUl ;
G;u ;DuL ;DuL ;L ;Ul ;Ul ;DUL ;L ;l ;
C;u ;uL ;UL ;DUL ;L ;ul ;Ul ;DUL ;uL ;
Path:
u;u;u;d;l;d;l;l;d;l;l;
Score Matrix (Just Path):
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;0 ;-110;-168; ; ; ; ; ; ;
C; ; ; ;-24 ;-134;-192; ; ; ;
G; ; ; ; ; ; ;-48 ;-158; ;
A; ; ; ; ; ; ; ; ;-13 ;
A; ; ; ; ; ; ; ; ;242 ;
G; ; ; ; ; ; ; ; ;155 ;
C; ; ; ; ; ; ; ; ;74 ;
Trace Matrix (Just Path):
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;N ;L ;l ; ; ; ; ; ; ;
C; ; ; ;D ;L ;l ; ; ; ;
G; ; ; ; ; ; ;D ;L ; ;
A; ; ; ; ; ; ; ; ;D ;
A; ; ; ; ; ; ; ; ;U ;
G; ; ; ; ; ; ; ; ;u ;
C; ; ; ; ; ; ; ; ;u ;
seq1 seq2 state q_pos r_pos score
G - INS 0 1 -110 SAME SCORE
G - INS 0 2 -168 SAME SCORE
C C MAT 1 3 -24 SAME SCORE
A - INS 1 4 -134 SAME SCORE
A - INS 1 5 -192 SAME SCORE
G G MAT 2 6 -48 SAME SCORE
A - INS 2 7 -158 SAME SCORE
A A MAT 3 8 -13 SAME SCORE
- A DEL 4 8 -123 DIFFERENT SCORE
- G DEL 5 8 -181 DIFFERENT SCORE
- C DEL 6 8 -239 DIFFERENT SCORE
If you add
#include <seqan3/alignment/aligned_sequence/debug_stream_alignment.hpp>
you can print the alignment:
auto & res = *results.begin();
seqan3::debug_stream << "alignment:\n" << res << "\n";
I found that changing the scoring scheme type to int8_t
(default) results in the same score.
seqan3::nucleotide_scoring_scheme<int8_t>
Mind that with int8_t
the scores underflow. Haven't yet tried changing the scores to something in [-128,127]
Something goes wrong with the traceback.
The score is actually correct.
0 . :
GGCAAGAA---
| | |
--C--G-AAGC
The score for this alignment is -239.
Is not optimal, because you could align both A:
0 . :
GGCAAGAA--
| | |
--C--GAAGC
The score for this alignment is 74
.
https://godbolt.org/z/qrePrW67Y
Some debug code
#include <utility>
#include <vector>
#include <seqan3/alignment/aligned_sequence/debug_stream_alignment.hpp>
#include <seqan3/alignment/configuration/align_config_gap_cost_affine.hpp>
#include <seqan3/alignment/configuration/all.hpp>
#include <seqan3/alignment/matrix/detail/debug_matrix.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alignment/scoring/nucleotide_scoring_scheme.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
int main(int argc, char ** argv)
{
using namespace seqan3::literals;
seqan3::dna4_vector ref = "GGCAAGAA"_dna4;
seqan3::dna4_vector query = "CGAAGC"_dna4;
seqan3::dna4 letter;
seqan3::nucleotide_scoring_scheme<int> nu_scheme;
nu_scheme.score('A'_dna4, 'A'_dna4) = 145;
nu_scheme.score('A'_dna4, 'T'_dna4) = -144;
nu_scheme.score('A'_dna4, 'G'_dna4) = -153;
nu_scheme.score('A'_dna4, 'C'_dna4) = -152;
nu_scheme.score('T'_dna4, 'A'_dna4) = -144;
nu_scheme.score('T'_dna4, 'T'_dna4) = 144;
nu_scheme.score('T'_dna4, 'G'_dna4) = -143;
nu_scheme.score('T'_dna4, 'C'_dna4) = -140;
nu_scheme.score('G'_dna4, 'A'_dna4) = -153;
nu_scheme.score('G'_dna4, 'T'_dna4) = -143;
nu_scheme.score('G'_dna4, 'G'_dna4) = 144;
nu_scheme.score('G'_dna4, 'C'_dna4) = -145;
nu_scheme.score('C'_dna4, 'A'_dna4) = -152;
nu_scheme.score('C'_dna4, 'T'_dna4) = -140;
nu_scheme.score('C'_dna4, 'G'_dna4) = -145;
nu_scheme.score('C'_dna4, 'C'_dna4) = 144;
int gap_open_cost = -52;
int gap_extend_cost = -58;
auto new_gap_cost = gap_extend_cost + gap_open_cost;
// Configure the alignment kernel.
auto config = seqan3::align_cfg::method_global{} | seqan3::align_cfg::scoring_scheme{nu_scheme}
| seqan3::align_cfg::gap_cost_affine{seqan3::align_cfg::open_score{gap_open_cost},
seqan3::align_cfg::extension_score{gap_extend_cost}}
| seqan3::align_cfg::detail::debug{};
auto results = seqan3::align_pairwise(std::tie(ref, query), config);
auto & res = *results.begin();
seqan3::debug_stream << res.alignment() << std::endl;
seqan3::debug_stream << seqan3::detail::debug_matrix{res.score_matrix()} << std::endl;
seqan3::debug_stream << seqan3::detail::debug_matrix{res.trace_matrix()} << std::endl;
// seqan3::debug_stream<<"aligned_ref:\n"<<std::get<0>(res.alignment())<<"\n";
// seqan3::debug_stream<<"aligned_query:\n"<<std::get<1>(res.alignment())<<"\n";
auto seq1 = std::get<0>(res.alignment());
auto seq2_iter = std::get<1>(res.alignment()).begin();
int score_so_far = 0;
int q_pos = 0;
int r_pos = 0;
enum
{
M,
I,
D
} state = M;
seqan3::debug_stream << "seq1\tseq2\tstate\tq_pos\tr_pos\tscore\n";
for (auto seq1_iter = seq1.begin(); seq1_iter < seq1.end(); seq1_iter++)
{
if (*seq1_iter == seqan3::gap{})
{
q_pos++;
if (state == D)
{
score_so_far += gap_extend_cost;
}
else
{
score_so_far += new_gap_cost;
}
state = D;
}
else if (*seq2_iter == seqan3::gap{})
{
r_pos++;
if (state == I)
{
score_so_far += gap_extend_cost;
}
else
{
score_so_far += new_gap_cost;
}
state = I;
}
else
{
r_pos++;
q_pos++;
seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq1_ele = *seq1_iter;
seqan3::alphabet_variant<seqan3::dna4, seqan3::gap> seq2_ele = *seq2_iter;
score_so_far += nu_scheme.score(seq1_ele.convert_to<seqan3::dna4>(), seq2_ele.convert_to<seqan3::dna4>());
state = M;
}
seqan3::debug_stream << *seq1_iter << '\t' << *seq2_iter << '\t'
<< (state == D ? "DEL" : (state == I ? "INS" : "MAT")) << '\t' << q_pos << '\t' << r_pos
<< '\t' << score_so_far << '\n';
seq2_iter++;
}
seqan3::debug_stream << "alignemnt score in result:" << res.score() << ", " << score_so_far << "\n";
return res.score() == score_so_far ? 0 : -1;
}
Alignment:
0 . :
GGCAAGAA---
| | |
--C--G-AAGC
Score Matrix:
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;0 ;-110;-168;-226;-284;-342;-400;-458;-516;
C;-110;-145;-255;-24 ;-134;-192;-250;-308;-366;
G;-168;34 ;-1 ;-111;-169;-227;-48 ;-158;-216;
A;-226;-76 ;-111;-153;34 ;-24 ;-134;97 ;-13 ;
A;-284;-134;-169;-250;-8 ;179 ;69 ;11 ;242 ;
G;-342;-140;10 ;-100;-118;69 ;323 ;213 ;155 ;
C;-400;-250;-100;154 ;44 ;11 ;213 ;171 ;74 ;
Trace Matrix:
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;N ;L ;l ;l ; ; ; ; ; ;
C;U ;DUL ;DUL ;DUl ;L ;l ;l ;l ;l ;
G;u ;DUL ;DuL ;L ;l ;l ;DUl ;L ;l ;
A;u ;UL ;UL ;DuL ;DUL ;DUL ;l ;DUl ;DUL ;
A;u ;uL ;uL ;uL ;DUl ;DUL ;L ;DUl ;DUl ;
G;u ;DuL ;DuL ;L ;Ul ;Ul ;DUL ;L ;l ;
C;u ;uL ;UL ;DUL ;L ;ul ;Ul ;DUL ;uL ;
Path:
u;u;u;d;l;d;l;l;d;l;l;
Score Matrix (Just Path):
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;0 ;-110;-168; ; ; ; ; ; ;
C; ; ; ;-24 ;-134;-192; ; ; ;
G; ; ; ; ; ; ;-48 ;-158; ;
A; ; ; ; ; ; ; ; ;-13 ;
A; ; ; ; ; ; ; ; ;242 ;
G; ; ; ; ; ; ; ; ;155 ;
C; ; ; ; ; ; ; ; ;74 ;
Trace Matrix (Just Path):
; ;G ;G ;C ;A ;A ;G ;A ;A ;
;N ;L ;l ; ; ; ; ; ; ;
C; ; ; ;D ;L ;l ; ; ; ;
G; ; ; ; ; ; ;D ;L ; ;
A; ; ; ; ; ; ; ; ;D ;
A; ; ; ; ; ; ; ; ;U ;
G; ; ; ; ; ; ; ; ;U ;
C; ; ; ; ; ; ; ; ;u ;
seq1 seq2 state q_pos r_pos score
G - INS 0 1 -110 SAME SCORE
G - INS 0 2 -168 SAME SCORE
C C MAT 1 3 -24 SAME SCORE
A - INS 1 4 -134 SAME SCORE
A - INS 1 5 -192 SAME SCORE
G G MAT 2 6 -48 SAME SCORE
A - INS 2 7 -158 SAME SCORE
A A MAT 3 8 -13 SAME SCORE
- A DEL 4 8 -123 DIFFERENT SCORE
- G DEL 5 8 -181 DIFFERENT SCORE
- C DEL 6 8 -239 DIFFERENT SCORE
OK, I did some digging.
The Gotoh paper states (section 3 is Traceback
):
The meaning of the parameters is not that explicitly stated in the paper, but my interpretation is:
u
is the gap costL
seems to refer to the gap weighing function, we have gap open and extension, henceL=2
max d(x,y)
is the maximum score, a mismatch has a positive score
The equation seems to be equivalent to gap_open + 2 * gap_extension >= mismatch_score
in our implementation.
E.g.,
nu_scheme.set_simple_scheme(seqan3::match_score{145}, seqan3::mismatch_score{-144});
int const gap_open_cost = -100;
int const gap_extend_cost = -22; // works
int const gap_extend_cost = -23; // does not work
nu_scheme.set_simple_scheme(seqan3::match_score{145}, seqan3::mismatch_score{-146});
int const gap_open_cost = -100;
int const gap_extend_cost = -22; // works
int const gap_extend_cost = -23; // works
In the example, the scores are more sophisticated, I think the most safe would be to go with the most positive mismatch
score. However, it works for a gap_extension of 46
, but not 47
, so only certain costs are involved for this specific alignment.
All in all, it seems that the algorithm does not work because we are in violation of this assumption made by Gotoh.
Edit:
- The constraint should be more like
2 * gap_open < max_mismatch
, i.e. 2 gaps are more expensive than a mismatch - So above reasoning doesn't hold
- Maybe we need to track being in the last column? In global alignment, we want to prevent coming from left in the last column, unless we are in the last cell.
Hey, so I looked at it today and the problem was that on very few occasions the original up open
signal was overwritten by a better score coming from the left
.
This happened here in the cell [c8:r5], where extending the gap in horizontal direction outscores the diagonal and vertical direction.
However, the optimal alignment comping from up
ended in a gap open which was not captured before.
I fixed this in #3098 by silently tracking the up open direction in case the left direction dominates.