Unexpected behavior of ksw_extd when tlen > qlen + w
Opened this issue · 0 comments
I am comparing the behavior of ksw_extd()
and ksw_extd2_sse()
. One difference I noticed is the case when tlen > qlen + w
, i.e. when target is long enough for some rows of the matrix to not contain a band while when doing a "full" alignment (KSW_EZ_EXTZ_ONLY
not specified). In that case backtracking starts at position (tlen-1, qlen-1)
which is outside of band.
For ksw_extd2_sse()
in that case this condition is true:
Line 129 in 4e0a1cc
As
force_state == 1
this condition is true:Line 141 in 4e0a1cc
so
i
(target index) keeps being reduced until tlen == qlen + w
, at which point we are within the band and a good (although likely not optimal) alignment is found.
For ksw_extd()
in this case this condition is true:
Line 132 in 4e0a1cc
followed by this one:
Line 143 in 4e0a1cc
so in this case
j
(query index) gets reduced and we are effectively moving away from the band until we reach qlen == 0
, after which we move upwards, thus only generating a lot of insertions followed by a lot of deletions.
I would say that in this case ksw_backtrack()
should be edited so that for is_rot == false
instead of
if (j < off[i]) force_state = 2;
if (off_end && j > off_end[i]) force_state = 1;
there should be
if (j < off[i]) force_state = 1;
if (off_end && j > off_end[i]) force_state = 2;
In this case we would properly move upwards (reducing target index) until reaching the band, like for ksw_extd2_sse()
.
The same argument holds for cases when qlen > tlen + w
, i.e. starting element is right of the band. In that case ksw_extd2_sse()
performs insertions (reduces query index) until it reaches the band, whereas ksw_extd()
decreases target index index until reaching tlen == 0
, after which it just decreases query index, again effectively doing a lot of deletions follow by insertions, completely avoiding the band. The change I proposed above would also solve this issue.
The case when qlen > tlen + w
is actually even worse for ksw_extd()
as it does not pass off_end
:
Line 170 in 4e0a1cc
This means that this condition
Line 133 in 4e0a1cc
is never true so on this line
Line 134 in 4e0a1cc
accesses a wrong element (never set or even out of bound) of
p
is accessed.To fix this issue one would have to generate
off_end
and pass it, which is a relatively easy task.
To test this I created a reproducer. It uses w==3
and qlen==6
and both query and target have all the same basepairs.
For tlen==9
CIGAR has 3 deletions and 6 matches, as expected. For tlen==10
it generates 10 deletions and 6 insertions, in accordance with what I explained above, but significantly worse than 4 deletions and 6 matches it could generate.
For ksw_extd2_sse()
I did not manage to get the expected result because this check
Line 134 in 4e0a1cc
leads to a zdrop even when zdrop is disabled (
zdrop < 0
) (not sure if this is by design, but I am planning to file a separate issue). If this check is deleted the generated CIGAR has 4 deletions and 6 matches, as expected.
To run the reproducer copy the code from below into main.cpp
and copy ksw2.h
, ksw2_extd.c
and ksw2_extd2_sse.c
to the same folder. Then build with g++ --std=c++17 main.cpp ksw2_extd.c ksw2_extd2_sse.c -o ksw2_repro
and execute with ./ksw2_repro
#include <iostream>
#include <vector>
#include "ksw2.h"
void print_cigar(const ksw_extz_t& ez)
{
if (ez.zdropped == 0) std::cerr << "Not zdropped" << std::endl;
else std::cout << "ZDROPPED!" << std::endl;
std::cout << "CIGAR string (operation, length):" << std::endl;
for(uint32_t cigar_idx = 0; cigar_idx < ez.n_cigar; ++cigar_idx)
{
const uint32_t operation = ez.cigar[cigar_idx] & 0xF;
const uint32_t length = ez.cigar[cigar_idx] >> 4;
switch (operation)
{
case 0: std::cout << "match: "; break;
case 1: std::cout << "insertion: "; break;
case 2: std::cout << "deletion: "; break;
case 3: std::cout << "intron: "; break;
default: std::cout << "UNKNOWN OPERATION! ";
}
std::cout << length << std::endl;
}
}
int main()
{
void* km = nullptr;
const std::vector<uint8_t> query(6, 0);
const std::vector<uint8_t> target(10, 0);
std::cout << "Query length: " << query.size() << std::endl;
std::cout << "Target length: " << target.size() << std::endl << std::endl;
constexpr int8_t m = 5;
const std::vector<int8_t> mat = {+2, -4, -4, -4, -1,
-4, +2, -4, -4, -1,
-4, -4, +2, -4, -1,
-4, -4, -4, +2, -1,
-1, -1, -1, -1, -1};
constexpr int8_t gapo = 4;
constexpr int8_t gape = 2;
constexpr int8_t gapo2 = 24;
constexpr int8_t gape2 = 1;
constexpr int32_t w = 3;
constexpr int32_t zdrop = -1;
constexpr int32_t flag = 0;
std::cout << "** Green's formulation (ksw_extd) **" << std::endl;
ksw_extz_t ez_green;
ksw_reset_extz(&ez_green);
ez_green.m_cigar = 2;
ez_green.cigar = (uint32_t*) malloc(ez_green.m_cigar * sizeof(uint32_t));
ksw_extd(km,
query.size(),
query.data(),
target.size(),
target.data(),
m,
mat.data(),
gapo,
gape,
gapo2,
gape2,
w,
zdrop,
flag,
&ez_green);
print_cigar(ez_green);
std::cout << std::endl << "** Suzuki's formulation (ksw_extd2_sse) **" << std::endl;
ksw_extz_t ez_suzuki;
ksw_reset_extz(&ez_suzuki);
ez_suzuki.m_cigar = 2;
ez_suzuki.cigar = (uint32_t*) malloc(ez_suzuki.m_cigar * sizeof(uint32_t));
ksw_extd2_sse(km,
query.size(),
query.data(),
target.size(),
target.data(),
m,
mat.data(),
gapo,
gape,
gapo2,
gape2,
w,
zdrop,
0,
flag,
&ez_suzuki);
print_cigar(ez_suzuki);
free(ez_green.cigar);
free(ez_suzuki.cigar);
return 0;
}