lh3/ksw2

Unexpected behavior of ksw_extd when tlen > qlen + w

Opened this issue · 0 comments

I am comparing the behavior of ksw_extd() and ksw_extd2_sse(). One difference I noticed is the case when tlen > qlen + w, i.e. when target is long enough for some rows of the matrix to not contain a band while when doing a "full" alignment (KSW_EZ_EXTZ_ONLY not specified). In that case backtracking starts at position (tlen-1, qlen-1) which is outside of band.

For ksw_extd2_sse() in that case this condition is true:

ksw2/ksw2.h

Line 129 in 4e0a1cc

if (off_end && i > off_end[r]) force_state = 1;

As force_state == 1 this condition is true:

ksw2/ksw2.h

Line 141 in 4e0a1cc

else if (state == 1 || (state == 3 && min_intron_len <= 0)) cigar = ksw_push_cigar(km, &n_cigar, &m_cigar, cigar, 2, 1), --i; // deletion

so i (target index) keeps being reduced until tlen == qlen + w, at which point we are within the band and a good (although likely not optimal) alignment is found.

For ksw_extd() in this case this condition is true:

ksw2/ksw2.h

Line 132 in 4e0a1cc

if (j < off[i]) force_state = 2;

followed by this one:

ksw2/ksw2.h

Line 143 in 4e0a1cc

else cigar = ksw_push_cigar(km, &n_cigar, &m_cigar, cigar, 1, 1), --j; // insertion

so in this case j (query index) gets reduced and we are effectively moving away from the band until we reach qlen == 0, after which we move upwards, thus only generating a lot of insertions followed by a lot of deletions.

I would say that in this case ksw_backtrack() should be edited so that for is_rot == false instead of

if (j < off[i]) force_state = 2;
if (off_end && j > off_end[i]) force_state = 1;

there should be

if (j < off[i]) force_state = 1;
if (off_end && j > off_end[i]) force_state = 2;

In this case we would properly move upwards (reducing target index) until reaching the band, like for ksw_extd2_sse().

The same argument holds for cases when qlen > tlen + w, i.e. starting element is right of the band. In that case ksw_extd2_sse() performs insertions (reduces query index) until it reaches the band, whereas ksw_extd() decreases target index index until reaching tlen == 0, after which it just decreases query index, again effectively doing a lot of deletions follow by insertions, completely avoiding the band. The change I proposed above would also solve this issue.

The case when qlen > tlen + w is actually even worse for ksw_extd() as it does not pass off_end:

ksw_backtrack(km, 0, rev_cigar, 0, z, off, 0, n_col, tlen-1, qlen-1, &ez->m_cigar, &ez->n_cigar, &ez->cigar);

This means that this condition

ksw2/ksw2.h

Line 133 in 4e0a1cc

if (off_end && j > off_end[i]) force_state = 1;

is never true so on this line

ksw2/ksw2.h

Line 134 in 4e0a1cc

tmp = force_state < 0? p[(size_t)i * n_col + j - off[i]] : 0;

accesses a wrong element (never set or even out of bound) of p is accessed.
To fix this issue one would have to generate off_end and pass it, which is a relatively easy task.

To test this I created a reproducer. It uses w==3 and ​qlen==6 and both query and target have all the same basepairs.
For tlen==9 CIGAR has 3 deletions and 6 matches, as expected. For tlen==10 it generates 10 deletions and 6 insertions, in accordance with what I explained above, but significantly worse than 4 deletions and 6 matches it could generate.
For ksw_extd2_sse() I did not manage to get the expected result because this check

if (st > en) {

leads to a zdrop even when zdrop is disabled (zdrop < 0) (not sure if this is by design, but I am planning to file a separate issue). If this check is deleted the generated CIGAR has 4 deletions and 6 matches, as expected.

To run the reproducer copy the code from below into main.cpp and copy ksw2.h, ksw2_extd.c and ksw2_extd2_sse.c to the same folder. Then build with g++ --std=c++17 main.cpp ksw2_extd.c ksw2_extd2_sse.c -o ksw2_repro and execute with ./ksw2_repro

#include <iostream>
#include <vector>

#include "ksw2.h"

void print_cigar(const ksw_extz_t& ez)
{
    if (ez.zdropped == 0) std::cerr << "Not zdropped" << std::endl;
    else std::cout << "ZDROPPED!" << std::endl;

    std::cout << "CIGAR string (operation, length):" << std::endl;
    for(uint32_t cigar_idx = 0; cigar_idx < ez.n_cigar; ++cigar_idx)
    {
        const uint32_t operation = ez.cigar[cigar_idx] & 0xF;
        const uint32_t length    = ez.cigar[cigar_idx] >> 4;
        switch (operation)
        {
            case 0: std::cout << "match:     "; break;
            case 1: std::cout << "insertion: "; break;
            case 2: std::cout << "deletion:  "; break;
            case 3: std::cout << "intron:    "; break;
            default: std::cout << "UNKNOWN OPERATION! ";
        }
        std::cout << length << std::endl;
    }
}

int main()
{
    void* km = nullptr;

    const std::vector<uint8_t> query(6, 0);
    const std::vector<uint8_t> target(10, 0);

    std::cout << "Query length:  " << query.size() << std::endl;
    std::cout << "Target length: " << target.size() << std::endl << std::endl;

    constexpr int8_t m            = 5;
    const std::vector<int8_t> mat = {+2, -4, -4, -4, -1,
                                     -4, +2, -4, -4, -1,
                                     -4, -4, +2, -4, -1,
                                     -4, -4, -4, +2, -1,
                                     -1, -1, -1, -1, -1};

    constexpr int8_t gapo  = 4;
    constexpr int8_t gape  = 2;
    constexpr int8_t gapo2 = 24;
    constexpr int8_t gape2 = 1;

    constexpr int32_t w     = 3;
    constexpr int32_t zdrop = -1;
    constexpr int32_t flag  = 0;

    std::cout << "** Green's formulation (ksw_extd) **" << std::endl;

    ksw_extz_t ez_green;
    ksw_reset_extz(&ez_green);
    ez_green.m_cigar = 2;
    ez_green.cigar = (uint32_t*) malloc(ez_green.m_cigar * sizeof(uint32_t));
    ksw_extd(km,
            query.size(),
            query.data(),
            target.size(),
            target.data(),
            m,
            mat.data(),
            gapo,
            gape,
            gapo2,
            gape2,
            w,
            zdrop,
            flag,
            &ez_green);
    print_cigar(ez_green);

    std::cout << std::endl <<  "** Suzuki's formulation (ksw_extd2_sse) **" << std::endl;
    ksw_extz_t ez_suzuki;
    ksw_reset_extz(&ez_suzuki);
    ez_suzuki.m_cigar = 2;
    ez_suzuki.cigar = (uint32_t*) malloc(ez_suzuki.m_cigar * sizeof(uint32_t));
    ksw_extd2_sse(km,
                  query.size(),
                  query.data(),
                  target.size(),
                  target.data(),
                  m,
                  mat.data(),
                  gapo,
                  gape,
                  gapo2,
                  gape2,
                  w,
                  zdrop,
                  0,
                  flag,
                  &ez_suzuki);
    print_cigar(ez_suzuki);

    free(ez_green.cigar);
    free(ez_suzuki.cigar);

    return 0;
}