lh3/ksw2

Question about extension alignment

mwykes opened this issue · 6 comments

Hi @lh3,

I'd like to use ksw2 in a project and have been running a few initial tests with ksw2-test which have raised some questions.

I'd like to do extension alignment where the ends are free (and had understood from your answer to #5) that I should be able to do this with ksw2.

However, alignments produced by ksw2-test appear to be global alignments which consume all of the query and reference.

./ksw2-test  ATCCCATGCCATGCC ATGCC -t extz
first   second  -14     4       1       1       10D5M
./ksw2-test  ATCCCATGCCATGCC ATGCC -t extz -r
first   second  -14     4       1       1       2M10D3M

The result I'd ideally like would be simply 5M, then truncate the alignment.

Is there a way to achieve this with ksw2? Is there e.g. a way to tell ksw2 that gaps at the ends of sequences should be free?

Thanks,
Mike

rob-p commented

The extension in extension alignment works in one direction. That is, only the tail end is free. Further, only the tail end of either the read or reference is free, not both. It seems like what you're looking for (at least if you dont know your match will always be at the very start or very end) is local alignment. That will return the highest scoring substrings without penalizing non-matching partsbof the strings. Ksw2 doesn't currently have a local alignment API, byt there are a number of other C/C++ libraries that do as its a fairly common alignment type. For example, this is a fairly popular implementation for stripped local alignment.

Hi @rob-p,

Thanks for the reply. The match will always be at the very start and I only need the tail at the end of the reference to be free, so was hoping I could use ksw2 for this. From what you say, it sounds like this should be possible - could you provide some hints as to how to do this? edlib using the option SHW achives this, but I'd like to use an aligner with distinct mismatch, gap and gap extension costs. I also came accross seqan, which has four flags which can be set to determine whether gaps at the four ends should be penalised. Are there similar flags in ksw2 that I should be setting?

lh3 commented

Ksw2 is primarily developed for extension alignment but the command line tool doesn't expose the extension functionality. ksw_extz_t::max_q and max_t give the ending positions.

rob-p commented

Hi @mwykes,

So you may want to check out this discussion between @lh3 and I a couple of years ago. The disconnect I see between what KSW2 does and what you want is that extension alignment dictates that, given that you know the position where the alignment starts one of the following two conditions is true; either:

  • The read/query aligns to its end, but this isn't the end of the corresponding reference sequence provided to the ksw2 call (i.e. you have a "free end" on the reference but align to the end of the query)
  • The read/query aligns to the end of the reference, but this isn't the end of the query itself (i.e. you have a "free end" on the query, but align to the end of the reference)

This is, in fact, what the ksw_extz_t::max_q and max_t values that @lh3 mentions refer to. It seems what you want is that you know the start position, but you have free ends for both the query and reference — sort of an "anchored" local alignment. This is, of course, possible by changing the dynamic program slightly, but I don't believe that ksw2 offers this functionality.

lh3 commented

Also, with extension alignment, both query and target can end early. SHW is not extension alignment and in fact the edlib algorithm can't support real extension alignment. The seqan function you are quoting is not extension alignment, either.

I only need the tail at the end of the reference to be free

This is not extension alignment but ksw2 can mimic this. On return, ksw_extz_t::mqe and mqe_t keep the score and target position if the alignment reaches the end of the query. Note that if the two tails are very different, ksw2 may stop searching before reaching the end of the query. In that case, mqe will be set to KSW_NEG_INF.

The recommended way to use mqe is to set end_bonus when calling ksw_extz2_sse etc. If the max local extension alignment score is higher than max query-end alignment score plus end_bonus, ksw2 reports local extension; otherwise, it reports the alignment that reaches the query end. This helps to reduce reference biases for short reads. It is a minor but important feature. Both bwa-mem and minimap2 have this functionality.

Thanks @lh3!

Setting the KSW_EZ_EXTZ_ONLY flag and a large end_bonus, I was indeed able to achieve what I want, whatever it may be called! ;-)