Reported CIGAR string is for a suboptimal alignment in some cases.
Opened this issue · 1 comments
macieksk commented
Hi,
The example code here uses an aligner with the following scoring scheme:
Match:2, Mismatch:-1, all_gaps_penalties=1
import ssw_aligner
query='CAGACAATCAGCATGTTTCCGGCAGCGCCGGTAG'
target='TTCCACCATTTGTCCGGACCGGGC'
def get_striped_ed_aligner(query_seq,match_score=2):
return ssw_aligner.StripedSmithWaterman(
query_seq,
gap_open_penalty=1,
gap_extend_penalty=1,
match_score=match_score,
mismatch_score=-1,
suppress_sequences=True,
score_only=False,
score_size=2, ## score is < 255 this should be 0; >255 then 1; 2 don't know
zero_index=True,
mask_length=0, ## Turn off suboptimal
mask_auto=False, ## Turn off suboptimal
)
algn=get_striped_ed_aligner(query,match_score=2)(target)
cigcnt=defaultdict(lambda:0)
def add_tl(d,t,l): d[t]+=l; return
[add_tl(cigcnt,t,l) for l, t in algn._tuples_from_cigar()]
print(cigcnt)
print(algn.__repr__())
This results with CIGAR counts and the alignment stats:
defaultdict(<function <lambda> at 0x7f35c0ec2048>, {'M': 19, 'I': 5, 'D': 1})
{
'optimal_alignment_score': 28,
'suboptimal_alignment_score': 0,
'query_begin': 7,
'query_end': 30,
'target_begin': 1,
'target_end_optimal': 21,
'target_end_suboptimal': -1,
'cigar': '7M1I2M1D5M1I1M3I4M',
'query_sequence': '',
'target_sequence': ''
}
The issue here is that the CIGAR string returned in the alignment above is for a suboptimal alignment of score 26, see the computations below:
#CAGACAA |TCAGCATGTT TCCGG CAGCGCCGG| TAG
# T |TCCACAT TTGTCCGG A CCGG| GC
# 'cigar': '7M 1I2M1D 5M1I1M 3I 4M',
# M=5+2+2+5+1+4=19 Mismatch=2 Match=17 I=5 D=1
# score=2*17-2-5-1=26
print(2*17-2-5-1)
26
The optimal alignment with the reported score has a different CIGAR string, see below.
#CAGACAA |TC AG CATGTT TCCGGCAGCGCCGG| TAG
# T |TC CA CAT TTG TCCGG A CCGG| GC
# 'cigar':'2M1D1M1I3M1I2M1D 5M1I1M3I 4M',
# M=18 Mismatch=0 I=6 D=2
# score=2*18-6-2=28
print(2*18-6-2)
28
Is it possible to fix it such that the reported CIGAR string is always for an optimal alignment?
macieksk commented
Possibly solved in scikit-bio/scikit-bio#1679 ... though seems not quite from tests.