Reported CIGAR string is for a suboptimal alignment in some cases.

Question

Reported CIGAR string is for a suboptimal alignment in some cases.

Opened this issue 5 years ago · 1 comments

Hi,

The example code here uses an aligner with the following scoring scheme:
Match:2, Mismatch:-1, all_gaps_penalties=1

import ssw_aligner

query='CAGACAATCAGCATGTTTCCGGCAGCGCCGGTAG'
target='TTCCACCATTTGTCCGGACCGGGC'

def get_striped_ed_aligner(query_seq,match_score=2):
    return ssw_aligner.StripedSmithWaterman(
      query_seq,
      gap_open_penalty=1,
      gap_extend_penalty=1,
      match_score=match_score,
      mismatch_score=-1,
      suppress_sequences=True,
      score_only=False,
      score_size=2, ## score is < 255 this should be 0; >255 then 1; 2 don't know
      zero_index=True,
      mask_length=0,   ## Turn off suboptimal
      mask_auto=False, ## Turn off suboptimal
    )
algn=get_striped_ed_aligner(query,match_score=2)(target)

cigcnt=defaultdict(lambda:0)
def add_tl(d,t,l): d[t]+=l; return
[add_tl(cigcnt,t,l) for l, t in algn._tuples_from_cigar()]
print(cigcnt)

print(algn.__repr__())

This results with CIGAR counts and the alignment stats:

defaultdict(<function <lambda> at 0x7f35c0ec2048>, {'M': 19, 'I': 5, 'D': 1})
{
    'optimal_alignment_score': 28,
    'suboptimal_alignment_score': 0,
    'query_begin': 7,
    'query_end': 30,
    'target_begin': 1,
    'target_end_optimal': 21,
    'target_end_suboptimal': -1,
    'cigar': '7M1I2M1D5M1I1M3I4M',
    'query_sequence': '',
    'target_sequence': ''
}

The issue here is that the CIGAR string returned in the alignment above is for a suboptimal alignment of score 26, see the computations below:

#CAGACAA |TCAGCATGTT TCCGG CAGCGCCGG| TAG
#      T |TCCACAT TTGTCCGG  A   CCGG| GC
# 'cigar':  '7M    1I2M1D 5M1I1M 3I  4M',
#                           M=5+2+2+5+1+4=19  Mismatch=2  Match=17 I=5 D=1
#                           score=2*17-2-5-1=26
print(2*17-2-5-1)
26

The optimal alignment with the reported score has a different CIGAR string, see below.

#CAGACAA |TC  AG  CATGTT  TCCGGCAGCGCCGG| TAG
#      T |TC CA   CAT TTG TCCGG A   CCGG| GC
# 'cigar':'2M1D1M1I3M1I2M1D  5M1I1M3I  4M',
#                           M=18 Mismatch=0 I=6 D=2  
#                           score=2*18-6-2=28
print(2*18-6-2)
28

Is it possible to fix it such that the reported CIGAR string is always for an optimal alignment?

Answer 1 · 2019-12-04T17:23:24.000Z

Possibly solved in scikit-bio/scikit-bio#1679 ... though seems not quite from tests.