ArtPoon/gotoh2

Failing simple alignment with gop=10 (default) but not with gop=8

Opened this issue · 8 comments

from gotoh2 import Aligner, map_coordinates

g1 = Aligner(gop=10)  # default gop=10
g2 = Aligner(gop=8)   

ref   = 'TACGTA'
query = 'TACTA'  # G removed
a1 = g1.align(ref, query)
a2 = g2.align(ref, query)
print(a1)
print(a2)

('TACGTA', 'TACT-A', 14)
('TACGTA', 'TAC-TA', 16)

Thanks for reporting this @davidchampredon
I've reproduced the issue.

Score is consistent with the correct alignment

 T  A  C   G  T  A
 T  A  C   -  T  A 
+5 +5 +5 -11 +5 +5 = 14

It's issue #16 - I knew I'd pay for that cheap fix some day..
Commenting out these lines:

gotoh2/src/_gotoh2.c

Lines 121 to 126 in 348f7f7

if (!set.is_global && i == mx->nrows) {
mx->bits[(i * (mx->ncols+1)) + j] = 4;
}
if (!set.is_global && j == mx->ncols) {
mx->bits[(i * (mx->ncols+1)) + j] = 4;
}

yields the correct alignment, but breaks two other unit tests.

It wasn't those lines - the problem was that local alignment requires one to zero out positive costs in the R matrix.

Crap. Traceback failed while I was validating these changes on SARS-COV-2 sequences:

Process 0 of 1 running hCoV19/pangolin/Guangxi/P4L/2017|EPI_ISL_410538|2017
traceback failed: i=1 j=1 bit=0
Traceback (most recent call last):
  File "scripts/first-align.py", line 135, in <module>
    main()
  File "scripts/first-align.py", line 119, in main
    trim_seq, inserts = pair_align(refseq, s, threshold=args.threshold)
  File "scripts/first-align.py", line 66, in pair_align
    aref, aquery, ascore = aligner.align(ref, query)
  File "/usr/local/lib/python3.6/dist-packages/gotoh2-0.1-py3.6-linux-x86_64.egg/gotoh2.py", line 99, in align
    self.matrix
RuntimeError: Traceback failed, try local alignment

Temporary patch:

  • revert these lines of code:

    gotoh2/src/_gotoh2.c

    Lines 190 to 192 in 65a29a5

    if (!set.is_global && mx->R[here] > 0) {
    mx->R[here] = 0;
    }

  • SARS-COV-2 alignments looked fine but let's switch to another alignment program and write a wrapper for gotoh2 to use that program by default (C program optional)

  • time permitting, diagnose traceback issue...

Cost matrix:

   *  A   C   T   A
*  0  0   0   0   0 
A  0 -5   0   0  -5 
C  0  0 -10   0   0 
G  0  0   0  -6   0 
T  0  0   0  -4  -2 
A  0 -5   0   0  -9 

Bit matrix before edge assignment:

  *  A   C  T  A
* 80 80  80 80 16  4 
A 80 84  44 44 20  4 
C 80 44  84 50 10  4 
G 80 44  73 84 14  4 
T 80 44 105 76 28  4 
A 64 68  33 37  4  4 
   4  4   4  4  4  4

Bit matrix after edge assignment:

   *  A  C  T  A
*  0 80 80  0  0  4  
A 80  4  8 32  4  4 
C 80 32  4  2 34  4 
G 80 40  1  4  6  4 
T  0  8 41 21  4  4 
A  0  4  9 13  4  4 
   4  4  4  4  4  4 

I think that 21 in the last matrix is problematic - encoding 16, 4 and 1 (e, c, a), where a indicates a vertical edge is on the optimal path.