ArtPoon/gotoh2

Unexpected behaviour around terminal gaps

Closed this issue · 14 comments

I was trying to understand the scoring when a read is a small portion of the reference, and I found some strange behaviour in both Gotoh2 and the Gotoh library we use in MiCall. It looks like Gotoh2 is forcing the first nucleotides of each sequence to align, instead of allowing a large gap at the start. Strangely, it's not symmetrical: Gotoh2 doesn't force the last nucleotides to align.

So I think there's a bug with forcing the first nucleotides to align, but I also have an enhancement request to allow terminal gaps without cost. I would like to be able to use a large reference to score a bunch of small reads, and make the length of the reference not affect the score, only the length of the read and the number of matches, substitutions, and internal gaps. Feel free to pull that into a separate issue, or just decline that part if it's not useful for you.

Here's the script I used:

try:
    from gotoh2.aligner import Aligner
except ImportError:
    Aligner = None

try:
    from gotoh import align_it
except ImportError:
    align_it = None


GAP_OPEN_COST = 10
GAP_EXTEND_COST = 3


def align(seq1, seq2, terminal_cost):
    if Aligner is not None:
        aligner = Aligner()
        aligner.gap_open_penalty = GAP_OPEN_COST
        aligner.gap_extend_penalty = GAP_EXTEND_COST
        return aligner.align(seq1, seq2)
    return align_it(seq1,
                    seq2,
                    GAP_OPEN_COST,
                    GAP_EXTEND_COST,
                    terminal_cost)


def display_alignment(seq1, seq2, terminal_cost):
    aseq1, aseq2, score = align(seq1, seq2, terminal_cost)
    print(score)
    print(aseq1)
    print(aseq2)
    print('')


def run_scenario(terminal_cost, is_reversed=False):
    if Aligner is not None:
        title = 'Gotoh2'
    else:
        title = 'Gotoh'
    ref = 'TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT'
    if is_reversed:
        title += ' reversed'
        ref = ''.join(reversed(ref))
    if terminal_cost == 0:
        title += ' without terminal cost'
    print(title)
    display_alignment(ref, ref, terminal_cost)
    display_alignment(ref, ref[1:-1], terminal_cost)
    display_alignment(ref, ref[2:-1], terminal_cost)
    display_alignment(ref, ref[2:-2], terminal_cost)
    display_alignment(ref, ref[10:-10], terminal_cost)
    display_alignment(ref[1:-1], ref[10:-10], terminal_cost)
    display_alignment(ref[2:-1], ref[10:-10], terminal_cost)
    display_alignment(ref[2:-2], ref[10:-10], terminal_cost)



def main():
    run_scenario(terminal_cost=1)
    run_scenario(terminal_cost=1, is_reversed=True)
    run_scenario(terminal_cost=0)


main()

Here are the results for Gotoh2:

Gotoh2
290
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT

254
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
G-CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

246
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
C--ACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

238
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
C--ACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC--

110
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
C---------C-AACAACAATACAAGAAAAAGTATAAGGATAGGACCA----------

116
GCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
CCA---------ACAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

119
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
CC-AA-------CAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

122
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC
C-CAA-------CAACAATACAAGAAAAAGTATAAGGATAGGACCA--------

Gotoh2 reversed
290
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT

254
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
A-CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG-

246
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
C--GAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG-

238
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
C--GAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACAC--

110
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
A-C---------CAGGATAGGAATATGAAAAAGAACATAACAACAACC----------

116
ACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG
ACCA---GGA------TAGGAATATGAAAAAGAACATAACAACAACC---------

119
CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG
A---C-----CAGGATAGGAATATGAAAAAGAACATAACAACAACC---------

122
CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACAC
A---C-----CAGGATAGGAATATGAAAAAGAACATAACAACAACC--------

Gotoh2 without terminal cost
290
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT

254
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
G-CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

246
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
C--ACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

238
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
C--ACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC--

110
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
C---------C-AACAACAATACAAGAAAAAGTATAAGGATAGGACCA----------

116
GCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
CCA---------ACAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

119
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
CC-AA-------CAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

122
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC
C-CAA-------CAACAATACAAGAAAAAGTATAAGGATAGGACCA--------

The original Gotoh has its own quirks. I don't understand why the score goes up from 290 to 293 when the read gets two bases shorter if the terminal cost is zero. I don't understand why shortening the reference sometimes reduces the score when the terminal cost is zero.

Here are the results when I run with the original Gotoh:

Gotoh
290
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT

280
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
-GCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

275
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
--CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

270
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
--CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC--

190
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
----------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA----------

190
GCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
---------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

190
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
--------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

190
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC
--------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA--------

Gotoh reversed
290
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT

280
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
-ACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG-

275
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
--CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG-

270
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
--CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACAC--

190
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
----------ACCAGGATAGGAATATGAAAAAGAACATAACAACAACC----------

190
ACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG
---------ACCAGGATAGGAATATGAAAAAGAACATAACAACAACC---------

190
CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG
--------ACCAGGATAGGAATATGAAAAAGAACATAACAACAACC---------

190
CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACAC
--------ACCAGGATAGGAATATGAAAAAGAACATAACAACAACC--------

Gotoh without terminal cost
290
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT

293
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
-GCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

291
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
--CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

286
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
--CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC--

230
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
----------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA----------

227
GCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
---------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

224
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
--------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

224
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC
--------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA--------

Thanks for identifying and reporting this issue @donkirkby
I've reproduced the problem on my machine and will start looking into the cause. First step will be to develop a minimal reproducing example.

@donkirkby , your code did not actually toggle a terminal cost (gap penalty) in the gotoh2 code. You need to set the class variable is_global to False in order to turn off the terminal cost.
Here is an updated version of your script:

from gotoh2 import Aligner

GAP_OPEN_COST = 10
GAP_EXTEND_COST = 3


def align(seq1, seq2, terminal_cost):
    aligner = Aligner()
    aligner.is_global = False
    aligner.gap_open_penalty = GAP_OPEN_COST
    aligner.gap_extend_penalty = GAP_EXTEND_COST
    return aligner.align(seq1, seq2)


def display_alignment(seq1, seq2, terminal_cost):
    aseq1, aseq2, score = align(seq1, seq2, terminal_cost)
    print(score)
    print(aseq1)
    print(aseq2)
    print('')


def run_scenario(terminal_cost, is_reversed=False):
    title = 'Gotoh2'
    ref = 'TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT'
    if is_reversed:
        title += ' reversed'
        ref = ''.join(reversed(ref))
    if terminal_cost == 0:
        title += ' without terminal cost'
    print(title)
    display_alignment(ref, ref, terminal_cost)
    display_alignment(ref, ref[1:-1], terminal_cost)
    display_alignment(ref, ref[2:-1], terminal_cost)
    display_alignment(ref, ref[2:-2], terminal_cost)
    display_alignment(ref, ref[10:-10], terminal_cost)
    display_alignment(ref[1:-1], ref[10:-10], terminal_cost)
    display_alignment(ref[2:-1], ref[10:-10], terminal_cost)
    display_alignment(ref[2:-2], ref[10:-10], terminal_cost)


def main():
    run_scenario(terminal_cost=1)
    run_scenario(terminal_cost=1, is_reversed=True)
    run_scenario(terminal_cost=0)


main()

and here is the output:

Gotoh2
290
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT

280
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
-GCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

275
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
--CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA-

270
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
--CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC--

190
TGCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCAT
----------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA----------

190
GCACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
---------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

190
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGCA
--------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA---------

190
CACAAGACCCAACAACAATACAAGAAAAAGTATAAGGATAGGACCAGGACAAGC
--------CCAACAACAATACAAGAAAAAGTATAAGGATAGGACCA--------

Gotoh2 reversed
290
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT

280
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
-ACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG-

275
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
--CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG-

270
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
--CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACAC--

190
TACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACGT
----------ACCAGGATAGGAATATGAAAAAGAACATAACAACAACC----------

190
ACGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG
---------ACCAGGATAGGAATATGAAAAAGAACATAACAACAACC---------

190
CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACACG
--------ACCAGGATAGGAATATGAAAAAGAACATAACAACAACC---------

190
CGAACAGGACCAGGATAGGAATATGAAAAAGAACATAACAACAACCCAGAACAC
--------ACCAGGATAGGAATATGAAAAAGAACATAACAACAACC--------

Keeping this issue open because of the following (at version 0.3):

>>> from gotoh2 import Aligner
>>> a = Aligner()
>>> a.is_global
True
>>> a.align('TGCACA', 'CACA')
('TGCACA', 'C--ACA', 8)
>>> a.is_global = False
>>> a.align('CACCA', 'ACCA')
('CACCA', '-ACCA', 20)

Even if we're doing a global alignment (terminal gaps penalized), the C should still be getting lined up with its match in the first sequence.

-- edit --
My gut feeling is that this is expected behaviour of global alignment..

Thanks, @ArtPoon, I didn't know about the is_global attribute. I suggest adding optional parameters to __init__() with documentation of what they do. There's currently some documentation in the align() method, but it's out of date.

@donkirkby done as of commit 85e08c9

Using the following settings:

ref = 'TGCACA'
query = 'CACA'
a.is_global = True
a.gap_open_penalty = 10  # default model is HYPHY_NUC

we get this result:

('TGCACA', 'C--ACA', 8)

If we adjust the gap penalty:

a.gap_open_penalty = 5

then we get:

('TGCACA', '--CACA', 13)

Let's look at the cost matrices. For a penalty of 10, we have:

 0 11 12 13 14 
11  4 15 16 17 
12 15  8 19 20 
13  7 18  3 14 
14 17  2 13 -2 
15  9 13 -3  8 
16 19  4  8 -8 

and for 5 we have:

 0  6  7  8  9 
 6  4 10 11 12 
 7 10  8 14 15 
 8  2  8  3  9 
 9  8 -3  3 -2 
10  4  3 -8 -2 
11 10 -1 -2 -13

Here's a simpler test case:

    def test_issue14(self):
        ref = 'GCA'
        query = 'CA'
        self.g2.is_global = True
        self.g2.gap_open_penalty = 10
        self.g2.gap_extend_penalty = 1
        self.g2.set_model('HYPHY_NUC')
        result = self.g2.align(ref, query)
        expected = ('GCA', '-CA', -1)
        self.assertEqual(expected, result)

with the following result:

======================================================================
FAIL: test_issue14 (test.TestIssues)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/art/git/gotoh2/tests/test.py", line 260, in test_issue14
    self.assertEqual(expected, result)
AssertionError: Tuples differ: ('GCA', '-CA', -1) != ('GCA', 'C-A', -1)

First differing element 1:
'-CA'
'C-A'

- ('GCA', '-CA', -1)
?           -

+ ('GCA', 'C-A', -1)
?          +

The cost matrix (R) is

C A
0 11 12
G 11 4 15
C 12 6 8
A 13 16 1

P matrix:

2147483647 2147483647 2147483647 
11 22 23 
12 15 24 
13 16 19 

Q matrix:

2147483647 11 12 
2147483647 22 15 
2147483647 23 17 
2147483647 24 25 

bits matrix:

0 16 16 
64 84 14 
64 76 20 
64 37 4 

Starting to work this through manually. Looks like the bits matrix is way off...

P, Q, and R matrices all checked out from manual calculation. My bit matrix is:

e,g b,f b,e
a,d,e,g c,e,g d,b
a,d,g c,d,g c,e
a,g a,c,f c

which converts to:

80 24 18
89 84 10
73 76 20
65 37 4

Output of all bit assignments:

e,g b,e,f b,e
a,d,g c,e,g b,c,d
a,d,g c,d,g c,e
a,g a,c,f c
row col to row to col bit comment
0 1 0 0 g
0 1 0 1 b
0 2 0 1 f
0 2 0 2 b
1 0 0 0 e
1 0 1 0 a
1 1 0 1 e Missed this one in manual calculation
1 1 1 0 g
1 1 1 1 c
1 2 0 2 e
1 2 1 1 g
1 2 1 2 b
1 2 1 2 c Missed also, 11 + 4 (mismatch) to 15
2 0 1 0 d
2 0 2 0 a
2 1 1 1 e
2 1 2 0 g
2 1 2 1 c
2 2 1 2 d
2 2 2 1 g
2 2 2 2 c
3 0 2 0 d
3 0 3 0 a
3 1 2 1 d
3 1 3 0 g
3 1 3 1 a
3 1 3 1 c
3 2 2 2 e
3 2 3 1 f
3 2 3 2 c
  • note extra e bit at (1,1) in manual calculation was incorrect, should have been assigned to (0,0)
  • cost assignment checks out. Need to check edge assignments next.

Checking edge assignment. Final bits matrix after edge assignments is:

0 2 34 
1 4 6 
9 21 4 
9 13 4 

which corresponds to:

b b,f
a c b,c
a,d a,c,e c
a,d a,c,d c

The highlighted a is causing the misalignment upon traceback.

Urgh. Didn't read the paper correctly - bit matrices have dimensions increased by one relative to other matrices.

Fixed with commit 6b72df9