jeffdaily/parasail

Question about gap open penalty for different vectorizations.

jvkersch opened this issue · 8 comments

The following came up while trying to debug why Parasail would return different results on different machines (more specifically, a fairly recent laptop (i7) and a Travis VM). I narrowed things down to one version of the code dispatching to the AVX2 implementation, and another one to the SSE4 implementation. Surprisingly, while the remainder of the code is exactly the same, the two different vectorizations give different results. There's probably something that I've overlooked, but what could cause this?

The code snippet below reproduces the problem for me (with Parasail master). On my machine, it prints the following:

B                1 AAAAAAAAAA---AAAAAAAAAA      20
                   ||||||||||   ||||||||||
A                1 AAAAAAAAAATTTAAAAAAAAAA      23

Length: 23
Identity:        20/23 (87.0%)
Similarity:      20/23 (87.0%)
Gaps:             3/23 (13.0%)
Score: 97

B                1 AAAAAAAAAA-A-AAAAAAAAA-      20
                   |||||||||| - |||||||||
A                1 AAAAAAAAAATTTAAAAAAAAAA      23

Length: 23
Identity:        19/23 (82.6%)
Similarity:      19/23 (82.6%)
Gaps:             3/23 (13.0%)
Score: 89

Note how the alignments, the score, and the number of matches are different.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "parasail.h"
#include "parasail/matrices/nuc44.h"

int main()
{
    parasail_result_t* result;
    
    const char *s = "AAAAAAAAAATTTAAAAAAAAAA";
    const char *t = "AAAAAAAAAAAAAAAAAAAA";

    /* SSE4 alignment */
    result = parasail_sg_trace_striped_sse41_128_16(
        s, strlen(s), t, strlen(t), 1, 4, &parasail_nuc44);

    parasail_traceback_generic(
        s, strlen(s), t, strlen(t),
        "A", "B", &parasail_nuc44,
        result, '|', '+', '-', 79, 10, 1);

    /* AVX2 alignment */
    result = parasail_sg_trace_striped_avx2_256_16(
        s, strlen(s), t, strlen(t), 1, 4, &parasail_nuc44);

    parasail_traceback_generic(
        s, strlen(s), t, strlen(t),
        "A", "B", &parasail_nuc44,
        result, '|', '+', '-', 79, 10, 1);
    
    return 0;
}

I have confirmed the behavior but haven't looked at it in any detail yet. I don't believe I regularly verified where the gap open penalty was smaller than the gap extend penalty as in your case here. These are the gap {open,extend} penalties I regularly tested:

  • {10,1},
  • {10,2},
  • {14,2},
  • {40,2},

If you're building with the configure script and Makefile, you can run make check or make test to build all the tests. Running ./tests/test_verify -f ../data/test_64.fasta -m nuc44 -o 1 -e 4 will evaluate all vector implementations and print an error whenever there is a mismatch against the non-vectorized routine.

test_64.fasta
> one
AAAAAAAAAATTTAAAAAAAAAA
> two
AAAAAAAAAAAAAAAAAAAA

Is it common for the gap open penalty to be smaller than the gap extend penalty?

One more thing. Can you provide the correct score and traceback for these particular settings using a different tool so that I might verify it against parasail?

@jeffdaily Many thanks for the quick reply! This isn't a serious issue for us, it arose as I was putting together some tests for Cython wrappers that I had written, and the case where the gap open penalty was larger than the gap extend came up as a corner case. Still, it would be good to understand this a little deeper.

I ran the tests that you proposed: make test passes and test_verify gives the following

$ ./tests/test_verify -f ./data/test_64.fasta -m nuc44 -o 1 -e 4
2 choose 2 is 1
checking parasail_nw_sse2 functions
parasail_nw_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_nw_scan_sse2_128_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_sse2_128_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_sse2_128_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_sse2_128_8(0,1,1,4,nuc44) wrong score (97!=94)
checking parasail_sg_sse2 functions
parasail_sg_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_sg_scan_sse2_128_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_sse2_128_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_sse2_128_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_sse2_128_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_striped_sse2_128_8(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sg_striped_sse2_128_8(0,1,1,4,nuc44) wrong end_query (22!=21)
checking parasail_sw_sse2 functions
parasail_sw_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_sw_scan_sse2_128_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_sse2_128_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_sse2_128_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_sse2_128_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_striped_sse2_128_8(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sw_striped_sse2_128_8(0,1,1,4,nuc44) wrong end_query (22!=21)
checking parasail_nw_sse41 functions
parasail_nw_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_nw_scan_sse41_128_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_sse41_128_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_sse41_128_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_sse41_128_8(0,1,1,4,nuc44) wrong score (97!=94)
checking parasail_sg_sse41 functions
parasail_sg_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_sg_scan_sse41_128_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_sse41_128_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_sse41_128_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_sse41_128_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_striped_sse41_128_8(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sg_striped_sse41_128_8(0,1,1,4,nuc44) wrong end_query (22!=21)
checking parasail_sw_sse41 functions
parasail_sw_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_sw_scan_sse41_128_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_sse41_128_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_sse41_128_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_sse41_128_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_striped_sse41_128_8(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sw_striped_sse41_128_8(0,1,1,4,nuc44) wrong end_query (22!=21)
checking parasail_nw_avx2 functions
parasail_nw_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_nw_scan_avx2_256_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_avx2_256_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_avx2_256_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_avx2_256_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_striped_avx2_256_16(0,1,1,4,nuc44) wrong score (97!=88)
checking parasail_sg_avx2 functions
parasail_sg_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_sg_scan_avx2_256_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_avx2_256_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_avx2_256_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_avx2_256_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_striped_avx2_256_16(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sg_striped_avx2_256_16(0,1,1,4,nuc44) wrong end_query (22!=21)
parasail_sg_striped_avx2_256_8(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sg_striped_avx2_256_8(0,1,1,4,nuc44) wrong end_query (22!=21)
checking parasail_sw_avx2 functions
parasail_sw_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_sw_scan_avx2_256_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_avx2_256_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_avx2_256_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_avx2_256_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_striped_avx2_256_16(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sw_striped_avx2_256_16(0,1,1,4,nuc44) wrong end_query (22!=21)
parasail_sw_striped_avx2_256_8(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sw_striped_avx2_256_8(0,1,1,4,nuc44) wrong end_query (22!=21)
checking parasail_nw_disp functions
parasail_nw_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_nw_scan_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_scan_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_striped_16(0,1,1,4,nuc44) wrong score (97!=88)
parasail_nw_scan_sat(0,1,1,4,nuc44) wrong score (97!=94)
parasail_nw_striped_sat(0,1,1,4,nuc44) wrong score (97!=88)
checking parasail_sg_disp functions
parasail_sg_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_sg_scan_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_scan_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_striped_16(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sg_striped_16(0,1,1,4,nuc44) wrong end_query (22!=21)
parasail_sg_striped_8(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sg_striped_8(0,1,1,4,nuc44) wrong end_query (22!=21)
parasail_sg_scan_sat(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sg_striped_sat(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sg_striped_sat(0,1,1,4,nuc44) wrong end_query (22!=21)
checking parasail_sw_disp functions
parasail_sw_scan(0,1,1,4,nuc44) wrong score (97!=91)
parasail_sw_scan_64(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_32(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_16(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_scan_8(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_striped_16(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sw_striped_16(0,1,1,4,nuc44) wrong end_query (22!=21)
parasail_sw_striped_8(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sw_striped_8(0,1,1,4,nuc44) wrong end_query (22!=21)
parasail_sw_scan_sat(0,1,1,4,nuc44) wrong score (97!=94)
parasail_sw_striped_sat(0,1,1,4,nuc44) wrong score (97!=89)
parasail_sw_striped_sat(0,1,1,4,nuc44) wrong end_query (22!=21)

I will look for an external tool that can produce the "correct" alignment.

Side question(s): Did you use cython to wrap the entire parasail C API or perhaps just a handful of the ones you need? When I started the https://github.com/jeffdaily/parasail-python repo, I initially used cython. But I had little experience with cython on windows, so I switched to ctypes. Does parasail-python not meet your needs such that you needed your own cython bindings?

To answer your side question: we already had some Cython code (independent of Parasail) so it was easier to continue along that road than to add a new dependency. Starting from scratch I'd probably use parasail-python. In fact, I started making a Conda package for parasail-python, but I got bogged down with build issues (with the Conda build system, not with parasail-python).

A gap open penalty that is smaller than the gap extend penalty is not supported. I will note this in the readme. I appreciate the question/discussion! Thanks for being a user.

Thanks @jeffdaily ! Congrats on the new release.