Final characters are sometimes missing in alignment when one sequence is a prefix of another
Closed this issue · 2 comments
I'm not sure if this is expected behavior or not, but when one of the sequences consists of a prefix of another, the final few characters are sometimes left out of the prefix's alignment. It seems to be a variable number of missing characters, depending on the sequence. Here's a small reproducing example (based on your example).
int main() {
// Initialize array
char* inseq[] = {
"ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTGCATGCATGCATGCATGCATGCATGCA",
"ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTGCATGCATGCATGCATGCATGCATGCA",
"ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTTGCATGCATGCATGCATGCATGCATGCA",
"ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
};
int numseq = 4;
int L[] = {76, 76, 76, 48};
char** aln = NULL;
int aln_len = 0;
kalign(inseq, L, numseq, 4 , KALIGN_TYPE_DNA_INTERNAL, -1, -1 , -1, &aln, &aln_len);
printf("Aligned:\n");
for(int i = 0; i < numseq; i++){
printf("%s\n", aln[i]);
}
for (int i = 0; i < numseq; ++i) {
int num_aligned = 0;
for (int j = 0; j < aln_len; ++j) {
if (aln[i][j] != '-' && aln[i][j] != '\0') {
num_aligned++;
}
}
if (num_aligned != L[i]) {
fprintf(stderr, "error: sequence %d is not fully aligned\n", i);
fprintf(stderr, "%s\n", inseq[i]);
fprintf(stderr, "%s\n", aln[i]);
}
}
/* Free alignment */
for(int i = 0; i < numseq;i++){
free(aln[i]);
}
free(aln);
}
Hi Jordan,
Thanks for reporting! This was a bug in the library code. In brief, kalign maintains an array for sequence characters and an array to record where the gaps are to be placed. In the final step, information from these two arrays is combined to create the aligned sequences. The library code ran this step twice...
The problem should be fixed in the latest commit.
Thanks for the quick response!