yangao07/abPOA

progressive mode does not work with seeding

Closed this issue · 12 comments

There seems to be a conflict between -p and -S, where abpoa exits 1 and prints no output if the two options are used together. Could you please clarify if this is a bug or intended behaviour? If it's the latter, I think it would be good to add an informative error message.

Example. t.fa

1>
AGGTTCTATGGCTATACAAATTCAGCCGCCCATGTATGAAGAGAGTTTCTTAGTTATGCAGTAGATAGCTCAATCATCAGCAGTTTTATTGCAGCACACAACATCTATCAAAGGGTATCCTCTTCTTGCCTAGAGATTAGCACTAGTGGCATGATTATGACTTGCTATGCATACTTTCGTGGGTAATCATTTTTAAATTCCTGATTTCCTTTACATTCAG
>2
AGGTTCTATATTAGTCATGTTATACAAAATCATTTGCCCATGTATTATCAGAATCTCTTAGTTATGCAGTAGATGTAGATAGCTTGATCATTGGCAGTTTTGTTGCAGTATACAACAGATGATTACATCATAAATGGTCTCTGAGGTAAGTGTATCTGGTAATCTTGTATCAAAGGGTATCCTCTTTTTGCCTAGACATTAGCGCCAGTGGCATCATCATGATGCTATGCTACTTTCATTGCAGAGAGCCGGGGTAATCGTTTCTGAATTCCTGAGTTCCTTTGCATTCAG

works fine with all combinations of -p or -S by themselves:

abpoa t.fa -m 0 -r 1
[main] CMD:  abpoa -m 0 -r 1 t.fa
>Multiple_sequence_alignment
AGGTTCTAT----G----GCTATACAAATTCAGCCGCCCATGTATGAAGAGAGTTTCTTAGTTATGCA------GTAGATAGCTCAATCATCAGCAGTTTTATTGCAGCACACAACA------------------------------------------------TCTATCAAAGGGTATCCTCTTCTTGCCTAGAGATTAGCACTAGTGGCATGATTATGACTTGCTATGCATACTTTCGT------------GGGTAATCATTTTTAAATTCCTGATTTCCTTTACATTCAG
AGGTTCTATATTAGTCATGTTATACAAAATCATTTGCCCATGTATTATCAGAATCTCTTAGTTATGCAGTAGATGTAGATAGCTTGATCATTGGCAGTTTTGTTGCAGTATACAACAGATGATTACATCATAAATGGTCTCTGAGGTAAGTGTATCTGGTAATCTTGTATCAAAGGGTATCCTCTTTTTGCCTAGACATTAGCGCCAGTGGCATCATCATGA--TGCTATGC-TACTTTCATTGCAGAGAGCCGGGGTAATCGTTTCTGAATTCCTGAGTTCCTTTGCATTCAG
[abpoa_main] Real time: 0.001 sec; CPU: 0.001 sec; Peak RSS: 0.003 GB.
abpoa t.fa -m 0 -r 1 -S
[main] CMD:  abpoa -m 0 -r 1 -S t.fa
>Multiple_sequence_alignment
AGGTTCTAT----G----GCTATACAAATTCAGCCGCCCATGTATGAAGAGAGTTTCTTAGTTATGCA------GTAGATAGCTCAATCATCAGCAGTTTTATTGCAGCACACAACA------------------------------------------------TCTATCAAAGGGTATCCTCTTCTTGCCTAGAGATTAGCACTAGTGGCATGATTATGACTTGCTATGCATACTTTCGT------------GGGTAATCATTTTTAAATTCCTGATTTCCTTTACATTCAG
AGGTTCTATATTAGTCATGTTATACAAAATCATTTGCCCATGTATTATCAGAATCTCTTAGTTATGCAGTAGATGTAGATAGCTTGATCATTGGCAGTTTTGTTGCAGTATACAACAGATGATTACATCATAAATGGTCTCTGAGGTAAGTGTATCTGGTAATCTTGTATCAAAGGGTATCCTCTTTTTGCCTAGACATTAGCGCCAGTGGCATCATCATGA--TGCTATGC-TACTTTCATTGCAGAGAGCCGGGGTAATCGTTTCTGAATTCCTGAGTTCCTTTGCATTCAG
[abpoa_main] Real time: 0.002 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.
abpoa t.fa -m 0 -r 1 -p
[main] CMD:  abpoa -m 0 -r 1 -p t.fa
>Multiple_sequence_alignment
AGGTTCTAT----G----GCTATACAAATTCAGCCGCCCATGTATGAAGAGAGTTTCTTAGTTATGCA------GTAGATAGCTCAATCATCAGCAGTTTTATTGCAGCACACAACA------------------------------------------------TCTATCAAAGGGTATCCTCTTCTTGCCTAGAGATTAGCACTAGTGGCATGATTATGACTTGCTATGCATACTTTCGT------------GGGTAATCATTTTTAAATTCCTGATTTCCTTTACATTCAG
AGGTTCTATATTAGTCATGTTATACAAAATCATTTGCCCATGTATTATCAGAATCTCTTAGTTATGCAGTAGATGTAGATAGCTTGATCATTGGCAGTTTTGTTGCAGTATACAACAGATGATTACATCATAAATGGTCTCTGAGGTAAGTGTATCTGGTAATCTTGTATCAAAGGGTATCCTCTTTTTGCCTAGACATTAGCGCCAGTGGCATCATCATGA--TGCTATGC-TACTTTCATTGCAGAGAGCCGGGGTAATCGTTTCTGAATTCCTGAGTTCCTTTGCATTCAG
[abpoa_main] Real time: 0.002 sec; CPU: 0.006 sec; Peak RSS: 0.003 GB.

but not with -p and -S together:

abpoa t.fa -m 0 -r 1 -p -S
[main] CMD:  abpoa -m 0 -r 1 -p -S t.fa
[abpoa_add_graph_sequence] seq_l: 0	start: 0	end: 0.
echo $?
1

Hi Glenn,

Just find out that this is a bug in the progressive mode, fixed now.

Yan

Thanks for your fast response! Here's a small example that still fails after your fix, though:

u.fa

>0
CTGGGCTAGT
>1
CTCAGCAAGT
>2
CTTGGATTTC

A segfault this time but once again, it only fails with both -p -S

abpoa -r 1 -m 0 u.fa -p -S
[main] CMD:  abpoa -r 1 -m 0 -p -S u.fa
Segmentation fault (core dumped)

abpoa -r 1 -m 0 u.fa -p
[main] CMD:  abpoa -r 1 -m 0 -p u.fa
>Multiple_sequence_alignment
CTGGGCTAGT
CTCAGCAAGT
CTTGGATTTC
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.

abpoa -r 1 -m 0 u.fa -S
[main] CMD:  abpoa -r 1 -m 0 -S u.fa
>Multiple_sequence_alignment
CTGGGCTAGT
CTCAGCAAGT
CTTGGATTTC
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.

abpoa -r 1 -m 0 u.fa
[main] CMD:  abpoa -r 1 -m 0 u.fa
>Multiple_sequence_alignment
CTGGGCTAGT
CTCAGCAAGT
CTTGGATTTC
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.

This is another bug.
These two bugs are because of no minimizer or no minimizer hit existing.

Thanks for catching them.

Yan

Thanks again! Here's another one:

v.fa

>0
TAGTAGTCCA
>1
TAGGACGATTTAGGAATCTA
>2
CAGGAATCTT

Works without -s -P

abpoa v.fa -m 0 -r 1
[main] CMD:  abpoa -m 0 -r 1 v.fa
>Multiple_sequence_alignment
TAG-------TAG---TCCA
TAGGACGATTTAGGAATCTA
----------CAGGAATCTT
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.

But last row disappears with -S -p

abpoa v.fa -m 0 -r 1 -S -p
[main] CMD:  abpoa -m 0 -r 1 -S -p v.fa
>Multiple_sequence_alignment
TAG-------TAG---TCCA
TAGGACGATTTAGGAATCTA
--------------------
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.

Another silly bug, fixed now. Thanks, Glenn!

Also one more thing, the progressive mode will work only if the seeding mode is enabled because they are using the same set of minimizers.
Otherwise, abPOA skips the progressive tree building if the seeding mode is disabled.

Right now I think separating them makes more sense, working on that now.

thanks for the clarification. here's another case

w.fa

>0
AGGA
>1
AGAA
>2
AGGA
abpoa w.fa -m 0 -r 1 -S 
[main] CMD:  abpoa -m 0 -r 1 -S w.fa
>Multiple_sequence_alignment
AGGA
AGAA
AGGA
[abpoa_main] Real time: 0.001 sec; CPU: 0.005 sec; Peak RSS: 0.003 GB.
abpoa w.fa -m 0 -r 1 -S -p
[main] CMD:  abpoa -m 0 -r 1 -S -p w.fa
>Multiple_sequence_alignment
AGGA
----
----
[abpoa_main] Real time: 0.001 sec; CPU: 0.004 sec; Peak RSS: 0.003 GB.

Kind of left this bug there in the last commit by mistake.
I think we are good this time. :)

Yan

Thanks, this version gets through cactus's integration tests except for one edge case -- it doesn't work always work on 1-sequence cases. Easy enough to work around, but perhaps you still want to fix?

w.fa

>0
AAAAACGTACCACGATACACAGATATAGAACCCCCCCCCCCCCCCCCTAGAGAG
abpoa -m 0 -r 1 w.fa
[main] CMD:  abpoa -m 0 -r 1 w.fa
>Multiple_sequence_alignment
AAAAACGTACCACGATACACAGATATAGAACCCCCCCCCCCCCCCCCTAGAGAG
[abpoa_main] Real time: 0.000 sec; CPU: 0.004 sec; Peak RSS: 0.004 GB.
abpoa -m 0 -r 1 w.fa -p
[main] CMD:  abpoa -m 0 -r 1 -p w.fa
[abpoa_add_graph_sequence] seq_l: 0	start: 0	end: 0.

Right, just fixed it.
Rare case, but should better be fixed.

It is still sometimes removing sequences in the output...

wget http://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_feb24.fa

abpoa abpoa_fail_feb24.fa -m 0 -r 1 | tail -1 | sed -e 's/-//g'
[main] CMD:  abpoa -m 0 -r 1 abpoa_fail_feb24.fa
[abpoa_main] Real time: 0.040 sec; CPU: 0.044 sec; Peak RSS: 0.025 GB.
GTTTGGAGACAGGTAAAATCAGGGTTAGAATAGGGTAGTGTTAGGGTAGTGTGTGGGTGTGGGTGTGTGGGTGTGGTGTGTGGGTGTGGTGTGTGTGGGTGTGGTGTGTGGGTGGGGGGGGGGGGGGGGGGGGGGGGGGGTGTGGGGTGTGGGTGTGGGTGTGGGTGTGGGTGTGGTGTGGGTGTGGTGTGTGGGTGTGG

# last sequence disappears with -p
abpoa abpoa_fail_feb24.fa -m 0 -r 1 -p | tail -1 | sed -e 's/-//g'
[main] CMD:  abpoa -m 0 -r 1 -p abpoa_fail_feb24.fa
[abpoa_main] Real time: 0.017 sec; CPU: 0.021 sec; Peak RSS: 0.027 GB.

This is actually a fatal bug that may result in wrong progressive tree orders.
I have to admit that I haven't tested the progressive mode on enough datasets.
Thanks so much, Glenn.

Thanks! At least the bugs were very easy to find -- better it fail completely than seem to work but do the wrong clustering. Anyway, the current version seems to be working, but I will still try it on a much bigger dataset than I have so far...