Odd behavior if input multifasta to `build` contains empty sequences
tmaklin opened this issue · 1 comments
tmaklin commented
Affects at least Themisto v2.1.0.
If the input multifasta file to `themisto build` contains a sequence with no nucleotides and the built index is used in the `pseudoalign` command, then the pseudoalignment seems to skip all sequences that come after the empty sequence in the fasta file and reports only matches in the ones that came before it. This seems a bit weird to me :)
I think the intuitive behavior in this case would be to either warn about the empty sequences during index building and prune them from the index, or exit the build process with a helpful error instructing the user to fix the issue (which would be my preferred option).
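In the meantime, a user-side check along these lines could flag empty records before running the build. This is only a rough sketch and not part of Themisto: the function name `find_empty_records` is made up for illustration, and it assumes a plain-text FASTA where every record starts with a `>` header line.

```python
def find_empty_records(lines):
    """Return the headers of FASTA records that contain no sequence characters.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Hypothetical helper for illustration; not part of Themisto.
    """
    empty = []
    header = None   # header of the record currently being read
    length = 0      # number of sequence characters seen for that record
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record; report it if it was empty.
            if header is not None and length == 0:
                empty.append(header)
            header = line[1:]
            length = 0
        elif header is not None:
            length += len(line)
    # The last record is not followed by another header, so check it here.
    if header is not None and length == 0:
        empty.append(header)
    return empty


# The example.fasta from this issue, as in-memory lines.
example = [
    ">false_match",
    "C" * 80,
    ">empty_seq",
    ">true_match",
    "A" * 80,
]
print(find_empty_records(example))  # prints ['empty_seq']
```

To check a file on disk, pass an open handle, e.g. `find_empty_records(open("example.fasta"))`, and stop before `themisto build` if the returned list is not empty.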
Reproducible example
Input data
- example.fasta
>false_match
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>empty_seq
>true_match
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
- example.fastq
>read
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
Commands run
themisto build -k 31 -i example.fasta -o index --temp-dir tmp -m 2000 -t 4
themisto pseudoalign -q example.fastq -o example.aln -i index --temp-dir tmp -t 4
Output
- What is returned:
0
- "Expected" output:
0 2