algbio/themisto

Odd behavior if input multifasta to `build` contains empty sequences

tmaklin opened this issue · 1 comments

Affects at least Themisto v2.1.0.

If the input multi-FASTA file to `themisto build` contains a sequence with no nucleotides and the resulting index is used with the `pseudoalign` command, the pseudoalignment seems to skip all sequences that come after the empty sequence in the FASTA file and reports matches only for the sequences before it. This seems a bit weird to me :)

I think the intuitive behavior in this case would be to either warn about the empty sequences during index building and prune them from the index, or exit the build process with a helpful error instructing the user to fix the issue (which would be my preferred option).
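
In the meantime, a user-side pre-check could catch such records before running `themisto build`. Below is a minimal Python sketch of that idea. It is not part of Themisto and says nothing about how Themisto parses its input; `find_empty_records` is a hypothetical helper name, and the sequences in `example` are shortened stand-ins for the reproduction data.

```python
from typing import Iterable, List


def find_empty_records(lines: Iterable[str]) -> List[str]:
    """Return the headers of FASTA records that contain no sequence characters."""
    empty = []
    header = None  # header of the record currently being read
    length = 0     # number of sequence characters seen for that record
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new header closes the previous record; flag it if it was empty.
            if header is not None and length == 0:
                empty.append(header)
            header = line[1:]
            length = 0
        elif header is not None:
            length += len(line)
    # The last record has no following header, so check it separately.
    if header is not None and length == 0:
        empty.append(header)
    return empty


# Same record layout as the reproduction input (sequences shortened).
example = [
    ">false_match", "C" * 40,
    ">empty_seq",
    ">true_match", "A" * 40,
]
print(find_empty_records(example))
```

On a real file, the same function could be called as `find_empty_records(open("example.fasta"))`, and the build skipped or the file fixed if the returned list is non-empty.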

Reproducible example

Input data

  • example.fasta
>false_match
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>empty_seq
>true_match
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  • example.fastq
>read
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Commands run

themisto build -k 31 -i example.fasta -o index --temp-dir tmp -m 2000 -t 4
themisto pseudoalign -q example.fastq -o example.aln -i index --temp-dir tmp -t 4

Output

  • What is returned:
0 
  • "Expected" output:
0 2 

As of commit aeeecef, Themisto quits with an error message if empty sequences are found in the input files. See #29.