algbio/themisto

Incorrect number of read alignments returned

harry-thorpe opened this issue · 12 comments

I have just updated to v3.2.1 and am experiencing a strange error, where the forward reads return the correct number of alignments, but the reverse reads don't. The problem only occurs when gzipping and sorting the output file, and the number of alignments returned for the reverse reads varies each time I run the themisto command.
I'm happy to provide input files so you can reproduce the problem, but these are close to 100mb so not sure if GitHub is the best place to do that.
I have illustrated the problem below:

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_1.fastq.gz --outfile ali_1.aln --rc --temp-dir tmp --n-threads 4 --sort-output --gzip-output
zcat ali_1.aln.gz | wc -l
253658 -- correct number of output alignments

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 4 --sort-output --gzip-output
zcat ali_2.aln.gz | wc -l
223647 -- incorrect number of output alignments. The combination of gzipped and sorted output produces the error with these input reads

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_1.fastq.gz --outfile ali_1.aln --rc --temp-dir tmp --n-threads 4 --gzip-output
zcat ali_1.aln.gz | wc -l
253658

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 4 --gzip-output
zcat ali_2.aln.gz | wc -l
253658

It also works with sorted output, but no gzipping:

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 4 --sort-output
cat ali_2.aln | wc -l
253658

Thanks for reporting this. I will investigate with high priority.

Thanks, I just updated my original comment - the problem appears to be the combination of sorting and gzipping - it works fine with either sorting or gzipping, but not both. And only for one input read file, which happens to be the reverse (_2) file. Perhaps the typically worse quality reverse reads are more likely to induce an edge case (more variable characters on the read quality lines)?

Meanwhile, could you check on your data if this error occurs if you limit the number of threads to 1?

Yep, still occurs:

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 1 --sort-output-lines --gzip-output

zcat ali_2.aln.gz | wc -l
253645

I checked with 1 thread for gzipping and sorting separately too - no error

I could not find a reason for why this might happen. Sound like it might be memory corruption since the answer varies between runs. Could you send the input files somehow? For example, Google drive, Dropbox, or similar?

Thank you for the files! I was able to reproduce this and found the bug. It turns out that the output flush function in the zstr library is a lie, and it doesn't actually flush anything. This resulted in the output sometimes not being fully written to disk before sorting, which meant that some of the last lines of the output sometimes went missing. The number of missing lines was nondeterministic probably because with parallelism, the buffer flushing can happen at unpredictable times, leading a different number of missing lines each time.

It's now fixed in this commit: a0e0698

This fix will be included in the binaries for Themisto v3.2.2, which I hope to release today, or tomorrow the latest.

Version 3.2.2 is out, fixing this bug: https://github.com/algbio/themisto/releases/tag/v3.2.2.

Amazing, thanks! Will try it today :)

Thanks, works fine after the update!

I just noticed I had accidentally compiled the v3.2.2 release binary with max k = 255. It still works, but on the most common use case of k = 31, during index construction it uses up to 8 times more disk and is up to 8 times slower than with the usual max k = 31 setting. I recompiled with k = 31 and updated the binary. If you're running construction on that binary, and are finding it slow, I recommend using the updated binary attached to the release.