Incorrect number of read alignments returned

Question

Incorrect number of read alignments returned

harry-thorpe opened this issue 6 months ago · 12 comments

I have just updated to v3.2.1 and am experiencing a strange error, where the forward reads return the correct number of alignments, but the reverse reads don't. The problem only occurs when gzipping and sorting the output file, and the number of alignments returned for the reverse reads varies each time I run the themisto command.
I'm happy to provide input files so you can reproduce the problem, but these are close to 100mb so not sure if GitHub is the best place to do that.
I have illustrated the problem below:

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_1.fastq.gz --outfile ali_1.aln --rc --temp-dir tmp --n-threads 4 --sort-output --gzip-output
zcat ali_1.aln.gz | wc -l
253658 -- correct number of output alignments

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 4 --sort-output --gzip-output
zcat ali_2.aln.gz | wc -l
223647 -- incorrect number of output alignments. The combination of gzipped and sorted output produces the error with these input reads

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_1.fastq.gz --outfile ali_1.aln --rc --temp-dir tmp --n-threads 4 --gzip-output
zcat ali_1.aln.gz | wc -l
253658

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 4 --gzip-output
zcat ali_2.aln.gz | wc -l
253658

Answer 1 · 2024-04-29T14:12:52.000Z

It also works with sorted output, but no gzipping:

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 4 --sort-output
cat ali_2.aln | wc -l
253658

Answer 2 · 2024-04-29T14:21:08.000Z

Thanks for reporting this. I will investigate with high priority.

Answer 3 · 2024-04-29T14:24:37.000Z

Thanks, I just updated my original comment - the problem appears to be the combination of sorting and gzipping - it works fine with either sorting or gzipping, but not both. And only for one input read file, which happens to be the reverse (_2) file. Perhaps the typically worse quality reverse reads are more likely to induce an edge case (more variable characters on the read quality lines)?

Answer 4 · 2024-04-29T14:26:38.000Z

Meanwhile, could you check on your data if this error occurs if you limit the number of threads to 1?

Answer 5 · 2024-04-29T14:33:51.000Z

Yep, still occurs:

themisto pseudoalign --index-prefix ref/K_gri/ref_idx/ref_idx --query-file K_gri_2.fastq.gz --outfile ali_2.aln --rc --temp-dir tmp --n-threads 1 --sort-output-lines --gzip-output

zcat ali_2.aln.gz | wc -l
253645

I checked with 1 thread for gzipping and sorting separately too - no error

Answer 6 · 2024-04-30T11:25:15.000Z

I could not find a reason for why this might happen. Sound like it might be memory corruption since the answer varies between runs. Could you send the input files somehow? For example, Google drive, Dropbox, or similar?

Answer 7 · 2024-04-30T11:38:48.000Z

Yep - here:
https://drive.google.com/drive/u/5/folders/1X7K7M6K8qN4_o6ntGvSrtlBNuK01VYAp

Answer 8 · 2024-04-30T19:43:11.000Z

Thank you for the files! I was able to reproduce this and found the bug. It turns out that the output flush function in the zstr library is a lie, and it doesn't actually flush anything. This resulted in the output sometimes not being fully written to disk before sorting, which meant that some of the last lines of the output sometimes went missing. The number of missing lines was nondeterministic probably because with parallelism, the buffer flushing can happen at unpredictable times, leading a different number of missing lines each time.

It's now fixed in this commit: a0e0698

This fix will be included in the binaries for Themisto v3.2.2, which I hope to release today, or tomorrow the latest.

Answer 9 · 2024-05-01T14:39:57.000Z

Version 3.2.2 is out, fixing this bug: https://github.com/algbio/themisto/releases/tag/v3.2.2.

Answer 10 · 2024-05-02T06:48:24.000Z

Amazing, thanks! Will try it today :)

Answer 11 · 2024-05-07T09:07:19.000Z

Thanks, works fine after the update!

Answer 12 · 2024-06-10T15:51:02.000Z

I just noticed I had accidentally compiled the v3.2.2 release binary with max k = 255. It still works, but on the most common use case of k = 31, during index construction it uses up to 8 times more disk and is up to 8 times slower than with the usual max k = 31 setting. I recompiled with k = 31 and updated the binary. If you're running construction on that binary, and are finding it slow, I recommend using the updated binary attached to the release.