elsasserlab/minute

bowtie2 2.5.2 crashes when skipping some reads due to read length

cnluzon opened this issue · 3 comments

I think it is some issue in the recent bowtie2 version:

bowtie2 --reorder -p 2 -x ref/ref1 -1 final/demultiplexed/H3K4m3_2i_CTR_rep1_R1.fastq.gz -2 final/demultiplexed/H3K4m3_2i_CTR_rep1_R2.fastq.gz --fast

the input FASTQ files have the same number of reads, so the complain:

Error, fewer reads in file specified with -2 than in file specified with -1
terminate called after throwing an instance of 'int'
Aborted (core dumped)
(ERR): bowtie2-align exited with value 134

Is not true:

$ zcat final/demultiplexed/H3K4m3_2i_CTR_rep1_R1.fastq.gz | grep "^@" | wc
   1339    4017  109860
$ zcat final/demultiplexed/H3K4m3_2i_CTR_rep1_R2.fastq.gz | grep "^@" | wc
   1339    4017  109860

And it does not crash with same files and bowtie2 2.5.1. So I will check it in more detail and maybe open an issue on bowtie2 repository. I am going to pin the version to 2.5.1 but I opened the issue just to keep an eye on it.

I have reported this on bowtie2 GitHub: BenLangmead/bowtie2#451

The issue comes from the previous read trimming step, so basically empty R2 lines on the FASTQ file will make bowtie2 think there is an uneven number of reads in the R1-R2 FASTQ file pair.

Since this is coming from the trimming, maybe you have something to add @marcelm ? I think it is correct to have those R2 lines empty with no sequence, because one still needs to have the read on the R2.fastq so the IDs and numbers of reads match, so it is probably the FASTQ parsing they are doing, since they updated that in 2.5.2: https://github.com/BenLangmead/bowtie2/releases/tag/v2.5.2 (maybe too tolerant with empty lines?).

Since this is coming from the trimming, maybe you have something to add @marcelm ?

You could add -m 1 (or a higher value) to the Cutadapt command line to filter out empty (or very short) reads. By default, this discards the entire pair if one of the reads is too short. (Cutadapt will always ensure read pairs are kept in sync.)

I agree this is probably a regression in Bowtie2: It should not fail, and in particular, it should not fail with this error message.

$ zcat final/demultiplexed/H3K4m3_2i_CTR_rep1_R1.fastq.gz | grep "^@" | wc
   1339    4017  109860

Just a note regarding this: This could give the wrong result because grep "^@" also counts reads where the first quality values is 31 (encoded as @). Just counting the number of lines with wc -l would be more reliable. It doesn’t give you the number of records directly, but that’s ok here since you only want to show that both files have the same number of records.

I am closing this since it was fixed in later versions of bowtie2.