Fastq preprocessing causing hisat2 error
DAWells opened this issue · 5 comments
Thanks for making RNAflow!
I'm getting an error from hisat2 about an incorrectly formatted fastq file. Looking at the fastq file, the cause is a read that has two lines of quality information, i.e. a single read with 5 lines instead of 4. This seems to be caused by SortMeRNA, as the raw fastq file is correctly formatted and is still correct after fastp trimming.
I'm working with paired SRA data; the issue occurs in the R2 reads from SRR4289730.
Is there a fix other than skipping the SortMeRNA step or manually correcting the offending read?
Here's the full error message:
Error executing process > 'preprocess_illumina:hisat2 (Pt31)'
Caused by: Missing output file(s) `Pt31_summary.log` expected by process `preprocess_illumina:hisat2 (Pt31)`
Command executed:
hisat2 -x reference -1 Pt31.R1.other.fastq.gz -2 Pt31.R2.other.fastq.gz -p 20 --new-summary --summary-file Pt31_summary.log | samtools view -bS | samtools sort -o Pt31.sorted.bam -T tmp --threads 20
Command exit status:
0
Command output:
(empty)
Command error:
Error: Read BBFFFFFFFFFFIIIIIIIIIIIIIFIIIIIIIIIIFFFBBFFFFFBF7BBFFBBBFBBBFFFFBBFFFBFBBB<BBFFBBFBFFFFFFB7BBFFFFF has more read characters than quality values.
terminate called after throwing an instance of 'int'
Aborted (core dumped)
(ERR): hisat2-align exited with value 134
[bam_sort_core] merging from 60 files and 20 in-memory blocks...
Work dir:
/data/vaccitech/hugo_rnaseq/work/e2/2cd95117732737dde868883d150ab8
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
This is the offending section of the fastq file
@SRR4289730.80811216_2
CTCATGCAGTTTACATTCATTTCTTCCACAGAGAAACCACGGGAAGCTTGTTTTGACCCAGGAAATATAATGAATGGGACAAGAGTTGGAACAGACTTC
+SRR4289730.80811216 80811216 length=99
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFF<BFFFFFFFFFFFF
BBBFFFFFFFFFFIIIIIIIIIIIIIFIIIIIIIIIIFFFBBFFFFFBF7BBFFBBBFBBBFFFFBBFFFBFBBB<BBFFBBFBFFFFFFB7BBFFFFF
@SRR4289730.80811217_2
GGGTTCATGCCATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGCGCCCGCCACCACGCCCAGCTAATTTCTTTTTGTATTTTTTTTAGTAA
+SRR4289730.80811217 80811217 length=99
BBBFFFFFFFFFFIIIIIFFFFBFFBFFF07BFFFIIIIIIBFIFIIFFIFIIIIFFFFFIFFBBFFFFIF<FFFBFFFIIIFIFFFFFFFFBFFFB<B
Thanks!
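To locate records like the one above without reading the file by eye, a small scanner can flag any line that sits between a complete four-line record and the next header. This is a minimal Python sketch, not part of RNAflow or SortMeRNA. The names find_stray_lines and scan_fastq_gz are made up for illustration, and the code assumes plain four-line FASTQ records (no wrapped sequences). A stray line that happens to start with @ (a valid quality character) is not reported; it is mistaken for a header, and the scan then stops with a ValueError at the missing + separator.

```python
import gzip
from typing import Iterable, Iterator, List, Tuple


def find_stray_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (1-based line number, text) for lines found between records."""
    position = 0  # index of the expected line within the current 4-line record
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if position == 0:
            if not line:
                continue  # ignore blank lines between records
            if not line.startswith("@"):
                # A header was expected here, so this line is extra,
                # for example an orphaned quality string.
                yield number, line
                continue
        if position == 2 and not line.startswith("+"):
            raise ValueError(f"line {number}: expected '+' separator")
        position = (position + 1) % 4


def scan_fastq_gz(path: str) -> List[Tuple[int, str]]:
    """Scan a gzip-compressed FASTQ file and return all stray lines."""
    with gzip.open(path, "rt") as handle:
        return list(find_stray_lines(handle))
```

For the file in this issue, the call would be scan_fastq_gz("Pt31.R2.other.fastq.gz"); each returned tuple gives the line number and content of an extra line.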
Hey, thanks for reporting!
Ok, that's weird but thanks for already digging for the cause of the error. When you run the pipeline w/o the SortMeRNA step, does it work? (--skip_sortmerna, https://github.com/hoelzer-lab/rnaflow/blob/master/main.nf#L945)
Just to be sure, are you using the latest -r 1.4.7 of the pipeline?
It looks like SortMeRNA removes a read but "forgets" to remove the quality line. I fear that is something we can not easily fix in RNAflow directly.
Yes, it looks like it is SortMeRNA, as rnaflow runs fine with --skip_sortmerna but gives the above error without it. Both were run using -r 1.4.7.
Thanks!
Thanks! I fear we don't have capacity to fix this soon. Please skip sortmerna then and clean the reads manually. You can also try https://github.com/rki-mf1/clean with the remove rRNA parameter. At its core, this uses the same sortmerna database but a faster mapping-based approach. After that, you can use the cleaned reads in rnaflow and skip sortmerna.
Sorry for the hassle
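For the manual cleaning route, one possible approach is to rewrite the SortMeRNA output and drop any line that does not belong to a four-line record. A minimal sketch follows. repaired_records and repair_fastq_gz are hypothetical helper names, not RNAflow or SortMeRNA functions. The code assumes that the first quality line after each + is the correct one, which this thread does not confirm, so comparing a repaired record against the raw SRA file is worth doing. It also does not check that R1 and R2 stay in sync.

```python
import gzip
from typing import Iterable, Iterator, List


def repaired_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield 4-line records, skipping lines found between records."""
    record: List[str] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not record and not line.startswith("@"):
            continue  # stray or blank line where a header was expected
        if len(record) == 2 and not line.startswith("+"):
            raise ValueError(f"line {number}: expected '+' separator")
        record.append(line)
        if len(record) == 4:
            # Sequence and quality must have the same length in valid FASTQ.
            if len(record[1]) != len(record[3]):
                raise ValueError(f"line {number}: sequence/quality length mismatch")
            yield record
            record = []
    if record:
        raise ValueError("input ends with an incomplete record")


def repair_fastq_gz(src: str, dst: str) -> int:
    """Copy src to dst without stray lines; return the number of records written."""
    count = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for record in repaired_records(fin):
            fout.write("\n".join(record) + "\n")
            count += 1
    return count
```

A call might look like repair_fastq_gz("Pt31.R2.other.fastq.gz", "Pt31.R2.fixed.fastq.gz"), where the output name is only an example. Running the same function on the R1 file and comparing the two returned counts gives a quick sanity check that both mates still hold the same number of reads.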
Thanks for the recommendation and the whole workflow!
Closing this because we probably can't fix it on the RNAflow side.