Fastq preprocessing causing hisat2 error

Question


DAWells opened this issue 8 months ago · 5 comments

Thanks for making RNAflow

I'm getting an error about incorrect fastq file format from hisat2. Looking at the fastq file, the problem is a read that has two lines of quality information, i.e. a single record with 5 lines instead of 4. This seems to be caused by SortMeRNA, as the raw fastq file is correctly formatted and is still correct after fastp trimming.

I'm working with paired-end SRA data; the issue occurs in the R2 reads from SRR4289730.

Is there a fix other than skipping the SortMeRNA step or manually correcting the offending read?

Here's the full error message:

Error executing process > 'preprocess_illumina:hisat2 (Pt31)'

Caused by:
  Missing output file(s) `Pt31_summary.log` expected by process `preprocess_illumina:hisat2 (Pt31)`

Command executed:

  hisat2 -x reference -1 Pt31.R1.other.fastq.gz -2 Pt31.R2.other.fastq.gz -p 20 --new-summary --summary-file Pt31_summary.log  | samtools view -bS | samtools sort -o Pt31.sorted.bam -T tmp --threads 20

Command exit status:
  0

Command output:
  (empty)

Command error:
  Error: Read BBFFFFFFFFFFIIIIIIIIIIIIIFIIIIIIIIIIFFFBBFFFFFBF7BBFFBBBFBBBFFFFBBFFFBFBBB<BBFFBBFBFFFFFFB7BBFFFFF has more read characters than quality values.
  terminate called after throwing an instance of 'int'
  Aborted (core dumped)
  (ERR): hisat2-align exited with value 134
  [bam_sort_core] merging from 60 files and 20 in-memory blocks...

Work dir:
  /data/vaccitech/hugo_rnaseq/work/e2/2cd95117732737dde868883d150ab8

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

This is the offending section of the fastq file

@SRR4289730.80811216_2
CTCATGCAGTTTACATTCATTTCTTCCACAGAGAAACCACGGGAAGCTTGTTTTGACCCAGGAAATATAATGAATGGGACAAGAGTTGGAACAGACTTC
+SRR4289730.80811216 80811216 length=99
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFFFFFF<BFFFFFFFFFFFF
BBBFFFFFFFFFFIIIIIIIIIIIIIFIIIIIIIIIIFFFBBFFFFFBF7BBFFBBBFBBBFFFFBBFFFBFBBB<BBFFBBFBFFFFFFB7BBFFFFF
@SRR4289730.80811217_2
GGGTTCATGCCATTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGCGCCCGCCACCACGCCCAGCTAATTTCTTTTTGTATTTTTTTTAGTAA
+SRR4289730.80811217 80811217 length=99
BBBFFFFFFFFFFIIIIIFFFFBFFBFFF07BFFFIIIIIIBFIFIIFFIFIIIIFFFFFIFFBBFFFFIF<FFFBFFFIIIFIFFFFFFFFBFFFB<B
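
To locate records like this without scrolling through the file, a strict four-line scan is probably enough. The sketch below is a minimal, hypothetical helper (`first_fastq_error` is not part of RNAflow, SortMeRNA or hisat2). It assumes unwrapped FASTQ, i.e. exactly four lines per record, and stops at the first violation because the record framing is lost after that point.

```python
from itertools import islice
from typing import Iterable, Optional, Tuple


def first_fastq_error(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, reason) for the first broken record, or None.

    Assumption: every record is exactly four lines (no wrapped sequences).
    """
    # Number the lines so the offending position can be reported.
    numbered = enumerate((ln.rstrip("\r\n") for ln in lines), start=1)
    while True:
        record = list(islice(numbered, 4))
        if not record:
            return None  # clean end of file
        if len(record) < 4:
            return (record[0][0], "truncated record")
        (n, head), (_, seq), (_, plus), (_, qual) = record
        if not head.startswith("@"):
            return (n, "header does not start with '@'")
        if not plus.startswith("+"):
            return (n + 2, "separator does not start with '+'")
        if len(seq) != len(qual):
            return (n + 3, "sequence and quality lengths differ")
```

Called on a text-mode handle such as `gzip.open("Pt31.R2.other.fastq.gz", "rt")`, it should report the line that holds the extra quality string, since that line ends up where the next header is expected.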

Thanks!

Answer 1 · 2024-05-08T15:30:46.000Z

Hey, thanks for reporting!

Ok, that's weird but thanks for already digging for the cause of the error. When you run the pipeline w/o the SortMeRNA step, does it work? (--skip_sortmerna, https://github.com/hoelzer-lab/rnaflow/blob/master/main.nf#L945)

Just to be sure, are you using the latest -r 1.4.7 of the pipeline?

It looks like SortMeRNA removes a read but "forgets" to remove the quality line. I fear that is something we cannot easily fix in RNAflow directly.

Answer 2 · 2024-05-17T10:53:56.000Z

Yes, it looks like it is SortMeRNA, as RNAflow runs fine with --skip_sortmerna but gives the above error without it. Both were run using -r 1.4.7.

Thanks!

Answer 3 · 2024-05-21T17:25:53.000Z

Thanks! I fear we don't have capacity to fix this soon. Please skip sortmerna then and clean the reads manually. You can also try

https://github.com/rki-mf1/clean

with the remove rRNA parameter. At its core, this uses the same SortMeRNA database but a faster mapping-based approach. After that, you can use the cleaned reads in RNAflow and skip SortMeRNA.
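
For the manual cleaning route, a rough sketch of how a stray line could be dropped is below. It is not part of RNAflow; `repair_fastq` and its `stray` argument are hypothetical names. It assumes unwrapped four-line records and keeps the first quality line of a five-line record, which may or may not be the one that belongs to the read, so treat the output with some care.

```python
from collections import deque
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

Record = Tuple[str, str, str, str]


def _is_record(window: Sequence[str]) -> bool:
    """True if four consecutive lines look like one FASTQ record."""
    head, seq, plus, qual = window
    return (
        head.startswith("@")
        and plus.startswith("+")
        and len(seq) > 0
        and len(seq) == len(qual)
    )


def repair_fastq(
    lines: Iterable[str], stray: Optional[List[str]] = None
) -> Iterator[Record]:
    """Yield well-formed records and skip lines that do not start one.

    Skipped lines are appended to `stray` (if given) for later inspection.
    """
    it = (ln.rstrip("\r\n") for ln in lines)
    window = deque(islice(it, 4))  # four-line lookahead
    while window:
        if len(window) == 4 and _is_record(window):
            yield tuple(window)
            window.clear()
            window.extend(islice(it, 4))
        else:
            # The first line cannot start a valid record: drop it and
            # slide the window forward by one line.
            dropped = window.popleft()
            if stray is not None:
                stray.append(dropped)
            nxt = next(it, None)
            if nxt is not None:
                window.append(nxt)
```

Each yielded record can be written back with `"\n".join(record) + "\n"` to a `gzip.open(..., "wt")` handle. Since the data are paired, it is worth checking afterwards that R1 and R2 still contain the same number of records in the same order.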

Sorry for the hassle

Answer 4 · 2024-05-22T09:43:34.000Z

Thanks for the recommendation and the whole workflow!

Answer 5 · 2024-05-22T13:34:32.000Z

Closing this because we probably can't fix it on the RNAflow side