Interleaved paired-end FASTQ file [feature request]

Question

sjackman opened this issue 6 years ago · 9 comments

I prefer using interleaved paired-end FASTQ, for example as output by seqtk mergepe. Does or could Flowcraft support this format?

Answer 1 · 2018-10-01T16:04:14.000Z

Now that I think about it, interleaved FASTQ files would be a good first step toward supporting long reads (Oxford Nanopore and PacBio) and single-end Illumina reads in a single FASTQ file.

Answer 2 · 2018-10-01T16:35:22.000Z

Hello! Support for other types of raw input data, such as long reads, is on our roadmap (https://github.com/assemblerflow/flowcraft/wiki/Roadmap).

We've been discussing how we can implement this with the way components work. Right now we define a component to work with a specific input type, so spades requires paired-end reads as input.

An easy way to add support for multiple input types is to duplicate each component for every input type it accepts. For example, for spades we would have spades_paired_end, spades_single_end and spades_interleaved. We could shorten the names, for example "spades_pe" instead of "spades_paired_end", but it still might get confusing.

Answer 3 · 2018-10-01T18:54:14.000Z

As discussed in #62, we would like to support both single-end and paired-end reads in the future. I would prefer a solution that avoids multiplying the components by each specific input type and instead sorts out the type of input automatically at the start. It's quite straightforward to tell paired-end from single-end input, but not single-end from interleaved paired-end. Perhaps we could add a parameter that establishes the type of the fastq input, for instance

--fastq-type = "paired" # out of "paired", "single", "interleave-paired" etc

And use the information from this option to set up the right channels and any modifications that the software may require to process the different kinds of input.
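As a rough illustration of the idea, here is a minimal plain-Python stand-in for that channel setup. It is not FlowCraft or Nextflow code: the function name `group_inputs`, the `_1`/`_2` (or `_R1`/`_R2`) file naming convention, and the returned `(sample, files)` tuples are all assumptions made for this sketch. Only the three option values come from the proposal above.

```python
import os
import re
from collections import defaultdict

# Values proposed in the comment above.
FASTQ_TYPES = ("paired", "single", "interleave-paired")

# Assumed naming convention: sample_1.fq.gz / sample_2.fq.gz (or _R1 / _R2).
MATE_PATTERN = re.compile(r"^(?P<sample>.+?)_R?(?P<mate>[12])\.(fq|fastq)(\.gz)?$")
EXT_PATTERN = re.compile(r"\.(fq|fastq)(\.gz)?$")


def group_inputs(paths, fastq_type):
    """Return a sorted list of (sample, [files]) tuples for the given input type."""
    if fastq_type not in FASTQ_TYPES:
        raise ValueError(f"unknown fastq type: {fastq_type!r}")

    if fastq_type != "paired":
        # Single-end and interleaved input: one file per sample.
        return [
            (EXT_PATTERN.sub("", os.path.basename(p)), [p]) for p in sorted(paths)
        ]

    # Separate paired-end files: collect both mates of each sample.
    mates = defaultdict(dict)
    for path in paths:
        match = MATE_PATTERN.match(os.path.basename(path))
        if match is None:
            raise ValueError(f"cannot tell which mate {path!r} is")
        mates[match["sample"]][match["mate"]] = path

    groups = []
    for sample in sorted(mates):
        files = mates[sample]
        if set(files) != {"1", "2"}:
            raise ValueError(f"sample {sample!r} is missing a mate")
        groups.append((sample, [files["1"], files["2"]]))
    return groups
```

With this shape, a downstream component would only need to look at how many files each tuple carries plus the declared type, rather than existing in one copy per input type.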

Answer 4 · 2018-10-01T19:41:24.000Z

Would it be reasonable for integrity_coverage to determine whether each file is interleaved, non-interleaved, or single?
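A minimal sketch of what such a check could look like, assuming four-line FASTQ records and mates that share the first whitespace-delimited token of the header (optionally followed by `/1` or `/2`). The function names are hypothetical and this is not how integrity_coverage is implemented. It can only say whether a single file looks interleaved; telling a lone non-interleaved mate file from a single-end file still needs the second file or a user-supplied hint.

```python
import gzip
import itertools


def template_names(lines, max_records):
    """Yield the template name of up to max_records FASTQ records."""
    # Every fourth line, starting with the first, is a record header.
    for header in itertools.islice(lines, 0, 4 * max_records, 4):
        fields = header[1:].split()
        if not fields:
            continue  # tolerate a trailing blank line
        name = fields[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]  # drop the mate suffix
        yield name


def looks_interleaved(lines, pairs=500):
    """Return True if the first records pair up as adjacent mates."""
    names = list(template_names(lines, 2 * pairs))
    if len(names) < 2 or len(names) % 2:
        return False
    return all(a == b for a, b in zip(names[0::2], names[1::2]))


def file_looks_interleaved(path, pairs=500):
    """Apply looks_interleaved to a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return looks_interleaved(handle, pairs)
```

Only the first `pairs` pairs are inspected, so the check stays cheap on large files. A file that mixes paired and unpaired reads would be reported as not interleaved by this sketch.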

Answer 5 · 2018-10-01T19:45:08.000Z

You may want to pass both short read and long read FASTQ files to an assembler, such as Unicycler. I'd suggest…
nextflow run unicycler.nf --fastq-paired='pe.fq.gz' --fastq-unpaired='se.fq.gz' --fastq-long='long.fq.gz'
--fastq could keep its original meaning of non-interleaved paired FASTQ files.

Answer 6 · 2018-10-02T09:16:10.000Z

Hi Shaun
Thanks for the suggestions. For those cases with multiple read inputs we will probably have to create a specific FlowCraft module for each use case. For integrity_coverage to work as you suggest, the input stream for the module needs to be completely redone. But why do you prefer interleaved files? In our use cases for microbiology we mostly find separated PE files...

Answer 7 · 2018-10-02T17:14:17.000Z

All common sequencing types can be represented using a single FASTQ file: paired-end short-read sequencing, single-end short-read sequencing, long read sequencing. Only one of these sequencing types can be represented using two FASTQ files, namely paired-end short-read sequencing. That's the biggest reason to support interleaved FASTQ files. A single file type makes it easier to make the command line interface of tools consistent.

Paired-end and single-end reads may be interleaved in the same file, requiring only a single FASTQ file. Separating them requires three FASTQ files, complicating the command line interface.

Interleaved FASTQ files can easily be streamed (piped) from one tool to the next.

gunzip -c interleaved.fq.gz | trimadap /dev/stdin | bwa mem -p ref.fa /dev/stdin | samtools view -u -F4 | samtools sort -o aligned.bam
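For readers unfamiliar with the layout: an interleaved file simply stores mate 1 followed by mate 2 for every pair, which is why a single stream is enough. The sketch below shows that layout in plain Python, assuming four-line records and two input files with the same number of reads in the same order. The function names are made up for illustration, and in practice a tool such as seqtk mergepe (mentioned in the question) does this job.

```python
import itertools


def records(lines):
    """Group an iterable of FASTQ lines into four-line records."""
    iterator = iter(lines)
    while True:
        record = list(itertools.islice(iterator, 4))
        if not record:
            return
        yield record


def interleave(lines1, lines2):
    """Yield the lines of an interleaved stream: mate 1, mate 2, mate 1, ..."""
    # zip stops at the shorter input, so unequal files are silently truncated.
    for rec1, rec2 in zip(records(lines1), records(lines2)):
        yield from rec1
        yield from rec2


def deinterleave(lines):
    """Split an interleaved stream back into (mate1_lines, mate2_lines)."""
    mate1, mate2 = [], []
    for index, record in enumerate(records(lines)):
        (mate1 if index % 2 == 0 else mate2).extend(record)
    return mate1, mate2
```

Because `interleave` is a generator, it can sit in the middle of a pipe without holding either file in memory. `deinterleave` collects into lists only to keep the example short.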

Answer 8 · 2018-10-02T18:33:47.000Z

@sjackman I agree with your assessment, but for our use cases I don't think I have ever used an interleaved fastq file since I started working with microbial NGS data. I used single-end for about one try with Ion Torrent data. Since then it's Illumina PE all the way! If ENA/SRA would support download as interleaved that would change of course. Nevertheless you make a good point for inclusion, especially if long read sequencing is here to stay ;-).

Answer 9 · 2018-10-02T19:19:38.000Z

"If ENA/SRA would support download as interleaved that would change of course."

sratoolkit does support downloading interleaved FASTQ.

fastq-dump -Z --split-spot SRR7878800 | trimadap /dev/stdin | bwa mem -p ref.fa /dev/stdin | samtools view -u -F4 | samtools sort -o aligned.bam