Support for paired-end reads?

Question

Support for paired-end reads?

tseemann opened this issue 5 years ago · 5 comments

Are there plans to support Illumina paired reads?

Answer 1 · 2019-11-25T13:53:05.000Z

HA, I had an internal bet with myself about whether this would be the first issue raised.

I am happy to support any feature that users are likely to want/need and that stays within the scope of the project. This definitely seems to tick both of those boxes.

I haven't actually worked much with Illumina data previously so I will list some of the thoughts I had around this and it would be great to have your (and anyone else's) thoughts:

Suggesting that the user subsamples one of the read pair files to half the overall desired coverage, and then taking the read IDs in the subset and grabbing those out of the other read pair. (I would obviously provide a code snippet for how to do this).
Having a --paired-end flag and ask for a single fastq that has the read pairs combined. One thing to think about would be whether I would need this file interleaved or not. I guess that would depend on the implementation I go with. Would interleave vs. non-interleave be a significant issue for users?

Please also throw-in any other ideas you have.

Answer 2 · 2019-11-26T02:35:57.000Z

Ideally you would take 4 parameters like --in1 --in2 --out1 --out2 because not enough tools support interleaved format.

But for elegance i suspect you don't want that, then you could just take interleaved PE (easily created with <(seqtk mergepe R1.fq.gz R2.fq.gz). The output is usually demerged with two steps (unfortunately) using seqtk seq -1 out.fq | gzip > R1.fq.gz and then with -2 for R2. I need to find a tool that can write both from an input stream to two files.

Answer 3 · 2019-11-26T11:02:17.000Z

Ok. I will have a think about how much to achieve this then. I am heading on holidays for 5 weeks this weekend though so it is unlikely I will get around to it before then.

Regarding demerging in one step, pyfastaq can do this.

So I could envision something like this in the future

rasusa -i interleaved.fq -g 4g -c 20 | fastaq deinterleave - r1.fq r2.fq

Answer 4 · 2020-03-25T18:29:42.000Z

Alright. I am finally revisiting this. Sorry for the massive delay. Nothing like COVID-19 to make you revisit all the stuff you should have done ages ago.

On reflection, the interleaved file method is not ideal. I think one of two ideas should work

Allow --input/-i to be passed twice
Switch --input to be a positional parameter instead and allow passing up to two files

In both instances, if two input files are given, it will be assumed it is illumina data.
One issue I do foresee with option 2 is that it will break any pipelines that may be using rasusa so I would have to also support --input as well. The more I type the more I like the idea of option 1.

How does that sound?

Answer 5 · 2020-03-27T20:44:53.000Z

@mbhall88 drowning in COVID work too, but this looks good. thank you