Support for paired-end reads?
tseemann opened this issue · 5 comments
Are there plans to support Illumina paired reads?
HA, I had an internal bet with myself about whether this would be the first issue raised.
I am happy to support any feature that users are likely to want/need and that stays within the scope of the project. This definitely seems to tick both of those boxes.
I haven't actually worked much with Illumina data previously so I will list some of the thoughts I had around this and it would be great to have your (and anyone else's) thoughts:
- Suggesting that the user subsamples one of the read pair files to half the overall desired coverage, and then taking the read IDs in the subset and grabbing those out of the other read pair. (I would obviously provide a code snippet for how to do this).
- Having a
--paired-end
flag and ask for a single fastq that has the read pairs combined. One thing to think about would be whether I would need this file interleaved or not. I guess that would depend on the implementation I go with. Would interleave vs. non-interleave be a significant issue for users?
Please also throw-in any other ideas you have.
Ideally you would take 4 parameters like --in1 --in2 --out1 --out2
because not enough tools support interleaved format.
But for elegance i suspect you don't want that, then you could just take interleaved PE (easily created with <(seqtk mergepe R1.fq.gz R2.fq.gz)
. The output is usually demerged with two steps (unfortunately) using seqtk seq -1 out.fq | gzip > R1.fq.gz
and then with -2
for R2
. I need to find a tool that can write both from an input stream to two files.
Ok. I will have a think about how much to achieve this then. I am heading on holidays for 5 weeks this weekend though so it is unlikely I will get around to it before then.
Regarding demerging in one step, pyfastaq can do this.
So I could envision something like this in the future
rasusa -i interleaved.fq -g 4g -c 20 | fastaq deinterleave - r1.fq r2.fq
Alright. I am finally revisiting this. Sorry for the massive delay. Nothing like COVID-19 to make you revisit all the stuff you should have done ages ago.
On reflection, the interleaved file method is not ideal. I think one of two ideas should work
- Allow
--input/-i
to be passed twice - Switch
--input
to be a positional parameter instead and allow passing up to two files
In both instances, if two input files are given, it will be assumed it is illumina data.
One issue I do foresee with option 2 is that it will break any pipelines that may be using rasusa so I would have to also support --input
as well. The more I type the more I like the idea of option 1.
How does that sound?