ExpressionAnalysis/ea-utils

fastq-mcf: Allow merging multiple FASTQ input files to one (or two in case of paired-end) cleaned fastq file

Closed this issue · 4 comments

It would be nice if fastq-mcf could write multiple input FASTQ files to one 
cleaned file (single-end sequencing) or to only two output FASTQ files (forward 
and reverse), so it is possible to clean FASTQ files from the same sample 
located in multiple lanes at once and only have one (or two) FASTQ files, so 
you don't need to create a merged cleaned FASTQ file afterwards yourself.


Also is this TODO still valid? http://code.google.com/p/ea-utils/wiki/FastqMcf

====================
When discarding one read for being "too short", it has to discard both pairs. 
For a sequencing run of normal quality this is not an issue. It should, though, 
write "un-mated" reads (whose mate was skipped) to a separate file. Typically, 
since these read mates were poor quality, it's not really useful... but it can 
be for diagnostics. I've seen runs where these provide valuable data.
====================

When cleaning paired-end reads, the cleaned forward and reverse FASTQ files 
seem to have the same number of reads.


It would also be nice to have the number of reads in the cleaned files itself 
in the statistics info:
 Kept reads = ((Total reads) - (Too short after clip))

Files: 2
Total reads: 46911033
Too short after clip: 1420553
Clipped 'end' reads (RNAseq.R1.fastq.gz): Count 1369131, Mean: 23.85, Sd: 15.35
Trimmed 5391478 reads (RNAseq.R1.fastq.gz) by an average of 17.01 bases on 
quality < 15
Clipped 'end' reads (RNAseq.R2.fastq.gz): Count 1113190, Mean: 16.53, Sd: 8.83
Trimmed 9609085 reads (RNAseq.R2.fastq.gz) by an average of 26.38 bases on 
quality < 15


Original issue reported on code.google.com by hulselma...@gmail.com on 9 Oct 2013 at 1:21

If you have, say, 2 fastq's , you can do this:

rm fif; mkfifo fif;
( gunzip -c fq1.gz >> fif;  gunzip -c fq1.gz >> fif ) &
fastq-mcf adap.txt fif > out

Unlike much bioinformatics software, fastq-mcf is "stream friendly" .... 
meaning it buffers what it needs during initial sampling steps.


Original comment by earone...@gmail.com on 9 Oct 2013 at 2:41

Original comment by earone...@gmail.com on 9 Oct 2013 at 2:41

  • Changed state: WontFix
Thanks.

Thanks for this very useful tip.

It can also be done without explicitly making named pipes, but by using process 
substitution:
  http://tldp.org/LDP/abs/html/process-sub.html

fastq-mcf \
  -o cleaned.R1.fq.gz \
  -o cleaned.R2.fq.gz \
  adapters.fa \
  <(zcat uncleaned.lane1.R1.fq.gz uncleaned.lane2.R1.fq.gz;) \
  <(zcat uncleaned.lane1.R2.fq.gz uncleaned.lane2.R2.fq.gz;)



fastq-mcf \
  -o cleaned.R1.fq.gz \
  -o cleaned.R2.fq.gz \
  adapters.fa \
  <(gunzip -c uncleaned.lane1.R1.fq.gz uncleaned.lane2.R1.fq.gz;) \
  <(gunzip -c uncleaned.lane1.R2.fq.gz uncleaned.lane2.R2.fq.gz;)


Can this info being added to the wiki?
http://code.google.com/p/ea-utils/wiki/FastqMcf


Thanks

Original comment by hulselma...@gmail.com on 10 Oct 2013 at 8:05

Cool!  I didn't even know how process substitution worked.   

Original comment by earone...@gmail.com on 10 Oct 2013 at 2:24