sequencing/NxTrim

Output a fastq file (without gzip) when processing a fastq.gz file

Can NxTrim output a fastq file (without gzip) when the input is a fastq.gz file?

If the downstream process can read the fastq file uncompressed, it will run faster.

Sorry, I don't think I will implement this. It is not a common use case.

If you have a program that does not read gzipped fastq, you can always create a named pipe with mkfifo:

mkfifo r1                            # create a named pipe
zcat example/MP_R1.fastq.gz > r1 &   # decompress into it in the background
cat r1                               # the consumer (here just cat) reads r1 as plain fastq

The programs I am using, such as trimmomatic and bwa, can read gzipped fastq, but reading gzipped fastq performs worse than reading plain fastq, because the gunzip process does not support multiple threads.
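
For example, decompression can at least be moved to a separate process (a rough sketch, assuming pigz is installed; the reference and read file names are placeholders):

# decompress on a separate core; pigz overlaps reading, checksumming and
# writing in extra threads, so it is usually somewhat faster than plain gzip
pigz -dc reads_R1.fastq.gz > reads_R1.fastq

# or skip the intermediate file entirely with bash process substitution
bwa mem ref.fa <(pigz -dc reads_R1.fastq.gz) <(pigz -dc reads_R2.fastq.gz) > out.sam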

The fastq.gz file saves disk space, but the NxTrim output is not the final result file, so we can delete it after it has been used.

And it is not the final fastq result anyway, because it lacks a function to remove low-quality bases?

I am not convinced of the need to trim low-quality bases. Aligners can split reads, and modern assemblers use error correction as a pre-processing step. It is not clear that a trimming heuristic does a better job than these sophisticated algorithms. I get very nice assemblies directly from the nxtrim output. See here:

https://github.com/sequencing/NxTrim/wiki/Bacterial-assembles-using-Nextera-Mate-pairs

As for performance, decompression is not a bottleneck for any serious bioinformatics task.

When aligning with bwa, I see negligible differences in compute time for uncompressed versus gzipped fastq:

bwa mem EcMG.fna -p EcMG1.mp.fastq.gz > /dev/null 
#40.838 seconds

zcat EcMG1.mp.fastq.gz > tmp.fastq
bwa mem EcMG.fna -p tmp.fastq > /dev/null 
#41.147 seconds

gzip will become a bottleneck with very fast I/O and very large files. And it's not always the case that you run bwa directly after trimming ...

So I'd go for (optional) uncompressed output as well :-)

Yes, if you can afford to store uncompressed fastq on your SSD then this might save you some time. On the other hand, on my system it is actually slightly slower to pull uncompressed fastq from a network disk (probably because the task is I/O-bound and you have to read more data).

I am not convinced, but I do take pull requests ;)

At least for the MP fraction we could use --stdout to prevent output compression. But this contains both mp and unknown libraries, correct?

That is correct.

I use --stdout to pipe the output to bwa. The aligner will then flag whether the reads were FR/RF, so I can tell if reads were true mate-pairs or not, i.e. you don't need to rely on the presence of the Nextera adapter to tell if a read is a true MP when performing alignment.
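
Roughly like this (a sketch; the -1/-2 input flags and file names here are illustrative):

# nxtrim writes interleaved pairs to stdout; bwa mem -p reads them
# from stdin as pairs ("smart pairing")
nxtrim -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz --stdout \
  | bwa mem -p ref.fa - > sample.sam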

OK, makes sense (for direct alignment).
I usually do some de novo assembly of my data (1-2 Gbp genome size), and I did see differences in scaffolding when using mp with or without the unknown data.

Maybe you could separate the two by writing mp to stdout and unknown to stderr?
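
Something like this (purely illustrative of the proposal, not an existing interface):

# hypothetical: mp records on stdout, unknown records on stderr
# (real error messages would then need another channel)
nxtrim -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz --justmp --stdout \
  > sample.mp.fastq 2> sample.unknown.fastq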

Yes, for scaffolding you probably only want to use mp (and hence the --justmp flag). Unfortunately your proposed solution is about as complicated as implementing plain text output.

I would really like to see a realistic use case for unzipped fastq (with actual timings) before I consider implementing it. It would have to be at least twice as fast as using the gzipped input.

I am not sure if --justmp dumps only mp or additionally unknown. Your suggestion implies mp-only output, but the nxtrim short help tells me "--justmp - just creates a the mp/unknown libraries" ... so I am a bit confused ;-)

Concerning the speed issue ... I played around a bit. Not really a full-blown benchmark, but just enough to get an idea (if I am not completely wrong):

-rw-r--r-- 1 klages klages 85G 2016.03.29 16:49:27 athCun.raw.il.fq
-rw-r--r-- 1 klages klages 32G 2016.03.29 16:33:37 athCun.raw.il.fq.gz

A little perl script simply opens the fastq files (the .gz via open(my $fh1, "-|", "gzip -dc athCun.raw.il.fq.gz")), iterates over each line, and counts lines.

Reading the uncompressed file proceeds at roughly 260 MiB/s (as seen in htop) and takes ~330 s. The compressed file is read at about 37-44 MiB/s and takes ~800 s.
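
The same picture emerges without any script (file names as above; wc -l stands in for a trivially cheap consumer):

# uncompressed: bounded by disk read speed
time wc -l athCun.raw.il.fq

# compressed: bounded by the single-threaded gunzip
time zcat athCun.raw.il.fq.gz | wc -l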

I also used another tool, https://github.com/ADAC-UoN/fqcounter, just to check the reading rates.

This tool reads both fastq files at about the same rate as the simple perl script.

This has been tested on a local filesystem (xfs) of my workstation (HP Z800).

edit: as I alter the fastq headers after nxtrim, I can perfectly live with --stdout if I had the option to decide whether the stdout stream consists of mp-only data or mp/unknown data.

> I am not sure if --justmp dumps only mp or additionally unknown. Your suggestion implies mp-only output, but the nxtrim short help tells me "--justmp - just creates a the mp/unknown libraries" ... so I am a bit confused ;-)

Sorry, this isn't clear, and the behaviour should be changed. If you run with --justmp and no --stdout, you will get sample.mp.fastq.gz and sample.unknown.fastq.gz, the former being useful for your scaffolding. When you add --stdout, they are all mixed together. I think I should just remove the --justmp requirement for --stdout.
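
Concretely (a sketch; the -1/-2 flags and the sample names are illustrative, the output file names are as above):

# without --stdout: the two libraries land in separate files
nxtrim -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -O sample --justmp
# -> sample.mp.fastq.gz and sample.unknown.fastq.gz

# with --stdout: mp and unknown are mixed into a single stream
nxtrim -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz --justmp --stdout > sample.fastq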

In your example you are just reading the files, but my point is that if you have to process the data in some way (i.e. align it for scaffolding), the decompression won't be a bottleneck. I guess, conceivably, if you are piping it to another trimmer that is very fast, the decompression will be a bottleneck.

> edit: as I alter the fastq headers after nxtrim, I can perfectly live with --stdout if I had the option to decide whether the stdout stream consists of mp-only data or mp/unknown data.

This makes sense for a few different reasons. I will add this.

I just wanted to show that (de)compression in general may become a bottleneck with large data volumes and fast I/O. NFS mounts cannot deliver data that fast ... that's OK. Compression is slower but may be sped up with multithreading.

I have added --stdout-mp and --stdout-un in #22, which I think largely resolves this.
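
So something along these lines should work (a sketch, assuming the new flags split the stream as their names suggest; see #22 for the actual behaviour):

# mp-only stream, ready to pipe into an aligner or scaffolder
nxtrim -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz --stdout-mp > sample.mp.fastq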