multithreading doesn't work for fastq files
tolot27 opened this issue · 7 comments
It looks like multithreading does not work for the fastq format. I started nanostat with -t 10 or -t 20, but only one thread is ever used, as top shows me.
Hi tolot27,
You are absolutely right, multithreaded data extraction only works for bam format. I'll adapt the README to make this clear. Do you think the data extraction (from fastq format) is too slow or could benefit from multithreading? I might look into this (and welcome pull requests) but I thought it was sufficiently fast.
Cheers,
Wouter
Yes, data extraction from fastq files is really slow compared to the summary file. For small datasets it is not noticeable, but with tens of thousands of reads it takes a really long time just to calculate some stats.
My workflow is to use albacore for basecalling, then nanostat with the summary file, then nanofilt with a quality and length filter, then nanostat again, now with the fastq files. Afterwards I use porechop to trim adapters, then nanostat again and sometimes also fastqc.
I believe fastq extraction would benefit from multithreading. Combining a summary file with the fastq file could also be beneficial, as long as the reads in the fastq file are unmodified. Then a simple lookup of the read ID in the summary file could avoid the calculations.
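To illustrate the lookup idea, here is a minimal sketch. It is not how nanostat works; it assumes a tab-separated summary file with the columns read_id, sequence_length_template and mean_qscore_template, and a fastq file with exactly four lines per record. The function names are hypothetical.

```python
import csv
import io


def load_summary(handle):
    """Map read_id -> (length, mean quality) from a tab-separated summary file.

    The column names are an assumption about the summary file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["read_id"]: (
            int(row["sequence_length_template"]),
            float(row["mean_qscore_template"]),
        )
        for row in reader
    }


def fastq_read_ids(handle):
    """Yield read IDs from a fastq stream with four lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 0 and line.strip():
            # The header looks like "@<read_id> <optional description>".
            yield line[1:].split()[0]


def stats_from_summary(fastq_handle, summary):
    """Look up every fastq read in the summary instead of recomputing stats.

    Returns the stats that were found and the IDs that were missing,
    so the caller can fall back to computing those from the fastq.
    """
    found, missing = [], []
    for read_id in fastq_read_ids(fastq_handle):
        if read_id in summary:
            found.append(summary[read_id])
        else:
            missing.append(read_id)
    return found, missing
```

This only gives correct numbers for reads that were not trimmed or otherwise changed after basecalling, so it would fit the step after nanofilt filtering but not the step after porechop.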
How long is "really a long time"? I'm not sure much can be gained by parallelizing.
For about 27,000 reads it took 2m12s and for about 139,000 reads it took 10m53.
Using summary file, analyzing about 143,000 reads takes just 2 seconds.
I assume parallelizing the fastq extraction would give a nearly linear speedup.
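As a quick check on the timings quoted above, the sketch below converts them to reads per second. The read counts are the approximate figures from this comment; the helper names are only for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds timing to seconds."""
    return 60 * minutes + seconds


def reads_per_second(reads, seconds):
    """Average throughput for one run."""
    return reads / seconds


# (approximate read count, wall time in seconds) as reported above
runs = {
    "fastq, 27,000 reads": (27_000, to_seconds(2, 12)),
    "fastq, 139,000 reads": (139_000, to_seconds(10, 53)),
    "summary, 143,000 reads": (143_000, 2),
}

for name, (reads, seconds) in runs.items():
    print(f"{name}: {reads_per_second(reads, seconds):.0f} reads/s")
```

Both fastq runs come out at a little over 200 reads per second, so the run time grows roughly in proportion to the number of reads, while the summary file is processed several hundred times faster.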
Yes, extraction from the summary file is (obviously) much faster. Coincidentally, 10m53 is the ideal amount of time to get yourself a cup of coffee.
More seriously, there is always some overhead while parallelizing - but I'll play around with it and see if it's any good.
I've added parallel feature extraction for data in fastq format in NanoStat-0.3.0.
But on my system, going beyond 4 threads did not improve anything. This makes sense because the individual calculation on a record is not too expensive - and the master thread still has to do a lot of work.
In my limited tests, this reduced the time taken for NanoStat by half, so that's already a nice improvement. I'm interested to find out how this works for you.
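For readers curious about the general pattern, here is a minimal sketch of per-record parallel extraction. It is not the NanoStat-0.3.0 code: the function names are hypothetical, Phred+33 quality encoding and four-line fastq records are assumed, and the mean quality is averaged over error probabilities, which is one common choice.

```python
from concurrent.futures import ProcessPoolExecutor
from math import log10


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples; runs in the main process."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the "+" separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:].split()[0], seq, qual


def record_stats(record):
    """Compute (read_id, length, mean quality) for one record.

    Assumes Phred+33 encoding; the mean is taken over error probabilities.
    """
    read_id, seq, qual = record
    if not qual:
        return read_id, 0, 0.0
    probs = [10 ** (-(ord(c) - 33) / 10) for c in qual]
    return read_id, len(seq), -10 * log10(sum(probs) / len(probs))


def fastq_stats(handle, workers=1, chunksize=1000):
    """Extract per-read stats, optionally spread over worker processes."""
    records = read_fastq(handle)
    if workers <= 1:
        return list(map(record_stats, records))
    # Parsing and sending records to the workers still happens in this
    # process, which limits the gain from adding more workers.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(record_stats, records, chunksize=chunksize))
```

When using more than one worker, call fastq_stats from under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn new processes. Because the per-record work is cheap, a larger chunksize helps keep the cost of passing records between processes down.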
Yes, for large files (>300 GB) multithreading would be helpful. I have 128 threads on my AMD EPYC system and it would be great to use them...