Multithread support for fastq
Closed this issue · 6 comments
Hi,
I am running NanoStat on a concatenated fastq file from 10 runs (~46 GB file size), and despite requesting 16 CPUs it is only using one. I used:
/usr/bin/python3 /home/gen/.local/bin/NanoStat -t 16 --fastq pass.fastq --outdir passreport -n Ab1
It's been running for over an hour. Is there any way we can get it to run faster?
Ming
Hi Ming,
That's right. The parallel part only works when feature extraction is done from multiple files.
Could you share the log file to show which phase is taking so long? Speed might also depend on your filesystem.
Wouter
Hi Wouter,
There wasn't a log file generated, but that could be because I didn't request one. Would it be possible to have NanoStat split the fastq into X smaller files (based on the number of CPUs requested), run on those smaller fastq files in parallel, and then summarise the dataset as a whole in one summary file?
Regards,
Ming
Oh, you are right, I did not add logs to NanoStat. My bad!
It would be possible to split your fastq using unix split, and then feed those to NanoStat. For now I don't have the time or intention to implement this directly into NanoStat (or other NanoPack scripts).
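As a minimal sketch of that approach (the file names here mirror the command above but are otherwise assumed): a fastq record is exactly 4 lines, so the line count passed to `split -l` must be a multiple of 4, or reads will be broken across chunk boundaries. This demo builds a tiny stand-in fastq; for a 46 GB file something like `-l 4000000` (1M reads per chunk) would be more realistic.

```shell
# Make a tiny demo fastq (3 reads = 12 lines) standing in for the real pass.fastq.
for i in 1 2 3; do printf '@read%s\nACGT\n+\nIIII\n' "$i"; done > pass.fastq

# Split into chunks of one read each; -l must always be a multiple of 4
# so that no fastq record is cut in half.
split -l 4 pass.fastq chunk_

ls chunk_*   # chunk_aa chunk_ab chunk_ac

# If your NanoStat version accepts several files after --fastq, the chunks
# can then be processed in parallel in a single invocation, e.g.:
# NanoStat -t 16 --fastq chunk_* --outdir passreport -n Ab1
```

The NanoStat call itself is left commented out above since the exact multi-file behaviour may vary between versions; the essential point is the 4-line alignment of the split.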
Cheers,
Wouter
Hi Wouter,
I noticed a substantial improvement in speed after splitting the fastq file into smaller chunks, which enabled parallel processing by NanoStat.
Regards,
Ming
Thanks for the feedback!
@MingDeakin Could you please post the method/approach that you used to do this? Or did you simply use `split` to chop up the file and run NanoStat on each chunk? If so, how did you recombine the output?