wdecoster/nanostat

Multithread support for fastq

Closed this issue · 6 comments

Hi,

I am running NanoStat on a concatenated fastq file from 10 runs, about 46 GB in size. Despite requesting 16 CPUs, it only uses 1. I used

/usr/bin/python3 /home/gen/.local/bin/NanoStat -t 16 --fastq pass.fastq --outdir passreport -n Ab1

It's been running for over an hour. Is there any way to get it to run faster?

Ming

Hi Ming,

That's right. The parallel part only works when feature extraction is done from multiple files.
Could you share the log file to show in which phase it spends most of its time? Speed might also depend on your filesystem.

Wouter

Hi Wouter,

There wasn't a log file generated, but that could be because I didn't ask for one. Would it be possible for NanoStat to split the fastq into X smaller files (based on the number of CPUs requested), process the smaller files in parallel, and then summarise the whole dataset in a single summary file?

Regards,
Ming

Oh, you are right, I did not add logs to NanoStat. My bad!

It would be possible to split your fastq using unix split, and then feed those to NanoStat. For now I don't have the time or intention to implement this directly into NanoStat (or other NanoPack scripts).
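
A minimal sketch of that split step, in case it helps. It assumes an unwrapped fastq (exactly 4 lines per read), so every chunk must hold a multiple of 4 lines. The function name `split_fastq`, the `chunk_` prefix, and the chunk size are illustrative choices only. The commented NanoStat call also assumes that `--fastq` accepts several files at once, which is what the multi-file parallelism described above suggests.

```shell
# Split a fastq into chunks that end on read boundaries.
# Assumption: every read occupies exactly 4 lines (no wrapped sequences).
split_fastq() {
    # $1 = input fastq, $2 = reads per chunk, $3 = output prefix
    # split names the chunks <prefix>aa, <prefix>ab, ...
    split -l "$(( $2 * 4 ))" "$1" "$3"
}

# Hypothetical usage with the file from this issue:
#   split_fastq pass.fastq 1000000 chunk_
#   NanoStat -t 16 --fastq chunk_* --outdir passreport -n Ab1
```

Because all chunks go into a single NanoStat run, there would be one report for the whole dataset and no per-chunk output to recombine.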

Cheers,
Wouter

Hi Wouter,

I noticed a substantial improvement in speed after splitting the fastq file into smaller chunks, which enabled parallel processing by NanoStat.

Regards,
Ming

Thanks for the feedback!

@MingDeakin Could you please post the approach you used to do this? Did you simply chop up the file and run NanoStat on each piece? If so, how did you recombine the output?