Multithread support for fastq
Closed this issue · 6 comments
Hi,
I am running NanoStat on a concatenated fastq file from 10 runs (~46 GB file size), and despite requesting 16 CPUs it is only using one. I used:
/usr/bin/python3 /home/gen/.local/bin/NanoStat -t 16 --fastq pass.fastq --outdir passreport -n Ab1
It's been running for over an hour. Is there any way we can get it to run faster?
Ming
Hi Ming,
That's right. The parallel part only works when feature extraction is done from multiple files.
Could you share the log file to show which phase is taking so long? Speed might also depend on your filesystem.
Wouter
Hi Wouter,
There wasn't a log file generated, but that could be because I didn't request one. Would it be possible to have NanoStat split the fastq into X smaller files (based on the number of CPUs requested), run on those smaller fastq files in parallel, and then summarise the dataset as a whole in one summary file?
Regards,
Ming
Oh, you are right, I did not add logs to NanoStat. My bad!
It would be possible to split your fastq using unix split, and then feed those to NanoStat. For now I don't have the time or intention to implement this directly into NanoStat (or other NanoPack scripts).
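As a minimal sketch of that approach (the file names here mirror the command above but are otherwise assumed): a fastq record is exactly 4 lines, so the line count passed to `split -l` must be a multiple of 4, or reads will be broken across chunk boundaries. This demo builds a tiny stand-in fastq; for a 46 GB file something like `-l 4000000` (1M reads per chunk) would be more realistic.

```shell
# Make a tiny demo fastq (3 reads = 12 lines) standing in for the real pass.fastq.
for i in 1 2 3; do printf '@read%s\nACGT\n+\nIIII\n' "$i"; done > pass.fastq

# Split into chunks of one read each; -l must always be a multiple of 4
# so that no fastq record is cut in half.
split -l 4 pass.fastq chunk_

ls chunk_*   # chunk_aa chunk_ab chunk_ac

# If your NanoStat version accepts several files after --fastq, the chunks
# can then be processed in parallel in a single invocation, e.g.:
# NanoStat -t 16 --fastq chunk_* --outdir passreport -n Ab1
```

The NanoStat call itself is left commented out above since the exact multi-file behaviour may vary between versions; the essential point is the 4-line alignment of the split.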
Cheers,
Wouter
Hi Wouter,
I noticed a substantial improvement in speed after splitting the fastq file into smaller chunks, which enabled parallel processing by NanoStat.
Regards,
Ming
Thanks for the feedback!
@MingDeakin Could you please post the method/approach that you used to do this? Or did you simply use `split` to chop up the file and run NanoStat on each chunk? If so, how did you recombine the output?