mbhall88/rasusa

Multi-threading approach

Teklu67 opened this issue · 8 comments

Hi,
This is a very useful program, but it is taking a long time to subsample from a large fastq file. I am running it on a server and would like to run it with multi-threading, but I am a novice at programming and not sure how to do that. Any help please?
Thanks,

Hi @Teklu67. When you say "a long time", how long are we talking? And how large is your file?

Thanks so much for the quick response. It finished sampling 30x from a fq of 690 Gb (60x coverage) in 2 days. Because I have the resources to run several threads, I thought it would finish much faster if there were an option for multi-threading. Thanks!

Wow, that's a very big fastq file! Is it compressed (e.g., gzip)?

How did you install rasusa?

Yes, it is for tetraploid wheat and is compressed in .gz format. I installed it through conda.

Is your data Illumina?

Sorry, there's not much I can offer in the way of speeding rasusa up.

At some point I will look into whether multi-threading the IO is possible (i.e. batching reads).

I'll leave this open and add it to my list of things to investigate in the coming months. Sorry I can't do it sooner, but I have a lot of other research projects I am trying to juggle.

However, if you (or anyone else) would like to have a go at it, I would be very happy to receive a pull request.

It is ONT data. That is ok, thank you for your time.

In the meantime, I would suggest splitting the file up into subsets and then randomly subsampling each subset.
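As a rough illustration of the splitting step, here is a minimal Python sketch. It is not part of rasusa; the helper names `open_text` and `split_fastq` and the `.partN.fastq` output naming are made up for this example, and it assumes standard 4-line FASTQ records (no wrapped sequence lines). Reads are dealt out round-robin, so each part should hold roughly an equal share of the original coverage.

```python
import gzip
from itertools import islice
from pathlib import Path


def open_text(path, mode):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode + "t")
    return open(path, mode)


def split_fastq(in_path, out_prefix, n_parts):
    """Deal 4-line FASTQ records round-robin into n_parts uncompressed files.

    Assumes every record is exactly 4 lines. Returns the output paths.
    """
    out_paths = [Path(f"{out_prefix}.part{i}.fastq") for i in range(n_parts)]
    outs = [open(p, "w") for p in out_paths]
    try:
        with open_text(in_path, "r") as fin:
            i = 0
            while True:
                record = list(islice(fin, 4))  # next record, or [] at end of file
                if not record:
                    break
                if len(record) != 4:
                    raise ValueError("truncated FASTQ record")
                outs[i % n_parts].writelines(record)
                i += 1
    finally:
        for fh in outs:
            fh.close()
    return out_paths
```

Each part could then be subsampled in its own process and the results concatenated. If the parts are about equal, the per-part target would be roughly the overall target coverage divided by the number of parts; that is an assumption of this sketch rather than something rasusa guarantees. Note that the split itself still reads the input in a single thread, so it only pays off if the same file is subsampled more than once or the subsampling step is the slow part.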

Another suggestion: I suspect most of the runtime is spent (de)compressing the data. Switching to zstd instead of gzip should drastically reduce the time spent on decompression.
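One way to check the suspicion about decompression is to time how long it takes just to read through the gzip file and throw the output away. The sketch below does that with Python's standard `gzip` module; `time_gzip_read` is a made-up name for this example. Python's decoder is not the one rasusa uses, so treat the number only as a rough indication, and try it on a smaller sample of the data first.

```python
import gzip
import time


def time_gzip_read(path, chunk_size=1 << 20):
    """Decompress the whole file, discarding the output.

    Returns (decompressed_bytes, elapsed_seconds).
    """
    total = 0
    start = time.perf_counter()
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

If the elapsed time scales up to a large share of the two-day runtime, recompressing the file once with zstd is more likely to be worth the effort.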