dincarnato/RNAFramework

rf-count multiprocessing

Closed this issue · 16 comments

I wonder if the multiprocessing for rf-count is working properly.

I tried running rf-count with either -p 20 or -p 4 -wt 5, but I never saw it using 20 cores. It took about 6 hours to go through ~100 million mapped reads in a pre-sorted BAM file.

Is this normal?

Hi coffeebond,

the -wt option in rf-count only affects BAM file sorting.
Unfortunately, right now, reading and processing of the BAM file is done on a single thread. Making it multithreaded would require a complete rewrite, something we plan to do, but that won't happen any time soon.
I'm sorry.
Best,

Danny

I see. I thought it used different cores for examining the alignments and finding mismatches.

Thanks for the explanation.

Not at present, but it will in the future.
All the other modules are well optimized for multithreading; rf-count is the one that still needs a speed-up.

One last note. A possible workaround would be to split your BAM file into 20 chunks (I assume you have 20 processors), process them all in parallel with a single rf-count call (-p 20), then merge the resulting RC files with rf-rctools merge, along these lines:
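Roughly something like this (just a sketch: file names, the splitting step, and the -f/-o options are placeholders to adapt to your setup; check rf-count --help and rf-rctools --help for the exact options on your install):

```bash
# Sketch only: sample.bam, reference.fa and the output directory are placeholders.
N=20
samtools view -H sample.bam > header.sam

# 1. Distribute the alignments round-robin over N chunks; within each chunk
#    the original (sorted) order of the reads is preserved.
samtools view sample.bam | split -n r/$N -d - chunk_
for f in chunk_??; do
    cat header.sam "$f" | samtools view -bS -o "$f.bam" -
    rm "$f"
done

# 2. A single rf-count call over all the chunks; with -p 20 the chunks are
#    processed in parallel, one per processor.
rf-count -p $N -f reference.fa -o counts_split chunk_??.bam

# 3. Merge the per-chunk RC files (comma-separated, no spaces), assuming
#    rf-count names each RC file after its input BAM.
rf-rctools merge $(ls counts_split/chunk_*.rc | paste -sd, -)
```

The merge step then combines the per-transcript counts from all chunks into a single RC file.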

Interesting thought; I guess that's something I can do to make this step much faster.

Sorry to re-open this issue.

I assume "rf-count" still cannot use multiprocessing to speed up this step. I followed your suggestion, splitting the BAM files into 40 chunks and using "rf-count" to process each chunk.

However, merging the 40 RC files with "rf-rctools merge" also takes a very long time. Do you know if this is normal? Each of my RC files is about 200MB. The merged RC file had only grown to about 100KB after 15 min. Assuming the final RC file is of a similar size, this will take about 500 hours to finish...

Unfortunately not. I do not have much bandwidth to implement that at the moment, but it's definitely on my to-do list.
I am not sure why this is so slow on your system. 200MB RC files are quite large, though... what's in those files? How many transcripts? The bottleneck is, unfortunately, disk read/write speed.

I just tried on my system with 40 RC files of 40MB each. Merging took less than 5 min (it writes something like 150 kB/sec). It does look like the problem might be your drive's read/write speed.

Thanks for checking.

I can double-check my code, but I highly doubt it's my drive's read/write speed.

Do you think it has something to do with the number of entries? I have ~500,000 chromosomes/contigs in my reference, and each RC file has ~1.4 million lines.

If I use "rf-rctools view" on each file and merge the count data for all 40 RC files myself, it takes about 1 hour to complete; each file takes about 70-80 seconds to process.

500,000 entries is really a lot. The problem is that the merge has to perform 500,000 × 40 = 20 million read operations plus 500,000 writes, and that's the bottleneck.

I have a possible workaround in mind, but it would require you to rerun rf-count.

I will work on this tomorrow and update you asap.

Hi @coffeebond,

I was able to recreate your scenario. You mentioned 200MB RC files with 500,000 entries.
For a file of that size, your sequences must be, on average, around 50 bp long. Is that correct?

Indeed, the issue was the very high number of seek() operations. I have now made this process significantly more efficient: there is still a lot of reading/writing involved, but 40 files now merge in ~30 min.

Can you please git pull and let me know if this solves the issue?

Best,
Danny

Hi Danny,

I just pulled.

Did you change the command syntax? When I used the same command as yesterday, I got the error [!] Error: Provided RCI index file does not exist. The files definitely exist. The RCI files are still separated by commas with no spaces, right?

Hi @coffeebond,

yes, to make this more efficient, the program now expects RC files with the exact same structure, so no index file is required anymore.
I assume this is the case for your files, since they were all generated using the same reference, so they should all be identical in structure.
Just remove the -i parameter and try again.
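For example (file names here are just placeholders):

```bash
# No -i / RCI index anymore: just pass the RC files, comma-separated, no spaces
rf-rctools merge chunk_00.rc,chunk_01.rc,chunk_02.rc
```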

Danny

Hi Danny,

I tried that and it worked. It took about 25 min to finish the merge job.

Thanks!

Glad we fixed this, and thanks for reporting!