Number of COBS threads must be proportional to the RAM usage

Question

leoisl opened this issue 2 years ago · 1 comments

On a Linux machine with 6 cores and 16 GB of RAM, we get to 55% of the match part (171 of 309 steps (55%) done) in 1h 12 mins. The remaining 45% of the steps require ~6h. This is with load_complete: True. The main limitation is RAM. After these 55% are done, we are left with very large COBS indexes (ranging from 7 to 10.5 GB, + 1.5 GB if loading from RAM), which means that on a 16-GB machine, which is the common configuration, these are executed serially, with no parallelism at all.

I tried to use smaller batches (see #203) and managed to get the runtime down from 7.85h to 6.5h, but this is definitely not the speedup I was expecting.

Another approach is to make the number of COBS threads vary with the RAM usage. The main idea is: if a job uses 80% of the RAM given to the pipeline, it must use 80% of the cores. This way, heavy jobs get more cores and light jobs get fewer cores. This would solve the issue I was experiencing, where a single COBS job runs with 1 core and takes most of my RAM while the other cores sit idle. Note that it is not useful to simply set COBS threads to a high value, as this limits parallelism: even very light COBS jobs would get lots of cores.
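
To make the proposal concrete, here is a minimal sketch of the proportional rule in Python. The function name `cobs_threads` and its parameters are hypothetical and not part of the pipeline. Rounding down and giving every job at least 1 thread are my assumptions, not something decided in this issue.

```python
import math


def cobs_threads(job_ram_gb: float, total_ram_gb: float, total_cores: int) -> int:
    """Hypothetical helper: give a job the same share of cores as its share of RAM."""
    if total_ram_gb <= 0 or total_cores < 1:
        raise ValueError("total_ram_gb must be > 0 and total_cores must be >= 1")
    # Share of the pipeline's RAM budget this job needs, capped at 100%.
    ram_fraction = min(job_ram_gb / total_ram_gb, 1.0)
    # Round down so the core shares never add up to more than the machine has
    # (assumption), but always give the job at least one thread.
    return max(1, math.floor(ram_fraction * total_cores))


if __name__ == "__main__":
    # Machine from this issue: 6 cores, 16 GB of RAM.
    for job_ram_gb in (1.0, 8.5, 12.0):
        print(job_ram_gb, cobs_threads(job_ram_gb, total_ram_gb=16, total_cores=6))
```

With the 6-core, 16-GB machine above, a 12 GB job (10.5 GB index + 1.5 GB) would get 4 threads, an 8.5 GB job (7 GB index + 1.5 GB) would get 3, and a 1 GB job would get 1.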

Answer 1 · 2022-11-10T18:43:01.000Z

I'm going to try something related tomorrow (2 threads for the plasmid db). Thanks for the suggestion.