zheminzhou/PEPPAN

Neighborhood based paralog splitting does not finish

marade opened this issue · 2 comments

For ~200 bacterial genomes of ~6 Mb each, the neighborhood-based paralog splitting step alone is taking over 24 hours on a c5.2xlarge EC2 instance, while the previous steps finished in a timely fashion. Notably, CPU usage for the entire period is very low (less than 1%), while memory usage remains fairly constant at 40%, indicating some sort of non-CPU bottleneck.

Hi, thank you for the report. This is certainly much slower than in my tests. From your description, the bottleneck is most likely in the I/O.

PEPPA writes and reads a lot of data from the file system. This was not an issue in my tests, even when I used a mounted network drive, but I have not yet tested it on an AWS instance. I have updated PEPPA slightly to optimize its I/O performance. However, please do not expect too much.

Thanks, I appreciate the prompt support. Perhaps you could add some sort of debugging capability so that the issue can be isolated? I'm not eager to run something for hours and not get an answer.
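
As a rough illustration of the kind of debugging aid meant here, the sketch below times a step by both wall-clock and CPU time. It is not part of PEPPA; `step_timer` and the `report` dictionary are hypothetical names used only for this example. A step whose CPU fraction is close to zero is mostly waiting (on I/O, locks, or child processes) rather than computing.

```python
import time
from contextlib import contextmanager


@contextmanager
def step_timer(name, report):
    """Record wall-clock and CPU time for one step into the `report` dict.

    Hypothetical helper for illustration only. Note that
    time.process_time() covers only the current process, so work done
    in child processes is not counted as CPU time here.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        report[name] = {
            "wall_s": wall,
            "cpu_s": cpu,
            # Near 1.0: CPU-bound. Near 0.0: the step is mostly waiting.
            "cpu_fraction": cpu / wall if wall > 0 else 0.0,
        }


if __name__ == "__main__":
    report = {}
    # time.sleep() stands in for a step that waits without using the CPU.
    with step_timer("waiting_step", report):
        time.sleep(0.5)
    for step, stats in report.items():
        print(step, stats)
```

Wrapping each major step in such a timer and logging the result would show which step stalls and whether it is computing or waiting, without having to let a full run finish first.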