baoxingsong/AnchorWave

Performance issue in readSam() interacts badly with NFS

idavi-bcs opened this issue · 2 comments

In my environment, all files are stored on an NFS volume (AWS EFS, specifically). This seems to cause a certain amount of performance degradation, because AnchorWave loads genome sequences from disk on demand rather than caching them in memory, and so the code is frequently waiting on a read operation against the network file system.

However, the problem becomes severe when running the proali command, because readSam() reads the query sequence one base at a time from disk while parsing the query SAM file (see https://github.com/baoxingsong/AnchorWave/blob/master/src/service/TransferGffWithNucmerResult.cpp#L124). In my environment, reading the query SAM file for a typical corn genome takes approximately 40 hours, during which only 5-10% of a single CPU is utilized because of excessive I/O wait time. Most of that time is spent in repeated calls to getCharByPos() while processing match ('M') operations in the CIGAR string.
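
To make the access pattern concrete, here is a rough sketch of the kind of per-base loop I mean; the variable names and the getCharByPos() signature are assumptions for illustration, not the exact source:

```cpp
// Illustrative sketch only, not the actual AnchorWave code.
// For each match ('M') operation of length cLen, the query sequence is
// assembled base by base, so every loop iteration can become a separate
// small read against the FASTA file sitting on the network file system.
std::string matchSeq;
matchSeq.reserve(cLen);
for (long i = 0; i < cLen; ++i) {
    matchSeq += getCharByPos(queryGenome, queryChr, currentqueryPosition + i);
}
```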

I have found that by replacing the repeated calls to getCharByPos() with a single call to getSubsequence2(queryGenome, queryChr, currentqueryPosition, currentqueryPosition + cLen - 1), the I/O bottleneck is eliminated and reading the same SAM files takes only about 25 minutes. In my testing, this appears to generate the same sequence as getCharByPos() and causes no meaningful increase in memory usage.
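
For comparison, a sketch of the change I tested, using the getSubsequence2() call exactly as quoted above (the surrounding variable names are again assumptions for illustration):

```cpp
// Illustrative sketch of the proposed replacement: fetch the whole match
// block with a single range read instead of cLen per-base reads.
std::string matchSeq = getSubsequence2(queryGenome, queryChr,
                                       currentqueryPosition,
                                       currentqueryPosition + cLen - 1);
```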

If I've understood the issue correctly, I would suggest:

  • patching the readSam() function to use getSubsequence2() instead of getCharByPos()
  • updating the docs to warn users to avoid using files on a network file system if possible, especially genome FASTAs

Thank you so much, Dr. Davis, for this insightful message.
I do not have much experience with network storage. When the FASTA files are stored locally, reading a corn SAM file takes only a few minutes, or even less than a minute.
To generate alignments with good accuracy, we tuned the base-pair resolved dynamic programming algorithm with a large window size, which is 100 Kbp by default. The dynamic programming step therefore uses a lot of memory, so memory is the bottleneck. To increase the number of threads that can run in parallel, we tried to save memory everywhere. That is why AnchorWave loads genome sequences from disk on demand rather than caching them in memory.

I would like to update the docs to warn users to avoid using files on a network file system. If you or your colleagues do not have enough local storage but do have a large amount of memory, an alternative is to copy the FASTA files to the in-memory file system at /dev/shm.

Dr. Song, thank you so much for the amazingly rapid response! I see you've already updated the docs.

Your suggestions are good ones. Alas, the environment that I work in does not have fast local disks, and I do not have sufficient permissions to increase the size of /dev/shm. I believe that many researchers will find themselves in the same situation, depending on the details of the university cluster they use or the cloud environments they work in.

Given that, I hope you will consider merging my pull request #50. For me at least, it makes the difference between AnchorWave being usable and unusable. It should improve performance for everyone. And because it comes before the memory-intensive alignment portion of the algorithm, it will not increase peak memory utilization at all. Please do let me know if you see some other drawback to the PR that I didn't notice!