dotnetbio/bio

Speed bottleneck: Avoid re-parsing bam header when reading many intervals

RoboStephen opened this issue · 4 comments

I've got come code which parses many intervals from a bam file, with calls to the ParseRange() method. The results are correct, but slower than expected.

Reading through the source code (and running the profiler), it looks like the slowdown came from time spent re-parsing the bam header and index in each each call to ParseRange(). Most of my time was spent in BamParser.GetHeader() or BamIndexStorage.Read(). For the particular use case I'm working on, the overhead of re-reading and re-parsing is high. (My bam file has >10000 contigs. I'm parsing several thousand intervals. There are ~100 aligned reads for a typical interval, though some have far higher coverage).

I'd like to speed this up, preferably in a non-hacky way. I've got a change to my copy of the code that does the trick for my use case. The parser caches its I added an optional "useCaching" flag to the ParseRange() method (false by default). If the flag is set to true, the parser re-use the existing SAMAlignmentHeader and BAMIndex objects (if they exist - if they're null we parse as usual). With the current patch it's up to the caller to avoid setting this flag to true when switching between bam files, though that could be an easy sanity-check to add. (And of course, this caching would not be appropriate when parsing a bam file that has been updated on disk in between one ParseRange() call and the next). I can polish that up for a pull request if that sounds like a sane way to proceed and not too much of a corner case to worry about.

Sounds good! What's the best way to share the code changes? I tried submitting a pull request, but I don't think I have the permissions to push my local branch (cacheBamHeader) to github. For now, let me upload the diffs for the two affected files - that's better than nothing!

Diff.BAMParser.txt
Diff.BAMParserExtensions.txt

Also....it's a nitpick, but there's one L too many in the file name DotNetBio-Fulll.sln (should be Full instead of Fulll). As long as we're contemplating changes we could fix that too.

Update: Here are diffs for a new-and-improved fix (all tests pass, and a fix for issue #41 is incorporated as well)
MyDiffs.txt