HadoopGenomics/Hadoop-BAM

Optimize BAM split generation for cloud stores

Finding BAM split boundaries is currently slow for cloud stores like S3 and GCS. The goal of this issue is to characterize the problem and implement fixes (e.g. finding splits in parallel on the client).
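For reference, a minimal sketch of what client-side parallel split probing might look like (the names here are hypothetical; in Hadoop-BAM the actual per-offset boundary search corresponds to the `BAMSplitGuesser` logic):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Sketch: probe candidate split offsets concurrently instead of
 * sequentially, so the latency-bound cloud-store reads overlap.
 * BoundaryGuesser is a hypothetical stand-in for the per-offset
 * record-boundary search (BAMSplitGuesser in Hadoop-BAM).
 */
public class ParallelSplitFinder {

  interface BoundaryGuesser {
    /** Returns the first real record start at or after candidateOffset. */
    long findNextRecordStart(long candidateOffset) throws Exception;
  }

  static List<Long> findSplits(long fileLength, long splitSize,
                               BoundaryGuesser guesser, int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Long>> futures = new ArrayList<>();
      for (long off = 0; off < fileLength; off += splitSize) {
        final long candidate = off;
        futures.add(pool.submit(() -> guesser.findNextRecordStart(candidate)));
      }
      List<Long> splits = new ArrayList<>();
      for (Future<Long> f : futures) {
        splits.add(f.get());
      }
      return splits;
    } finally {
      pool.shutdown();
    }
  }
}
```

(In practice you'd also want to de-duplicate adjacent guesses that resolve to the same record start, and each guesser needs its own seekable stream.)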

Two important bits of spark-bam that deal with this, fwiw:

  1. computing splits on workers, in parallel (cf. diagrams)
  2. using a block-LRU-caching InputStream/channel abstraction (a rough sketch follows below)
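For the second point, spark-bam's implementation is Scala; a rough Java sketch of the block-LRU-caching idea (the class name, block size, and eviction policy here are all assumptions) might look like:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of a block-LRU-caching reader over a SeekableByteChannel.
 * Fixed-size blocks are fetched from the underlying (e.g. S3/GCS-backed)
 * channel and the most recently used ones are cached, so the many small,
 * overlapping reads done while probing for record boundaries don't each
 * become a round-trip to the object store.
 */
public class BlockCachingChannel {
  private final SeekableByteChannel inner;
  private final int blockSize;
  private final Map<Long, byte[]> cache;

  public BlockCachingChannel(SeekableByteChannel inner, int blockSize, int maxBlocks) {
    this.inner = inner;
    this.blockSize = blockSize;
    // Access-ordered LinkedHashMap evicting the least-recently-used block.
    this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > maxBlocks;
      }
    };
  }

  /** Reads up to dst.remaining() bytes starting at absolute position pos. */
  public int read(ByteBuffer dst, long pos) throws IOException {
    int total = 0;
    while (dst.hasRemaining() && pos < inner.size()) {
      long blockIdx = pos / blockSize;
      byte[] block = cache.get(blockIdx);
      if (block == null) {
        block = fetchBlock(blockIdx);
        cache.put(blockIdx, block);
      }
      int off = (int) (pos - blockIdx * blockSize);
      int n = Math.min(dst.remaining(), block.length - off);
      if (n <= 0) break; // past EOF within the final block
      dst.put(block, off, n);
      pos += n;
      total += n;
    }
    return total;
  }

  private byte[] fetchBlock(long blockIdx) throws IOException {
    long start = blockIdx * (long) blockSize;
    int len = (int) Math.min(blockSize, inner.size() - start);
    ByteBuffer buf = ByteBuffer.allocate(len);
    inner.position(start);
    while (buf.hasRemaining() && inner.read(buf) >= 0) {
      // keep reading until the block is full or EOF
    }
    return buf.array();
  }
}
```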

I guess another thing worth adding here is that I had to guard against unreasonably large memory allocations in BAMRecordCodec: at non-record-start positions, the first 4 bytes of the candidate BAM record are arbitrary data, but they get interpreted as a 4-byte int, and an array of that many bytes is allocated.

Without optimizing around that, evaluating hadoop-bam's guessing logic at every position in a file often slowed to a crawl, seemingly in parts of files where the 4-byte windows tended to decode to large integers: each checked virtual position then triggered a huge bogus-sized allocation, and the resulting memory pressure caused slowdowns.

Here's some relevant code in a BAMRecordCodec shim that I wrote for this reason.
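A minimal, self-contained sketch of the core check (not the shim itself; the bounds below are assumptions, based on a BAM alignment record's fixed fields occupying 32 bytes, with the upper cap a tunable guess) looks something like:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

/** Hypothetical guard: validate a candidate BAM record length before allocating. */
public final class RecordLengthGuard {
  // A real BAM record body is at least 32 bytes of fixed-size fields.
  private static final int MIN_RECORD_LENGTH = 32;
  // Assumed upper bound; anything larger at a candidate position is
  // almost certainly garbage read from a non-record-start offset.
  private static final int MAX_RECORD_LENGTH = 10 * 1024 * 1024;

  /**
   * Reads the 4-byte record length and the record body, throwing
   * instead of allocating an implausibly large array.
   */
  public static byte[] readRecord(DataInputStream in) throws IOException {
    int b0 = in.read(), b1 = in.read(), b2 = in.read(), b3 = in.read();
    if (b3 < 0) {
      throw new EOFException("truncated record length");
    }
    // BAM's block_size field is little-endian; DataInputStream.readInt is
    // big-endian, so assemble the int by hand.
    int recordLength = b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
    if (recordLength < MIN_RECORD_LENGTH || recordLength > MAX_RECORD_LENGTH) {
      throw new IOException("implausible BAM record length: " + recordLength);
    }
    byte[] record = new byte[recordLength]; // bounded, safe allocation
    in.readFully(record);
    return record;
  }
}
```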

Thanks for the info @ryan-williams! That's very helpful.