Index loading
keiranmraine opened this issue · 14 comments
I'm finding it takes ~25 minutes to load the various components of the indexes; without `-7` this is only a couple of minutes.
The loading of the core reference files up to the following message runs at ~100% CPU:

```
* Reading reference genome..
* Binary seq file = /home/kr525/rds/hpc-work/data/ref/Homo_sapiens_assembly38.fasta.0123
* Reference genome size: 6434693834 bp
* Done reading reference genome !!
```
The following section runs at 5-15% CPU, indicating disk wait:

```
[M::memoryAllocLearned::LEARNED] Reading kmer SA index File to memory
[M::memoryAllocLearned::LEARNED] Reading ref2sa index File to memory
```
Is there anything obvious relating to the file reading that could account for this?

I expect it's unrelated, but I did notice that `ref.suffixarray_uint64` can be compressed with `gzip -1` for a ~50% reduction in size. The decompression cost at level 1 is negligible compared to the disk latency (and will be more cost effective on systems with IOPS accounting).
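As a quick sanity check of that compression claim, here is a sketch using Python's `gzip` module on synthetic data (the actual ratio obviously depends on the real index contents):

```python
import gzip

# Synthetic stand-in for a suffix-array file: little-endian uint64-like
# data with mostly-zero high bytes compresses well even at level 1.
data = b"".join(i.to_bytes(8, "little") for i in range(100_000))

compressed = gzip.compress(data, compresslevel=1)  # equivalent of gzip -1
assert gzip.decompress(compressed) == data         # lossless round-trip
print(f"compressed to {len(compressed) / len(data):.0%} of original size")
```

Level 1 trades a little ratio for much faster (de)compression, which is the right trade when the bottleneck is disk throughput or IOPS quota rather than CPU.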
This is a good point. It is true that it takes a long time to load all the indexes with low disk I/O. I assume you might be loading the indexes from an HDD, where the read speed is typically around ~100MB/s. The problem is that the file is simply too big, so it takes time to load. With an SSD (~500MB/s), it can be loaded within ~4 minutes.

We thought about solutions to this problem before; below are the options we have so far.
1) Load only the pos_packed file (~29GB) and build possa_packed (78GB) and the inverse suffix array (29GB) at startup.
- We tested this option before, and it turns out the speed is similar to loading from SSD (building at startup time is fast).
- It is an in-memory operation and does not require extra memory space while doing it.
- With your current disk read speed, it might still take ~8-9 minutes (reading the 29GB file + building the other indexes).
- It is not enabled in the current codebase. (Edit: enabled as the default index loading in the master branch and the v1.0.4 release.)
2) Rely on the OS cache (the OS caches files read or written in free memory).
- This works when the user has a large amount of memory and reruns BWA-MEME sequentially.
3) Depending on the workload and disk type, mode 2 or mode 1 can be faster.
4) Store the index files somewhere with higher disk I/O.
If you want to try solution 1), I can provide it within a few (2-3) days. (Edit: this option is enabled as the default index loading in the master branch, since v1.0.4.)
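For intuition on why the startup-time build in option 1) is fast: given a suffix array, the inverse suffix array (the ref2sa direction) is a single linear scan over memory. A minimal Python sketch of the idea (not the actual BWA-MEME code, which works on packed binary files):

```python
def build_inverse_suffix_array(sa):
    """isa[pos] = rank: where the suffix starting at pos sits in sa."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank  # one write per position, purely in-memory
    return isa

# Tiny demo: suffix array of "banana$" built naively by sorting suffixes.
text = "banana$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
isa = build_inverse_suffix_array(sa)
assert all(isa[sa[rank]] == rank for rank in range(len(sa)))
```

Because the build is a sequential in-memory pass, it is bounded by memory bandwidth rather than disk throughput, which is why it can beat reading a pre-built 29GB file from a slow disk.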
I think (1) is the best option to allow cloud solutions to work, where you have two factors against you:
- IOP throttling: heavy blocks of reads exhaust the quota
- Instances are generally destroyed on completion of a job

I'm currently running this on a university cluster, so I was expecting better disk performance than 100MB/s. At Sanger, Lustre was exceptionally fast; I'll follow up with UoC to find out if there's something I'm missing here.
I'd like to share the numbers I got with option 1.

Tested on a 150MB/s HDD, Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz:
- RMI models: 54 sec (~10GB)
- pos_packed file: 191 sec (29GB)
- Build index in-memory: 86 sec (32 threads)
- Total: 5 min 30 sec

Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz:
- Build index in-memory: 43 sec (48 threads)
If you want to test it right now, the code is available in the dev branch (you don't need to rebuild the index, just recompile the code):

```shell
# recompile the code
make -j 32
# run alignment with runtime build
bwa-meme mem -7 -t 32 <reference> <fasta>
```
At startup, bwa-meme will read the .pos_packed file and build other indexes based on it.
The in-memory index building is an alternative to reading 77GB of indices.
By the way, I found that bwa-meme alignment speed seems to be moderately higher using mimalloc, even with a small number of threads, while bwa-mem2 is only marginally improved. I will share the numbers here soon.
This looks good. Rerunning this with some sample data (2x human genome, ~5.2GB of fastq input), I now get close to the run time of bwa-mem2 (550s vs 580s). Previously the best I could do was ~170s longer (best of 5; I suspect network congestion plays a part, as runs range from 700s-2000s).
I will note that, excluding the loading of the reference, the sum of the `Processed 86958 reads in 3.702 CPU sec` values indicates good performance, so the impact of mimalloc will be interesting to compare too:
- bwa mem - 7013s (legacy)
- bwa-mem2 - 4349s
- bwa-meme - 2342s

```shell
grep 'reads in' job.out | perl -ne 'm/(\d+\.\d{3})/;$s+=$1;END{print qq{$s\n};};'
```
The `real sec` component sum:
- bwa mem - 245s (legacy)
- bwa-mem2 - 170s
- bwa-meme - 97s

```shell
grep 'reads in' job.out | perl -ne 'm/,\s+(\d+\.\d{3})/;$s+=$1;END{print qq{$s\n};};'
```
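For anyone who prefers Python to a Perl one-liner, an equivalent sketch (assuming the `Processed N reads in X CPU sec, Y real sec` log format that the regexes above match against):

```python
import re

def sum_times(log_lines):
    """Sum the CPU-sec and real-sec fields from 'reads in' log lines."""
    cpu = real = 0.0
    for line in log_lines:
        m = re.search(r"reads in (\d+\.\d+) CPU sec,\s*(\d+\.\d+) real sec", line)
        if m:
            cpu += float(m.group(1))
            real += float(m.group(2))
    return cpu, real

lines = ["[0000] Processed 86958 reads in 3.702 CPU sec, 1.110 real sec"]
print(sum_times(lines))  # (3.702, 1.11)
```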
(edited previous comment to add legacy bwa counts and fix reg-exps for longer times)
I ran some benchmarks with mimalloc; below are the results.

Env: Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz, 16 physical cores (32 vCPUs)
Dataset: single-end ERR194147_1.fastq
Config: 32 threads

Findings:
- mimalloc does not require much extra memory.
- Performance can be affected even with a small number of threads (32 in this case). On another machine with 48 physical cores, performance did not degrade much using 48 threads.
- It would be better to use mimalloc in general.
Results:

Alignment (without disk IO) | Mimalloc | No mimalloc |
---|---|---|
BWA-MEME | 2518 (sec) | 3122 |
BWA-MEM2 | 3533.25 | 3829.96 |

Seeding | Mimalloc | No mimalloc |
---|---|---|
BWA-MEME | 949 (sec) | 1428 |
BWA-MEM2 | 1970 | 2117 |

Memory (GB) | Mimalloc | No mimalloc |
---|---|---|
BWA-MEME | 156.8 | 152.5 |
BWA-MEM2 | 49.1 | 45.5 |

BSW | Mimalloc | No mimalloc |
---|---|---|
BWA-MEME | 1143 (sec) | 1207 |
BWA-MEM2 | 1161 | 1230 |
- Max resident memory measured with `/usr/bin/time -v`.
This looks like a great addition again. Do you have an idea of when it will be possible to get this pushed out to conda?

I'm currently looking at pipelines incorporating the various flavours of `bwa-mem[2,e]?`, and I'm pretty sure that you'll need an instance of mbuffer between bwa-meme and `samtools fixmate | samtools sort`, as that pairing is already close to flat out with bwa-mem2 on 32 threads. It will be worth noting this in the README, because at the speeds being achieved it isn't simply a drop-in replacement (unless no pipes are used). If people want to use bwa-postalt.js, it's better to write to disk (lz4 compressed; gzip is too slow and results in blocking) and start a lower-resource job.
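Why an intermediate buffer helps can be illustrated with a toy discrete-time simulation (hypothetical numbers; this models the generic producer/consumer stall pattern, not samtools internals): a consumer that periodically stalls to flush data blocks the producer unless the buffer between them can absorb the stall.

```python
def producer_blocked_ticks(capacity, n_items=1000, flush_every=50, flush_ticks=20):
    """Toy model: ticks the producer (aligner) spends blocked on a full pipe.

    The consumer (sorter) normally drains up to 2 items per tick, but after
    every `flush_every` items it stalls for `flush_ticks` ticks writing its
    batch out, consuming nothing. The pipe/buffer holds `capacity` items.
    """
    buf = blocked = produced = consumed = stall = 0
    while produced < n_items:
        if stall:                 # consumer busy flushing to disk
            stall -= 1
        else:
            for _ in range(2):    # consumer is faster than the producer on average
                if buf:
                    buf -= 1
                    consumed += 1
                    if consumed % flush_every == 0:
                        stall = flush_ticks
                        break
        if buf < capacity:        # producer writes one item if there is room
            buf += 1
            produced += 1
        else:
            blocked += 1          # this is where the aligner's write would hang
    return blocked

small = producer_blocked_ticks(capacity=2)
large = producer_blocked_ticks(capacity=32)
assert large < small  # a buffer sized to cover a flush burst absorbs the stalls
```

This is the job mbuffer does in a real pipeline: it sits between the two processes and smooths out the consumer's bursty stalls so the producer keeps all its cores busy.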
We will update the mimalloc default and the README within a few days! I'll note it here as soon as it's updated.

That's an interesting point, thank you for sharing your tips.

My PR to the Bioconda package was merged a few hours ago; BWA-MEME now uses mimalloc by default (also in the master branch).
I'll run some tests and give findings on streaming data into samtools etc.
1.0.4, Intel Skylake, 32 threads, 200GB RAM allocated via the scheduler. 30X WGS from paired fastq; bwa processing includes the presence of `*.fasta.alt`.
Real world use case:
```shell
#!/bin/bash -ue
rm -f sorttmp*
set -o pipefail
bwa-meme mem -7 -K 100000000 -t 32 \
  -R '@RG\tID:1\tLB:NA24385_lib\tPL:ILLUMINA\tSM:NA24385\tPU:R1_L1' \
  Homo_sapiens_assembly38.fasta NA24385-30X_1.fastq.gz NA24385-30X_2.fastq.gz \
| samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
| samtools reheader -c 'grep -v ^@SQ > tmp-head && cat Homo_sapiens_assembly38.dict tmp-head' - \
| samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 32 - > sorted.bam
```
ALG | CPU(s) | WALL(s) | REAL | %CPU | peak_rss | peak_vmem |
---|---|---|---|---|---|---|
bwa | 305546.745 | 9966.037 | 3h 16m 15s | 2655.9% | 43.5 GB | 48.3 GB |
mem2 | 184099.304 | 6197.363 | 2h 6m 13s | 2520.8% | 60.3 GB | 78.4 GB |
meme | 137476.289 | 5272.686 | 1h 53m 43s | 2147.1% | 164 GB | 181.5 GB |
`CPU(s)` and `WALL(s)` use the pattern matches from the comment above. `REAL` is the wall time for the full pipeline above, as are the `peak_*` values.
As you can see, given the larger fraction of the runtime now attributable to the sort process described above, it is worth considering restructuring the process to minimise CPU wastage by introducing additional steps into the workflow.
Map with rapid compression:

```shell
#!/bin/bash -ue
set -o pipefail
bwa-meme mem -7 -K 100000000 -t 32 \
  -R '@RG\tID:1\tLB:NA24385_lib\tPL:ILLUMINA\tSM:NA24385\tPU:R1_L1' \
  Homo_sapiens_assembly38.fasta NA24385-30X_1.fastq.gz NA24385-30X_2.fastq.gz \
| lz4 -c1 > mapped.sam.lz4
```
Correct and sort (4 CPUs, 8GB):

```shell
#!/bin/bash -ue
set -o pipefail
lz4 -cd mapped.sam.lz4 \
| samtools fixmate -m --output-fmt bam,level=0 -@ 1 - - \
| samtools reheader -c 'grep -v ^@SQ > tmp-head && cat Homo_sapiens_assembly38.dict tmp-head' - \
| samtools sort -m 1G --output-fmt bam,level=1 -T ./sorttmp -@ 4 - > sorted.bam
```
This would be essential should you want to introduce bwa-postalt.js into the flow (it consumes bwa output at ~1/3 of the required speed). If not, the initial example is most likely sufficient: its ~15 min of extra run time on a 32 CPU instance for 30x data is mostly negated by the overheads of the split approach, namely the additional I/O, starting a new job with lower resource requirements, and the additional 40-60 minutes of run time there.
Thank you for your valuable tips and suggestions! We really appreciate it.

To summarize:
- The pipeline of (alignment, fixmate, reheader, sort) results in low CPU utilization (the bottleneck is the sorting operation).
- An alternative is to run the alignment alone (with the CPU fully utilized) and hand the compressed alignment output to a low-resource executor (i.e. a container with 4 CPUs, 8GB). But this results in a longer overall time, because of disk I/O + container startup time. I think the first option, pipelining the whole thing, is the best option in general. I will note this result in the README within a few days.
- Improving sorting throughput looks necessary for the bwa-postalt.js pipeline. I will look into it and maybe come up with a solution in the future.
FYI, bwa-meme has a peak efficiency of ~2500 (%CPU) when using 32 cores piped as `bwa-meme ... | lz4 -c > mapped.sam.lz4`. bwa-postalt.js consumes the raw SAM output from bwa.

Edit: samtools sort has higher throughput in general; using mbuffer (with the same amount of buffer as used for samtools sort) would resolve the "write hang" issue described below.
Hi, I recently looked into the samtools sort code and found something that might improve the bottleneck in the sorting step.

I found there is a "write hang" problem when using a large memory buffer for samtools sort pipelined with BWA-MEME. samtools sort works in two phases: phase 1 runs concurrently with BWA-MEME, and phase 2 runs after the BWA-MEME alignment has finished.

Phase 1. Prepare small temporary bins. For the given memory buffer (set by the -m and -@ arguments), samtools sort reads input and fills the buffer. When the buffer is full, it divides the buffer by the number of threads, sorts each piece, and each thread writes a temporary file (the last processed bins are kept in memory without being written to files).

Phase 2. Open all temporary bins (files from disk plus the in-memory bins) and merge-sort them using a heap (repeatedly: write output, read from temp files, re-heap) in a single thread (compression is multi-threaded).
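The phase 2 merge of sorted bins is essentially a k-way heap merge. In Python terms (a sketch of the idea only, not samtools' C implementation):

```python
import heapq

# Each "bin" is a sorted run of records (here, just sort keys) that
# phase 1 produced; phase 2 streams them into one sorted output.
bins = [
    [101, 450, 902],      # temp file 1
    [77, 300, 888, 999],  # temp file 2
    [250, 601],           # still-in-memory bin
]
merged = list(heapq.merge(*bins))  # single-threaded k-way merge via a heap
assert merged == sorted(sum(bins, []))
```

The merge itself touches one record at a time per bin, which is why this phase is single-threaded and bounded by how fast the temp files can be read back and the output written.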
The "write hang" happens in BWA-MEME while samtools sort is writing temporary files (phase 1). During the writing stage, samtools sort does not read its input, which causes BWA-MEME's write call to hang. In particular, when a large memory buffer is used for samtools sort (e.g. 20GB = -m 1G x -@ 20), BWA-MEME periodically waits for samtools sort to flush the entire memory buffer to temporary files (compressing and writing 20GB to disk).
A simple solution is to use a small memory buffer (-m 40m -@ 10 on my machine), but this can affect the time in phase 2, where using large memory can reduce disk I/O. However, it can still improve the overall CPU usage in certain cases.
Below are my experiment results (32-thread BWA-MEME, 20x paired-read alignment, machine with ~160MB/s HDD disk I/O).
- Using a small memory buffer improves the "write hang" problem; this hides the samtools sort phase 1 time inside the BWA-MEME pipeline.
- A large memory buffer can benefit samtools sort phase 2.
(seconds) | Time taken for disk I/O in MEME and samtools sort phase 1 | Time taken for samtools sort phase 2 |
---|---|---|
meme only | 151.78 | |
meme + Sort -m 2G -@ 10 | 1195.08 | 3047 |
meme + Sort -m 40m -@ 10 | 308.26 | 2954 |
meme + Sort -m 2G -@ 20 | 1035.83 | 2276 |
meme + Sort -m 40m -@ 20 | 402.19 | 3091 |
- Used a command like:

```shell
bwa-meme mem -t 32 -7 ... | samtools sort -m 2G -@ 20 -l1 -o ~/tmp/out.bam -
```