mozack/abra2

Expected memory requirements

ionox0 opened this issue · 7 comments

We have been using 60GB of ram for ABRA2 on deeply-sequenced bams (20,000x total, 1,200x unique, ~10 GB in compressed size).

We would like to be able to realign as many bams together as possible, and have made attempts with 4-6 of these deeply-sequenced bams together, which has caused some memory issues. Is it reasonable to expect memory to scale linearly with the number of bams to be realigned? If there are any benchmarking results for such tests we would be interested to see them.

Thank you kindly

That's a good question, and unfortunately I do not have a direct answer for you, as I have not attempted to run ABRA2 at that depth or with that number of BAMs. We see depths up to 2,000x and have not gone beyond 3 input BAMs at a time.

Can you fill me in a bit more on what you are trying to do? Are the co-realigned samples related (i.e. from the same patient over different time points)? Is there a minimum VAF you are attempting to detect? Is this amplicon sequencing?

Thank you kindly for the information. To answer your questions, yes these are different time points, we are attempting 0.1% VAF for detection, and yes it is amplicon sequencing. Will let you know if we are able to have success with 4+ BAMs. We have previously had to use up to 120G and with much longer runtimes (up to ~10 hours), which is not sustainable for our purposes. Edit: 120G reserved, but haven't checked how much was actually used.

OK. For amplicon sequencing I suggest disabling assembly and using the consensus sequence for detection of indels from soft-clipped reads. Params:
--sa --cons

If your reads are noisy, you may wish to experiment with the mismatch rate for mapping reads back to contigs:
--mmr 0.1

Also, regions with average read depth >1000 are downsampled by default; this limit is controlled with the param:
--mad
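Putting those flags together, a hypothetical amplicon-mode invocation might look like the sketch below. The file names, heap size, thread count, and the --mad value of 5000 are all placeholders for illustration, not recommendations:

```shell
# Sketch only: BAM/reference/BED names, heap size, and --mad cap are illustrative.
# --sa disables assembly, --cons enables consensus-based detection,
# --mmr loosens the contig-mapping mismatch rate for noisy reads.
java -Xmx64G -jar abra2.jar \
    --in tp1.bam,tp2.bam,tp3.bam \
    --out tp1_abra.bam,tp2_abra.bam,tp3_abra.bam \
    --ref hg38.fa --targets amplicons.bed \
    --sa --cons --mmr 0.1 --mad 5000 \
    --threads 8
```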

If you have example data that you are able to share, I may be able to assist in optimizing for both detection and also computational burden.

Hi Lisle!

I am also having issues with running out of memory after a few hours. I am running on three human exomes with:

```
java -jar abra2-master/temp/abra-0.94c.jar --in cancer1.bam,normal.bam,cancer2.bam \
    --out cancer1_abra.bam,normal_abra.bam,cancer2_abra.bam \
    --ref resources/reference/hg38/hg38.fa \
    --targets hg38exons.bed --threads 5 --working tempABRA \
    --no-debug 1> testRun.out.log 2> testRun.err.log
```

The machine had around 100GB memory available. What is expected from a single exome, or three matched exomes?

None of the options you mention seem to be present in the version I'm running (0.94c), so I'm not sure how to decrease memory use. Would it help if I send you the logs? Would decreasing the thread count also decrease memory use?

Thanks, would be awesome to get this running!
/Christoffer

Hi, please specify the -Xmx param to set the Java heap size, as shown in the sample commands.

i.e.
java -Xmx32G -jar abra2.jar ...

The optimal memory size will vary based upon your data. 32GB is a good starting point if you are jointly realigning 3 exomes. Please keep in mind that the actual memory usage will be slightly higher than the size specified via the -Xmx option.
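Applied to the three-exome command above, that would look roughly like the following (a sketch: the jar file name stands in for whichever current release you download, and 32G is just the suggested starting point):

```shell
# -Xmx must come before -jar; it caps the Java heap, and total process
# memory will run somewhat above this figure.
java -Xmx32G -jar abra2-2.22.jar \
    --in cancer1.bam,normal.bam,cancer2.bam \
    --out cancer1_abra.bam,normal_abra.bam,cancer2_abra.bam \
    --ref resources/reference/hg38/hg38.fa \
    --targets hg38exons.bed --threads 5 \
    1> testRun.out.log 2> testRun.err.log
```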

Also, release 0.94c is several years old (pre-ABRA2). Please run using a recent release and let me know if you continue to have problems. The latest release is v2.22. See: https://github.com/mozack/abra2/releases

Eww, i just saw a .jar and used that one, didn't realise it was super old, sorry about that. 😬

I ran with the latest release, and it works fine now, correctly picking up two ITDs (4bps and 93bps) in my test case! Thanks, great tool! Now plugging into downstream variant calling!

R2fn commented

Hi! How can I monitor heap memory and other JVM params for abra2 via JMX?
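For what it's worth, remote JMX monitoring is enabled through standard JVM system properties rather than anything ABRA2-specific. A minimal sketch (the port is arbitrary, and disabling authentication/SSL is only reasonable on a trusted host):

```shell
# Standard JVM JMX flags; add them before -jar. Port 9010 is illustrative.
java -Xmx32G \
    -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=9010 \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false \
    -jar abra2.jar ...
```

You can then attach jconsole or VisualVM to localhost:9010 to watch heap usage and GC activity while ABRA2 runs.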