mozack/abra2

crash with java IndexOutOfBoundsException

wongs2 opened this issue · 11 comments

Tried several input configurations but keep hitting this error:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at abra.AltContigGenerator.getAltContigs(AltContigGenerator.java:273)
at abra.ReAligner.processRegion(ReAligner.java:1222)
at abra.ReAligner.processChromosomeChunk(ReAligner.java:339)
at abra.ReAlignerRunnable.go(ReAlignerRunnable.java:21)
at abra.AbraRunnable.run(AbraRunnable.java:20)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

Please provide a bit more info about your dataset and email the full log to lmose at unc dot edu.

If you're able to share a small bam file that reproduces the issue, that would be helpful.

The run is as follows: human genome build 38 + alt etc., whole-genome DNA, 150 bp paired-end reads at 30x coverage. The input is the reads mapped to chr1 and the targets are the exon regions in chr1. Mapping was done with BWA mem. I will fill in more info after rerunning with full logging.

java -Xmx10G -jar abra2.jar --in chr1.bam --out chr1-abra.bam --ref hs38DH.fasta --targets exon.bed --tmpdir tmp --log error --threads 6 > abra.log

Did some investigation with --log debug. After several tests, the fault seems to lie with this segment of chr1, which is included in the target file. With this segment removed from the target file, the run completes without the error.

chr1:30519165-30519566
TGATGATGATGGAGAGGATGCTGATGGGAAAGATGATGATGATGGAAAAGATGAGGAGGA
TGGTGATGATGAACAGGATAATGATGACGATAATGATGGAGAGGATGATGATGATGATGG
TGGTGATGGAGAGAATGATGACAAGGATGGGGATAATGGTGATGATGATGGTGGAGAAGA
TGATGATAAAGAGGATGATGATGGAGAGAATGATGATGAAGGAGAGAATGATGATGAACA
TGATGATGGAGATAATGATGATGGAAAGGATGATGATGGAGGTGATGATGACAGAGAGGA
TGATGACGATGATGATGGAAAAAATGATGATGATGAAGAAGATACTGATTATGGGGAAGA
TTATGATGATGGAGAGGATTATGATGGAGAAAATGATGTGAT
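For anyone wanting to repeat the workaround of dropping this segment from the target file, here is a minimal Python sketch. It is not part of ABRA2. It assumes the region string above is 1-based and inclusive (the 402 bases listed are consistent with that) and that the target file is a standard BED file with 0-based, half-open coordinates. The function names and the example intervals are made up for illustration.

```python
def parse_region(region):
    """Convert a 1-based inclusive 'chrom:start-end' string to 0-based half-open BED coordinates."""
    chrom, span = region.split(":")
    start, end = span.replace(",", "").split("-")
    return chrom, int(start) - 1, int(end)


def drop_overlapping(bed_lines, region):
    """Yield the BED lines that do not overlap the given region."""
    chrom, start, end = parse_region(region)
    for line in bed_lines:
        # Pass blank lines and BED header lines through unchanged.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            yield line
            continue
        fields = line.split()
        f_chrom, f_start, f_end = fields[0], int(fields[1]), int(fields[2])
        # Two half-open intervals overlap when each starts before the other ends.
        overlaps = f_chrom == chrom and f_start < end and f_end > start
        if not overlaps:
            yield line


# Hypothetical target intervals; only the middle one overlaps the segment.
targets = [
    "chr1\t30519000\t30519100\n",
    "chr1\t30519164\t30519566\n",
    "chr1\t30520000\t30520200\n",
]
kept = list(drop_overlapping(targets, "chr1:30519165-30519566"))
print("".join(kept), end="")
```

To filter a real file, pass an open file handle for the BED file instead of the `targets` list and write the yielded lines to a new file.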

Thanks for investigating. Are you able to share a BAM file containing the reads that overlap that region?

Thanks for sending. Unfortunately, I could not reproduce the issue here.

Regarding speed, we definitely see a speedup with ABRA2. In our most recent exome test for example, the timings are:

abra: 6922 seconds
abra2: 2861 seconds

The original ABRA implementation could not scale to WGS (with realignments happening over the entire genome).

ABRA2 also sorts the final output. For cases where the fraction of reads being realigned is much smaller than the total number of reads, I could see ABRA2 potentially running slower than ABRA because of this. Running only exonic regions against WGS may fit this category. You'd need to try running ABRA2 with the --nosort option to get an apples to apples comparison.

Lastly, ABRA2 parallelizes in 25 megabase chunks. ABRA's parallelization was much more fine grained. If you're processing only a single chromosome, the original ABRA may achieve better parallelization.
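To put rough numbers on the single-chromosome case, here is a small Python sketch of the arithmetic. It assumes fixed 25 megabase chunks laid out from the start of the chromosome and roughly equal work per chunk. Neither assumption is confirmed here, and the way ABRA2 actually places chunk boundaries may differ. The chr1 length is the GRCh38 value, and 6 threads matches the command earlier in this thread.

```python
import math

CHUNK_SIZE = 25_000_000    # 25 megabases, as stated above
CHR1_LENGTH = 248_956_422  # length of chr1 in GRCh38


def chunk_bounds(length, chunk_size=CHUNK_SIZE):
    """Split [0, length) into consecutive chunks of at most chunk_size bases."""
    return [(s, min(s + chunk_size, length)) for s in range(0, length, chunk_size)]


chunks = chunk_bounds(CHR1_LENGTH)
threads = 6
# If every chunk took the same time, the chunks would run in this many rounds.
rounds = math.ceil(len(chunks) / threads)

print(len(chunks), "chunks,", rounds, "rounds with", threads, "threads")
```

Under these assumptions chr1 yields 10 chunks, so with 6 threads the second round would keep only 4 threads busy. A shorter chromosome yields fewer chunks than threads once it drops below 6 chunks.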

Thanks. Will try out the new release 2.09.

One question: I am parallelizing the compute for WGS by running ABRA2 on individual per-chromosome BAM files. These files will contain discordant reads (the other end of the pair is mapped to another chromosome or is unmapped) whose mate is absent from the BAM file. Will this be an issue for ABRA2?

OK. How are you splitting the BAM files? Also, how did you generate the small BAM file you emailed?

Using sambamba view with either the chromosome name or the chromosome name plus a position range.

For convenience, I was testing the above with a BAM file that had previously been processed by ABRA as the input to ABRA2. This might have something to do with the error, although the reason is unclear to me.

Testing with another, older BAM that is direct output from BWA seems to work fine. Yes, the speedup is significant. Even targeting a whole chromosome takes a decent amount of time. Bravo!

Thanks for the feedback. FYI, a single read in the BAM file you sent had the read paired flag unset. I do not know whether sambamba view alters the bit flags. Samtools view does not. At present, all reads must be paired for ABRA2 to work properly. As long as the reads are not modified, I do not see a problem with processing by chromosome. I have yet to test this myself, however.
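For checking an input file for reads like this before running ABRA2, here is a minimal Python sketch. It is not part of ABRA2. It assumes the alignments are available as SAM text (for example, the output of samtools view) and relies only on the SAM specification, where FLAG bit 0x1 marks a read as paired. The function name and the example records are made up for illustration.

```python
PAIRED = 0x1  # SAM FLAG bit 0x1: read is paired (template has multiple segments)


def unpaired_reads(sam_lines):
    """Yield (read_name, flag) for every SAM record whose paired bit is unset."""
    for line in sam_lines:
        # Skip header lines (they start with '@') and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second mandatory SAM column
        if not flag & PAIRED:
            yield fields[0], flag


# Made-up records: r1 is paired (flag 99), r2 is not (flag 16).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t150M\t=\t300\t350\t*\t*",
    "r2\t16\tchr1\t120\t60\t150M\t*\t0\t0\t*\t*",
]
found = list(unpaired_reads(example))
print(found)
```

On a real file, the same function can read from standard input, for example by piping the output of samtools view into a script that calls `unpaired_reads(sys.stdin)`. An empty result would suggest every read has the paired flag set.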

Closing. Please re-open if you still see an issue.