mozack/abra2

crash with java IndexOutOfBoundsException

anso-sertier opened this issue · 26 comments

Hi,

I'm running ABRA2 on WGS data ([30,80]x samples) and I get this error multiple times, but each time on a different region.

INFO Wed Mar 07 01:29:48 CET 2018 PROCESS_REGION_MSECS: 1_23599001_23599401 1 0 0 0
ERROR Wed Mar 07 01:29:48 CET 2018 Error parsing assembled contigs. Line: [>1_121484601_121485001_21]
[...]
java.lang.ArrayIndexOutOfBoundsException: 4
	at abra.ScoredContig.convertAndFilter(ScoredContig.java:53)
	at abra.ReAligner.assemble(ReAligner.java:1096)
	at abra.ReAligner.processRegion(ReAligner.java:1262)
	at abra.ReAligner.processChromosomeChunk(ReAligner.java:342)
	at abra.ReAlignerRunnable.go(ReAlignerRunnable.java:21)
	at abra.AbraRunnable.run(AbraRunnable.java:20)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I saw a previous issue describing this error, but it was with version 2.9, and gkl is now disabled by default.

I'm using ABRA2 version 2.14 compiled on CentOS 6.2 with jdk1.8.0_162,
ABRA2 was launched with default options (except 8 threads), and normal and tumor BAMs were provided.
Alignments were made with BWA aln on GRCh37.
4 of the 8 samples have already crashed with this error, and the remaining 4 have been running for 7 days (8 CPUs + 30 GB RAM allocated).
Can you also tell me whether this running time seems reasonable for WGS, or whether I can speed it up (by removing centromere regions, for example)?

Thanks in advance,

Anne-Sophie

Those runtimes seem rather long. I would not expect centromeres to be the issue as low mapq reads should not contribute to contigs.

Could you please attach an entire log file containing the error?

Unfortunately, I don't see a smoking gun cause of the Exception. If you're able to share a BAM snippet that reproduces the problem, I'd be happy to take a look.

Regarding the long runtimes, it appears that an inordinate number of assemblies are being triggered. If you have adapter contamination, you may wish to trim first. Otherwise, you may wish to try running with the --sa option to disable the full-blown assembly.

Thanks a lot for your answer.
Regarding the running time, I have highly rearranged tumors. Only one of the samples tested has finished (in 12 days), and it is the least rearranged one. I have between 1 and 6 million raw positions with soft-clipped reads in the tumors and fewer than 2 million in the paired normal samples. This may explain the inordinate number of assemblies. QC does not show any adapter contamination.
I'm now testing with the --sa option. However, I was wondering how this option affects the ABRA process, given that ABRA does assembly. Can you give me some clues about this option?
Thanks a lot again

Unlike the original ABRA, ABRA2 does not rely exclusively on assembly. In our testing we see good results even with assembly disabled. Sensitivity for longer inserts will be impacted, but the overall results should still be good.

I have plans to investigate making the assembly triggers less frequent without negatively impacting sensitivity, however it may take me some time to get to this. In the meantime, the --sa option should be a reasonable workaround. Please let me know if you continue to run into problems.

Hi,

I am also having occasional crashes with a "java.lang.ArrayIndexOutOfBoundsException", on 13 out of 210 RNA-seq BAM files on which I launched ABRA2.

The command line I used is:
java -Xmx40g -jar abra2-2.14.jar --in XX.bam --out "XX_abra.bam" --ref ref_genome.fa --tmpdir . --threads 4 --index --junctions STAR.XX.SJ.out.tab --gtf ref_annot.gtf --sua --dist 500000

The crash is:

ERROR	Fri Apr 20 00:01:56 CEST 2018	Read buffer: [
[...]
]
java.lang.ArrayIndexOutOfBoundsException: 4
	at abra.ScoredContig.convertAndFilter(ScoredContig.java:53)
	at abra.ReAligner.assemble(ReAligner.java:1096)
	at abra.ReAligner.processRegion(ReAligner.java:1262)
	at abra.ReAligner.processChromosomeChunk(ReAligner.java:342)
	at abra.ReAlignerRunnable.go(ReAlignerRunnable.java:21)
	at abra.AbraRunnable.run(AbraRunnable.java:20)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Here is the complete log:
XX_abra.log.zip

Could it be a problem with the input BAM files? I will try re-aligning them just in case (I used read trimming with cutadapt, mapping with STAR 2-pass, and sorting with sambamba).

Thanks!

In both our cases, it looks like what is causing the error is that there is a contig assembled by ABRA2 that was not stored properly. Instead of having a contigString in the form
>chr1_59247401_59247801_1_score
AACAG...

it looks like
>chr1_59247401_59
(The first line is cut; in @anso-sertier's case, it is "1_121484601_121485001_21", which is cut just before the score; in my case it is even weirder because it is cut within the region name)

Thus the ABRA function convertAndFilter cannot parse it correctly: instead of finding the score as element 4 of the line starting with '>', it finds nothing there and throws an ArrayIndexOutOfBoundsException for element 4.

I guess a quick and dirty fix would be to have an option "ignore_parsing_errors" that would just ignore such weird contigs (in my case, it is contig 173 of the region that causes the crash, the previous ones were fine) and just put something in the log. It seems this kind of event is pretty rare so this should not throw away a lot of data.
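To make the failure mode and the proposed skip-and-log behaviour concrete, here is a minimal sketch. It is not ABRA2's actual ScoredContig code: the class name, method names, and the assumption that the header is split on '_' into five fields (with the score at index 4) are hypothetical and only inferred from the header format and the exception shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; NOT ABRA2's actual ScoredContig implementation.
class ContigHeaderSketch {

    // Assumed number of '_'-separated fields in a header such as
    // ">chr1_59247401_59247801_1_score" (the score is field index 4).
    static final int EXPECTED_FIELDS = 5;

    /** Returns the score field, or null if the header is truncated or malformed. */
    static String scoreField(String header) {
        if (header == null || !header.startsWith(">")) {
            return null;
        }
        String[] fields = header.split("_");
        if (fields.length < EXPECTED_FIELDS) {
            // An unchecked fields[4] here is what would raise
            // ArrayIndexOutOfBoundsException: 4 on a truncated header.
            return null;
        }
        return fields[4];
    }

    /** Keeps well-formed headers and logs the rest instead of throwing. */
    static List<String> keepParsable(List<String> headers) {
        List<String> kept = new ArrayList<>();
        for (String h : headers) {
            if (scoreField(h) != null) {
                kept.add(h);
            } else {
                System.err.println("WARNING skipping malformed contig header: [" + h + "]");
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList(
                ">chr1_59247401_59247801_1_score", // well formed
                ">chr1_59247401_59",               // cut inside the region name
                ">1_121484601_121485001_21");      // cut just before the score
        System.out.println(keepParsable(headers));
    }
}
```

Both truncated headers from this thread yield fewer than five fields, so a length check before indexing is enough to turn the crash into a logged skip.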

I am trying to get a small reproducible example but so far cutting the BAM around the region leading to the error has actually avoided the error entirely...

Thanks for the input. I'll look into adding this option shortly.

Thanks!

A more permanent solution would be great, but I could not get a reproducible example to help you... Actually, the error occurs on a different chromosome at each attempt, so the whole randomness of it makes the search for a small reproducible example illusory. I imagine that these errors have to do with something going wrong during the contig_str build, possibly here

void output_contig(struct contig* contig, int& contig_count, const char* prefix, char* contigs) {

Judging that it is sort of random, could it have to do with the memory usage?

As an alternative workaround, I have split the BAM file into chromosomes and run ABRA2 separately on each one of them, relaunching the ones that fail. I saw on a closed issue that you said it should be fine because the ABRA2's parallelization is at the megabase scale, so different chromosomes should be treated independently anyway. Am I assuming correctly? What happens in case of a read split between 2 chromosomes (e.g., due to a rearrangement)? Is this different in the case of RNA-seq (although STAR junctions seem to be within-chromosome)?

Option --ignore-bad-assembly has been introduced in release 2.15.

Note that I have not tested this as I have not recently encountered this issue. Feedback is appreciated.

Judging that it is sort of random, could it have to do with the memory usage?

This is entirely possible and I will be looking into it.

I saw on a closed issue that you said it should be fine because the ABRA2's parallelization is at the megabase scale, so different chromosomes should be treated independently anyway. Am I assuming correctly?

Yes, this is correct.

What happens in case of a read split between 2 chromosomes (e.g., due to a rearrangement)?

ABRA2 does not currently attempt to deal with structural variants. It is possible though, that the non-clipped portion of the read may be realigned if there are additional variants near the breakpoint.

Is this different in the case of RNA-seq (although STAR junctions seem to be within-chromosome)?

No

I have data that reproduces this problem now and it is clearly memory corruption. I am testing a fix and hope to have a new release including the fix available in the next few days. Thanks for your patience.

mfoll commented

👍

Release 2.17 should resolve this. Please let me know if you continue to see issues.

Closing this. Please feel free to re-open if the problem happens again.

Hi @mozack, I encountered the same memory corruption problem even with the latest version (v2.22).

Please share some details about your failed run.

For example: What is the command line you are running? What kind of data are you running on? Does the problem happen consistently for you across samples? Are you running with gkl enabled? Please also include any other details particular to your run and compute environment.

@mozack I run it with Snakemake in a conda environment on an HPC cluster.
The command line is: abra2 --in {input} --out {output} --ref {params.ref} --threads {threads} --tmpdir {params.dir} > abra_{type}.log
Yes, the problem happens to all my WGS samples every time I run it. But it's fine with small samples.

Please zip and attach a log file.

The abra.log file is actually empty.

13512281.out.gz
@mozack This is the stdout file I got

How did you acquire this version of abra?

I acquired it through conda install:

conda install -c genomedk abra2

Can I do the realignment for each chromosome separately? Could this be a way to solve the memory problem? @mozack

This isn't related to the original issue. The default java heap size used in Conda is too small for your runs and you will need to allocate more RAM. I have not used the Conda installation myself (someone else set this up), but it looks like you can specify more RAM by setting the JAVA_TOOL_OPTIONS environment variable.

i.e.
export JAVA_TOOL_OPTIONS="-Xmx32G"

You may need to experiment to determine the optimal amount of RAM for your samples.
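One way to confirm that the setting actually reached the JVM is to run a tiny program in the same environment and print the maximum heap size. The class below is a hypothetical helper, not part of ABRA2 or its Conda package; it relies only on the standard Runtime.maxMemory() call.

```java
// Hypothetical helper for checking the effective heap limit; not part of ABRA2.
class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

With JAVA_TOOL_OPTIONS exported as above, the reported value should be close to the requested -Xmx. HotSpot JVMs typically also print a "Picked up JAVA_TOOL_OPTIONS" notice at startup, which is another quick sign that the variable was seen.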

If you continue to have trouble, please open a distinct issue.

Thank you very much.