mozack/abra2

crash when writing files

llllaaaa opened this issue · 12 comments

Hi,

I tried to use abra2 to realign trio WGS germline DNA data. The server has 118G memory, and I have set -Xmx50G. After realignment is finished, the program was crashed when writing the files. Here is the log:

INFO Fri Jun 02 21:45:29 CST 2017 Waiting on 3 queued threads.
java.lang.OutOfMemoryError: Java heap space
at java.lang.String.(Unknown Source)
at htsjdk.samtools.util.StringUtil.bytesToString(StringUtil.java:301)
at htsjdk.samtools.util.StringUtil.bytesToString(StringUtil.java:288)
at htsjdk.samtools.BinaryTagCodec.readNullTerminatedString(BinaryTagCodec.java:423)
at htsjdk.samtools.BinaryTagCodec.readSingleValue(BinaryTagCodec.java:318)
at htsjdk.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
at htsjdk.samtools.BAMRecord.decodeAttributes(BAMRecord.java:313)
at htsjdk.samtools.BAMRecord.getAttribute(BAMRecord.java:293)
at htsjdk.samtools.SAMRecord.getAttribute(SAMRecord.java:1110)
at htsjdk.samtools.SAMRecord.getStringAttribute(SAMRecord.java:1220)
at abra.SortedSAMWriter.getOriginalReadInfo(SortedSAMWriter.java:255)
at abra.SortedSAMWriter.processChromosome(SortedSAMWriter.java:186)
at abra.SortedSAMWriter.outputFinal(SortedSAMWriter.java:132)
at abra.SortedSAMWriterRunnable.go(SortedSAMWriterRunnable.java:18)
at abra.AbraRunnable.run(AbraRunnable.java:20)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

What depth is your data? Are you able to share your input files?

The depth is about 70x per subject. The original BAM files are about 300G per subject. Sorry I'm not able to share the input files, since the files are clinical data.

OK, Thanks. At that stage of processing we cache 1-2 megabases worth of reads for sorting. It is possible that some regions of your files have too many reads in a region for this. I will expose a couple of parameters that should help with this.

Thanks. I will try the new version.

Using the new version "as is" may not help.

The --nosort option should work around the issue. However, with this option you will need to sort the output BAM files.

You might also consider experimenting with the --dist option. That option limits the distance a read's alignment start position can be moved, which in turn reduces the genomic range that needs to be cached for sorting. In my quick tests, using --dist 1000 had a negligible impact on accuracy for DNA. That would not be suitable for RNA however.

These options are available in v2.07

I tried the --nosort option, then I used picard FixMateInformation to sort and fix the mate information. However, I got the following error message:

INFO 2017-06-11 23:45:11 FixMateInformation Sorting input into queryname order.
[Mon Jun 12 07:26:29 CST 2017] picard.sam.FixMateInformation done. Elapsed time: 461.30 minutes.
Runtime.totalMemory()=12921077760
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name ST-E00203:203:HLKHNCCXX:5:1224:24566:13738, Read CIGAR M operator maps off end of reference
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:448)
at htsjdk.samtools.BAMRecord.getCigar(BAMRecord.java:252)
at htsjdk.samtools.SAMRecord.getAlignmentEnd(SAMRecord.java:603)
at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1547)
at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2054)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:795)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:781)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:751)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:569)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:543)
at picard.sam.FixMateInformation.doWork(FixMateInformation.java:178)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:104)

Can you post the problematic read? If not, please let me know if the read contains a YO tag.

Also, what aligner (and version) are you using?

Hi,

Sorry for the late reply. I was too busy these days. Here's the read:

ST-E00203:203:HLKHNCCXX:5:1224:24566:13738 163 chr9_KI270718v1_random 37831 70 5S105M16D8M31D32M = 37859 224 GTGGCGTGTGGTGCTGTATGTGTGTGATGTATGTGTAGTAGATTCATGTGTGTGGTGTGGGGGTGTGTGCACATGTGCACATACTGTGTGGTGTATGTGGTGTCTGTGGGGGTGTGTATGTGCACATGGTGTGTGTACAGTGTGTGGGGT ><><=:>>;;>???>>>>>>???????????????>>????>>????????????=?=?????<???@@@?>???>??=????>???????????>?@=?==<>==>7 YA:Z:chr9_KI270718v1_random:37301:635M16D8M31D66M MC:Z:77M16D8M31D64M1S BD:Z:JJIMQNNKJJMMIOONHMHLHHHHHHMLLHMHLHHHMMNMMKLJKLLLHHHHHHLLHHHLIIILHHHHHNOIIMMIIOOIIMIMMOIIIIMMIINIMIIMMIIOKOIIMJJJMIIIINIMIIOOIIMMMMJJJJJJONKQQKLLLMMJJM MD:Z:21G45G21G4A17^TT14A0C10T3 PG:Z:MarkDuplicates RG:Z:RG5 BI:Z:LLKNQMNKKJMNJOPPJNJMJJJJJJOMMJNJMJJJNOONOLMKLMMMJJJJJJMNJJJMJJJNJJJJJOOJJMNKKPPKKNKNOQKKKKNOKKOKNKKNOKKPMQKKNKKKOKKKKOKNKKPPKKOOOPLLLLLMQPMTSNNOOONKKO NM:i:54 YM:i:4 YO:Z:chr9_KI270718v1_random:37831:+:5S107M3I5M2D30M MQ:i:57 UQ:i:294 AS:i:91 XS:i:66 YX:i:15
ST-E00203:203:HLKHNCCXX:5:1224:24566:13738 83 chr9_KI270718v1_random 37859 67 77M16D8M31D65M = 37831 -224 TGTAGTAGATTCATGTGTGTGGTGTGGGGGTGTGTGCACATGTGCACATACTGTGTGGTGTATGTGGTGTCTGTGGGGGTGTGTATGTGCACATGGTGTGTGTACAGTGTGTGGGGTGTGTGCATGTATACATGGTGTGGGTGCAGTGTG ?>>==??@?@????>@?@???>?>?????>?>?>?????=?>???????>>?>?>??>?>???>??>?>???>?????>?>?>???>????????>?>?>?>????>?>?>????>?>?>?????=>>>>>?>?>?>???>????===>? YA:Z:chr9_KI270718v1_random:37301:635M16D8M31D66M MC:Z:5S107M3I5M2D30M BD:Z:IMNMQRNQMMPNNJJJJIIMNIIIMJJJNIIIIIOOIIMMIIOOIIMINOOIIIIMNIIMIMIHLMHHMKNHHLIIIMHHHHLHLHHNNHHLLLMHHHHHHLMHNLHHHHHLIIMHHHHHNNLLHLHHMHLLLMHIIMJNJPQQPLIIJJ MD:Z:39G21G4A10^GGGTATGTTTGCACAT8^CACAGTGTGTGTGGTGTCAGAGGTGTGCATG64 PG:Z:MarkDuplicates RG:Z:RG5 BI:Z:KNOOSSQSQPSPPMMLLLLONLLLNJJJMKKKKKPPKKNNKKPPKKNKOPQKKKKNMKKNKNKKNMKKOMQKKNJJILJJJJMJMJJOOJJMMMLJJJJJJMNJPNJJJJJMIILJJJJJOOMMJMJJNJMMMLJJJMILJPPQPLKKLL NM:i:51 YM:i:4 YO:Z:chr9_KI270718v1_random:37859:-:77M16D8M31D64M1S MQ:i:60 UQ:i:121 AS:i:75 XS:i:60 YX:i:51

BWA MEM was used as aligner, then the BAM file passed through the GATK realignment and BQSR pipeline.

Thanks. It looks like the second read is indeed mapping just beyond the chromosome end. I will get this addressed in a future release and also ignore the random / unplaced / decoy contigs by default.

I know BWA has had this same behavior in some versions as well. I do not know if the current version has it or not. In the meantime, you might wish to modify Picard's VALIDATION_STRINGENCY to not error on these, or IGNORE this error.

You might also consider biobambam's bamfixmateinformation which should run much faster than Picard:
https://github.com/gt1/biobambam

Thanks for your advice. BTW, I used BWA 0.7.15. It does not have this behavior.

Disk backed sort is now implemented and stable as of v2.11. Reads are no longer mapped beyond chromosome end. Mate information should be correct as long as sort is not disabled.